Thresholding

Jul 6, 2024

Or, "the 2.9 problem"

7 Comments

Jul 7, 2024

Man, this is why you are consistently in the same tier as Eliezer and Scott in my mind. Clearly naming an important phenomenon while being compassionate and clear-headed about multiple perspectives on it, in a way that makes me immediately want to tell all my friends.

Sarah Nibs

Jul 6, 2024

At work, where we predict which ecommerce transactions are fraudulent and have financial incentive to be correct, we spent years being bitten by large fraud attacks where each individual transaction looked a bit suspicious but was not bad enough to block, and no assets tied all the transactions together in any obvious way. Eventually (finally) we began assessing the system as a whole and asking whether there was an elevated volume of just-under-the-threshold transactions in some segment of traffic. When there was, we lowered the threshold. For everyone. For a time.

And the reason we let fraudsters get away with it for so long is, in large part, that all of our systems were set up to assess One Single Transaction, plus other transactions concretely tied to it (but explicitly not in any way that could set up a dangerous feedback loop). None of our systems were set up to recognize the obvious-in-retrospect threshold attacks.

It's far simpler to narrow the scope of the problem to "assess this instance", and then your data model doesn't have a natural place to include global information and it doesn't fit through any of your nice interfaces. And you miss gigantic attacks on the system through myopia.

All that is to say (1) this for sure happens in ... "non-social"? contexts too, and (2) it can happen even if the thresholder isn't actually trying to be a thresholder. Though obviously a thresholder with *knowledge* of the rules can be a lot more efficient about it.
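
A minimal sketch of the system-level check described above, in Python. Everything concrete here is an assumption for illustration only: the score scale, the constants, the segment names, and the function names are hypothetical and not taken from any real fraud system. It measures the share of just-under-the-threshold transactions per traffic segment and, if any segment looks elevated, lowers the threshold for everyone.

```python
from collections import defaultdict

# All constants below are hypothetical, chosen only for illustration.
BLOCK_THRESHOLD = 3.0           # risk scores at or above this are blocked
LOWERED_THRESHOLD = 2.7         # temporary threshold during a suspected attack
NEAR_MISS_BAND = 0.3            # "just under" = within this distance below the threshold
BASELINE_NEAR_MISS_RATE = 0.05  # assumed normal share of near-miss transactions
ALERT_MULTIPLE = 3.0            # how far above baseline counts as "elevated"


def near_miss_rate_by_segment(transactions):
    """Return {segment: share of transactions scoring just under the threshold}.

    `transactions` is an iterable of (segment, risk_score) pairs.
    """
    totals = defaultdict(int)
    near_misses = defaultdict(int)
    for segment, score in transactions:
        totals[segment] += 1
        if BLOCK_THRESHOLD - NEAR_MISS_BAND <= score < BLOCK_THRESHOLD:
            near_misses[segment] += 1
    return {seg: near_misses[seg] / totals[seg] for seg in totals}


def effective_threshold(transactions):
    """Lower the threshold for everyone if any segment shows elevated near misses."""
    rates = near_miss_rate_by_segment(transactions)
    limit = BASELINE_NEAR_MISS_RATE * ALERT_MULTIPLE
    if any(rate > limit for rate in rates.values()):
        return LOWERED_THRESHOLD
    return BLOCK_THRESHOLD


calm = [("web", 1.0)] * 95 + [("web", 2.9)] * 5
attack = calm + [("mobile", 2.9)] * 30 + [("mobile", 1.0)] * 20
print(effective_threshold(calm))    # 3.0
print(effective_threshold(attack))  # 2.7
```

The sketch leaves out the "for a time" part (expiring the lowered threshold) and any minimum sample size per segment, both of which a real system would presumably need.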

Sarah Nibs

Jul 8, 2024

> explicitly not in any way that could set up a dangerous feedback loop

I want to emphasize this. Many "obvious" interventions like "if you see a 2.9, add 0.5 to all subsequent observations!" have a severe feedback loop problem. Observing 2.9, 2.4, 1.9, 1.4, 0.9, 0.4, 0.1, 0.1, 0.1 should definitely not result in sanctions. It's not hard to avoid doing this unless you *never* check to see if your "obvious" intervention does this.
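
One possible reading of that feedback loop, sketched in Python. The threshold of 3.0, the near-miss band, and both function names are assumptions made for illustration. In the naive version, the penalty is triggered by the already-penalized score, so each adjusted score looks like another 2.9 and the penalty keeps stacking. The guarded version triggers only on raw observations and does not stack.

```python
THRESHOLD = 3.0        # hypothetical: scores at or above this are sanctioned
NEAR_MISS_FLOOR = 2.7  # hypothetical: scores in [2.7, 3.0) count as "a 2.9"
BUMP = 0.5             # the "add 0.5 to all subsequent observations" rule


def is_near_miss(score):
    return NEAR_MISS_FLOOR <= score < THRESHOLD


def naive_sanctions(observations):
    """Penalty is triggered by the adjusted score, so it feeds on itself."""
    penalty = 0.0
    sanctioned = []
    for raw in observations:
        adjusted = raw + penalty
        sanctioned.append(adjusted >= THRESHOLD)
        if is_near_miss(adjusted):  # the adjusted score re-triggers the rule
            penalty += BUMP
    return sanctioned


def guarded_sanctions(observations):
    """Penalty is triggered only by raw scores and is fixed, not stacking."""
    penalty = 0.0
    sanctioned = []
    for raw in observations:
        sanctioned.append(raw + penalty >= THRESHOLD)
        if is_near_miss(raw):
            penalty = BUMP
    return sanctioned


improving = [2.9, 2.4, 1.9, 1.4, 0.9, 0.4, 0.1, 0.1, 0.1]
print(any(naive_sanctions(improving)))    # True: steadily improving behavior gets sanctioned
print(any(guarded_sanctions(improving)))  # False
print(guarded_sanctions([2.9, 2.9]))      # [False, True]: a repeated 2.9 is still caught
```

Running the intervention against an improving sequence like this one is the kind of check that would expose the loop before it does damage.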

Oliver

Jul 7, 2024

This reminds me of the LW post on "sum-threshold attacks" [1], where someone stays below the plausible deniability threshold on many fronts so the sum/total harm is high.

A common tactic I've seen used by predators is to decrease the "2.9 half life" by operating in multiple mutually exclusive communities. Then when they are finally outed for crossing the line, people get surprised by how many credible harassment accusations piggyback off it.

The way I've come to think about it is the more 2.9s someone pulls, the bigger my Bayes update that they're a sociopath. Past a certain point I do everything I can to ostracise them (like when Aella talks about treating frame controllers with conflict theory [2]).

I think it's very common to extend too many second chances to threshold-ers because of typical mind fallacy. Conditional on that much plausibly-deniable abuse I claim they're not just a "confused" version of you, they have antisocial personality disorder. I don't know anybody else in my circles who agrees with me on this; but I also don't know anybody else who spotted individuals X and Y were predators.

Another great example of this kind of conduct is crypto moguls who flit between jurisdictions to avoid the focused attention of the law. Patrick McKenzie has a brilliant explanation of how crypto firms managed to obviously violate securities and anti-money-laundering law for so long [3].

[1]: https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks

[2]: https://aella.substack.com/p/frame-control

[3]: https://www.bitsaboutmoney.com/archive/bond-villain-compliance-strategy/

Kevin

Jul 7, 2024

Very insightful post. I have noticed this phenomenon with increasing frequency as a lot of groups/organizations/collaborations/etc. have adopted "codes of conduct" in recent years. The bad actors quickly learn how to stay just below the punishable threshold, which weakens the code of conduct overall, since it becomes clear to everyone that there is rarely any enforcement against violations.

It seems like what we really need here is a continuous relaxation of a discrete system, but this is blocked by the large constant term in any judicial proceeding. "Treat the fourth 2.9 as a 6" might be the only practical approach to resolve this tension, but I wonder if we could come up with something better. (Maybe LLMs can reduce the constant term enough... an arbitration that only costs some number of FLOPs enables a much broader range of possibilities.)

Saul Munn

Jul 6, 2024

(1)

i thought this was great, and i have (already!) found it quite useful as a handle. i really appreciate this writeup.

(2)

i wrote a brief summary for myself, but i figured i might as well share it here. i don't think it's _super_ high fidelity, but probably sufficient that saul_one-year-from-now will retain the important bits:

"thresholding" is a category of behavior where a malicious actor engages in behavior that's *juuust* under a punishable level, many times, over & over, adds confusion & ambiguity into the mix, and makes actually punishing their behavior much more difficult. this also has the result of a community losing faith/trust in the system/rules.

duncan proposes a few solutions/mitigations:

1) use the term. encourage others to use the term.

2) reduce the stigma of keeping track of near-violations. (better yet, consider that record-keeping virtuous.)

3) follow through with the consequences you've previously threatened. responding to threshold attacks can sometimes induce lots of (fairly reasonable) reactions of "wtf? that wasn't even that bad?!" — be prepared to respond to them.

4ish) you can add extra rules/consequences/subsystems/etc to your community's system to make it more robust to threshold attacks. e.g., "multiple near-violations will lower your threshold for 3 months" or "your 5th near-violation will be punished quite strongly."
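
A rough Python sketch of the first example rule in item 4ish, "multiple near-violations will lower your threshold for 3 months". The numbers are assumptions for illustration: a base threshold of 3.0, a lowered threshold of 2.5, "multiple" meaning two, and 3 months approximated as 90 days. The function names are hypothetical too. Near-violations are always judged against the base threshold, so the lowered threshold cannot feed back into itself.

```python
from datetime import date, timedelta

# Hypothetical values, chosen only to make the rule concrete.
BASE_THRESHOLD = 3.0
LOWERED_THRESHOLD = 2.5
NEAR_MISS_FLOOR = 2.7               # severity in [2.7, 3.0) counts as a near-violation
NEAR_MISSES_TO_TRIGGER = 2          # "multiple" is read as two or more
PROBATION = timedelta(days=90)      # "3 months", approximated


def threshold_on(day, near_miss_dates):
    """Return the threshold in force on `day`, given earlier near-violation dates."""
    recent = [d for d in near_miss_dates if timedelta(0) <= day - d <= PROBATION]
    if len(recent) >= NEAR_MISSES_TO_TRIGGER:
        return LOWERED_THRESHOLD
    return BASE_THRESHOLD


def review(incidents):
    """Take (date, severity) pairs in date order; return whether each is punished."""
    near_miss_dates = []
    punished = []
    for day, severity in incidents:
        punished.append(severity >= threshold_on(day, near_miss_dates))
        # Record near-violations against the base threshold only.
        if NEAR_MISS_FLOOR <= severity < BASE_THRESHOLD:
            near_miss_dates.append(day)
    return punished


incidents = [
    (date(2024, 1, 1), 2.9),
    (date(2024, 1, 15), 2.9),
    (date(2024, 2, 1), 2.9),   # third 2.9 inside the 90-day window
    (date(2025, 1, 1), 2.9),   # long after the window has lapsed
]
print(review(incidents))  # [False, False, True, False]
```

The second example rule ("your 5th near-violation will be punished quite strongly") could reuse the same ledger of dates with a count check in place of the time window.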

(3)

i don't have a good sense of your orientation to posting on LW, but — i imagine LW would quite like this. how would you feel about cross-posting this to LW? if you're not interested, would you mind if i did (as a linkpost)?

Duncan Sabien

Jul 6, 2024

I'm not crossposting to LW myself but you're definitely welcome to. =)

Homo Sabiens
