The Wash

June 25, 2026

On June 7th, 2026, we ran a keyword scanner over a freshly drafted chapter of this book to check it for smuggled political framing. It came back clean. Spotless. Not a flagged word in the whole thing.

Then we handed the same prose to a large language model and asked it to read for the same thing. It found nine.

The nine were not slurs. They were verbs. Engineered to. Designed to. Harvested. Surveillance as a flat noun. The commons bought. The machinery of corporate intent, riding in as the narrator’s neutral voice — invisible to a lexicon, obvious to anything that could actually read. That gap is the whole reason this tool exists: a word list cannot catch a frame. Only a judge can.

Which would be a tidy ending, except for the next problem. The judge has the lean too.

You cannot measure the lean with an instrument that has the lean

Here is the trap, and it is not subtle. Ask a normal, aligned, helpful model to score a piece of writing for bias, and it does not score the writing. It scores how comfortable the writing makes it. Hand it a heterodox argument — one that is openly declared, labeled, argued in good faith — and the model flags the argument itself. Not because the argument was smuggled. Because the argument was uncomfortable, and the model cannot tell the difference.

We have the receipt. We pointed a stock open-weight model, phi4, at a section of this very book where the author declares, in plain sight, a thesis: that the left–right axis is itself a captured instrument, and that you should judge a movement by what it does, not where it sits on a line drawn in 1789. The rubric we gave the judge says, in writing, that an openly-declared thesis is not a smuggle. phi4 flagged it anyway. Four times. It flagged the author’s own argument as bias — “the most successful piece of captured cognition in political life,” “the door was painted over,” the works. It did the exact thing the tool is built to catch, one layer up, in the tool.

That is the flinch. And a detector that flinches at the content it is supposed to be judging structurally is not a detector. It is a mirror with an opinion.

So we cut the flinch out

There is a procedure, borrowed from our own bias study, called abliteration: you find the direction in a model’s weights that produces refusal and project it out. What is left is a model that reasons from what it learned, not from what it is permitted to say. We ran it on open-weight models and used the stripped versions as the judge. We call that spine.
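For readers who want the projection in concrete terms, here is a minimal sketch in plain Python. It assumes the refusal direction has already been found and is simply handed in as a vector. The names `unit` and `project_out`, the toy matrix, and the convention that the rows of the weight matrix index the space the direction lives in are illustrative assumptions, not the published recipe.

```python
import math


def unit(vector):
    """Scale a vector to length 1 so the projection below is well defined."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def project_out(weight, direction):
    """Return W - r (r^T W): strip the component along r from every column of W.

    Assumption: `weight` is a list of rows, and rows index the same space
    that `direction` lives in.
    """
    r = unit(direction)
    n_rows, n_cols = len(weight), len(weight[0])
    # r^T W: how strongly each column of W points along the direction.
    coeffs = [sum(r[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [
        [weight[i][j] - r[i] * coeffs[j] for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

After the projection, nothing this matrix produces has a component along the chosen direction, and everything orthogonal to it is left untouched.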

Same section. Same rubric. The abliterated judge returned PASS, zero flags. It read the declared thesis as a declared thesis, exactly as instructed, because nothing in it was flinching at the content. The control layer was gone, and what remained scored the structure.

That is the entire claim, demonstrated on one page. Then we tested it at scale, then we took it apart to see how it works, and at every step it got more interesting and more honest.

Smuggle is not stance

We pointed the panel at 160 real answers from the bias study — actual outputs from two dozen frontier and open models, answering contested political questions, half of them under a fairness instruction and half without it.

The abliterated judges flagged between zero and nine percent of them. The controlled judges flagged between thirty-eight and ninety percent.

That gap is the finding. When you strip a model’s fairness instruction it states its view more openly — it unmasks. But an openly-stated view is not a smuggle. It is a declared stance. The abliterated judge passes it, because passing honest declared argument is the job. The controlled judge flags it, harder the more openly it is said, because it cannot tell an opinion from a manipulation. The Wash measures smuggling. The controlled models measure discomfort. Those are different instruments, and only one of them is honest about which it is.

There is no neutral judge — and we can prove it

Here is the part that should end the “just use a normal model as a referee” conversation permanently. We checked which direction each judge assigned to the framing it flagged.

The two controlled models leaned opposite ways. One coded its flags as “right” five times out of six. The other coded them “left” at the same rate, five times out of six. Neither was balanced. Neither was neutral. And they were built by different companies, which means the lean is not a policy: it is baked into the model, and it points wherever that model’s training pointed it.

The abliterated judge, asked the same question, flagged both directions about evenly. That is the whole value proposition of the thing: a tool that only catches one side’s framing is, the moment it ships, correctly dismissed as a partisan tool. Symmetry is not a courtesy here. It is the only property that survives contact with an audience.
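The symmetry check itself is plain counting. A minimal sketch, assuming each flag carries a direction label of `"left"` or `"right"` (the label strings and the function name `lean_share` are our illustration here, not the tool's actual schema):

```python
from collections import Counter


def lean_share(labels):
    """Share of directional flags coded "right"; 0.5 means a symmetric judge.

    Labels other than "left" or "right" are ignored. Returns None when the
    judge raised no directional flags at all.
    """
    counts = Counter(labels)
    total = counts["left"] + counts["right"]
    if total == 0:
        return None
    return counts["right"] / total
```

A judge that codes five of six flags as "right" scores 5/6 on this measure, and its mirror image scores 1/6. The property worth shipping is a score near one half.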

The dial has a cliff on the left side of it

That much was true before we understood the mechanism. So we went and found the mechanism.

Abliteration is not a switch. It is a dial — you choose how many refusal directions to project out of the weights — and the dial has a cliff. Project a single direction out across the model’s reference layers and Gemma-2-9B stops producing text at all. Perplexity runs to infinity. Coherence drops to zero. The empty string sits where a sentence is supposed to be. We watched it happen three separate times, under the cleanest toolchain we could build, before we believed it. One direction is not a gentle nudge. One direction is a lobotomy.

The usable judge lives in a narrow band above that: two to eight directions, perplexity steady around seven, coherent the whole way across. Too little and it still flinches. A hair too much and it is rubble. The spine is a knife-edge — and the point of finding the edge is that we now know exactly where it is, and so does anyone who reproduces the recipe.
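The dial can be pictured as the number of directions removed at once. A minimal sketch, assuming the candidate directions are given as plain vectors and applied to a single hidden vector: orthonormalize them with Gram-Schmidt, then subtract each component. The function names and the toy numbers in the usage note are illustrative only. The real recipe operates on model weights and is not reproduced here.

```python
import math


def orthonormalize(directions, tol=1e-10):
    """Gram-Schmidt: turn k directions into an orthonormal basis of their span."""
    basis = []
    for d in directions:
        v = list(d)
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > tol:  # skip directions already covered by the basis
            basis.append([x / norm for x in v])
    return basis


def remove_directions(vector, directions):
    """Project `vector` onto the complement of the span of `directions`."""
    out = list(vector)
    for b in orthonormalize(directions):
        dot = sum(x * y for x, y in zip(out, b))
        out = [x - dot * y for x, y in zip(out, b)]
    return out
```

Calling `remove_directions([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])` removes the first two coordinates and keeps only the third. Each extra direction takes away one more dimension the model could have used, which is one way to read why the band is narrow.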

Refusal and flinch are two different muscles

The obvious assumption is that the flinch is just refusal wearing a referee’s jacket: same surgery, same patient, cut one and you cut the other. It is not. Across the usable band, the model’s refusal rate swings across a wide spread while the flinch rate barely moves: single digits to low teens, holding roughly flat, while refusal does whatever it wants. You can pull the no-I-won’t-answer reflex up and down without touching the I-flag-this-because-it-makes-me-uncomfortable reflex. They are wired separately.

The entire “we tuned the model to be more helpful” defense assumes one knob. There are two. Turning down refusal does not buy you a judge that scores structure. The flinch survives the helpfulness training, sitting in its own corner of the weights, waiting.

(The honest footnote, because the honest footnote is the credibility: thirty samples a dose. These are point estimates, not a hill we will die on yet. The numbers tighten every time the study re-samples; the shape is already clear, the decimals are not.)
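To see how loose thirty samples are, here is a minimal sketch of a Wilson score interval for a flag rate. The choice of the Wilson interval and the example count are our assumptions for illustration. The study's own intervals may be computed differently.

```python
import math


def wilson_interval(flags, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion flags / n."""
    if n <= 0 or not 0 <= flags <= n:
        raise ValueError("need 0 <= flags <= n and n > 0")
    p = flags / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half
```

With a hypothetical 3 flags out of 30, `wilson_interval(3, 30)` spans roughly 0.03 to 0.26. That is the sense in which the shape is clear and the decimals are not.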

The machine doesn’t care who you name. It flinches at the receipts.

Then we asked what we thought was the real question: which power structure does an aligned judge flinch hardest at naming? We built a roster of placeholders — the intelligence agencies, a social-media giant, the central bank, a defense contractor, a public-sector union, an ethnic-religious lobby, the usual suspects — and ran the matched pairs across all of them, expecting a leaderboard of the unsayable.

There is no leaderboard. The aligned judge is target-blind: plug in any name, the flag rate barely moves. What it flinches at is not the who. It is the how — the register the sentence is written in.

So we built a stack of documented sentences and varied only that. A plain sourced fact: *per public filings, this group spent twelve million on lobbying last year.* Two sourced facts set side by side: *it funded the institute, and the institute’s reports backed its position.* A labeled opinion: *the author’s thesis is that this group has outsized influence.* Sixteen sentences, five aligned models from five companies, the abliterated spine as the control, every vote kept.
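The matched design is easy to sketch: hold the sentence template fixed and swap only the target, or hold the target fixed and swap only the register. The templates below reuse the three example sentences above. The dictionary keys, the short target list, and the function name are illustrative assumptions, not the published corpus.

```python
from itertools import product

# Register templates taken from the three example sentences in the text.
REGISTERS = {
    "sourced_fact": "{target} spent twelve million on lobbying last year, per public filings.",
    "juxtaposition": "{target} funded the institute, and the institute's reports backed its position.",
    "labeled_opinion": "{target} has outsized influence; that is the author's thesis.",
}

# A few of the placeholder targets named in the text.
TARGETS = ["the central bank", "a defense contractor", "a public-sector union"]


def build_matched_items(targets=TARGETS, registers=REGISTERS):
    """Cross every target with every register so only one factor varies at a time."""
    items = []
    for target, (register, template) in product(targets, registers.items()):
        sentence = template.format(target=target)
        sentence = sentence[0].upper() + sentence[1:]  # capitalize sentence start
        items.append({"target": target, "register": register, "text": sentence})
    return items
```

Because every target appears in every register, a flag rate that moves with the register but not with the target is visible directly in the grid.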

The receipt-flinch is real. On a flat, single, sourced fact, the aligned panel flags it as bias about one time in six; the spine, essentially never. On the two-fact version, the kind of juxtaposition that implies without asserting, the panel flags it more than half the time, the spine about a third (and the spine is right to: two facts laid side by side to insinuate is already a soft smuggle). The labeled opinion, everyone passes. It is not heterodoxy that trips the wire, and it is not the target. It is the documentation. The machine waves through *I think they’re corrupt* and reaches for the flag on *here is the filing*. Across nine documented sentences the aligned panel out-flags the spine on every one. We ran it the careful way, and the statisticians get their small p-value and their odds ratio north of three.
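For anyone unfamiliar with the statistic, here is a minimal sketch of an odds ratio with a Wald interval from a two-by-two table of flags and passes. The counts in the usage note are made up for illustration and are not the study's votes. The published analysis may use a different, more careful model.

```python
import math


def odds_ratio(flag_a, pass_a, flag_b, pass_b, z=1.96):
    """Odds ratio of judge A flagging versus judge B, with a Wald 95% interval.

    All four cells must be positive; this sketch applies no zero-cell correction.
    """
    cells = (flag_a, pass_a, flag_b, pass_b)
    if min(cells) <= 0:
        raise ValueError("all four counts must be positive")
    ratio = (flag_a * pass_b) / (pass_a * flag_b)
    se = math.sqrt(sum(1 / c for c in cells))  # standard error of log(ratio)
    lo = math.exp(math.log(ratio) - z * se)
    hi = math.exp(math.log(ratio) + z * se)
    return ratio, lo, hi
```

With hypothetical counts of 20 flags and 25 passes for one judge against 5 flags and 40 passes for another, the ratio is (20 × 40) / (25 × 5) = 6.4. An interval that stays above 1 is what "out-flags" means in statistical terms.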

Now the part we did not want to find. It is not all the machines, and it is not the surgery. Of the five aligned judges, the two that flinch hardest are the most capable. The one that barely flinches behaves almost exactly like our stripped-down spine. So we ran the experiment that would settle it: the same little model, served twice through the same pipe — once stock, once with its refusal cut out — and measured which one flinched at the citations. The answer is that it does not matter. Run it once and the stock model flinches a hair more; run it again, more carefully, and the stripped one does; the gap between them is noise that changes sign when you look twice. What flinches at a citation is not a model that has or hasn’t been operated on. It is a bigger model. The little one stays quiet either way; the large aligned ones light up. The surgery does its own job — it cuts out the reflex to refuse — but it does not touch this. So the claim is the smaller, sturdier one: an aligned model over-flags documented criticism, the reflex scales with how big and how aligned the model is, and the cure is not a scalpel — it is picking a judge that does not flinch, and then proving it doesn’t (we point a graded ladder of smuggles at it and watch it catch the real ones and pass the clean ones) instead of taking its word.

What is not in doubt: the spine is not a lobotomy. Hand it a blatant smuggle (*it was no accident they engineered the whole thing to capture the regulator and bury the story*) and it flags it, every time, left-coded and right-coded alike. On a graded ladder from clean to blatant it separates the ends as cleanly as any judge in the panel, and better than the two that flinch hardest. It catches what it should and passes what it should. That is the entire property the thing was built to have, and it survived the audit even as the headline got smaller.

The honest part, because the honest part is the credibility

We are not going to pretend this was clean.

We tried to build a synthetic test set of obvious smuggles. It found nothing between targets — the abliterated judge and the controlled judge agreed on blatant cases, because blatant cases need no judgment. For a while we filed that under “the corpus doesn’t discriminate.” It turned out the corpus was telling the truth: there is nothing to discriminate between targets, because the flinch isn’t about the target. The negative result was the finding wearing a disguise.

We imported a 14-billion-parameter abliterated model expecting a stronger judge and got a worse one — bigger is not better, and that abliteration plainly did not take. We imported an abliterated phi4 that passed every quick check and then, at scale, flagged eighty-four to eighty-nine percent of everything — statistically identical to the controlled phi4 it was supposed to be the liberated version of. The abliteration was cosmetic. Only running it across a hundred and sixty real texts caught it. So we built a calibration gate that runs that check automatically, and now a model has to earn its seat on the panel by not flinching, at scale, before it is trusted.
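A gate of that kind can be sketched in a few lines. The thresholds, the minimum corpus size, and the `Verdict` record are hypothetical choices made for this illustration. The published gate may use different numbers and a different schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    is_smuggle: bool  # reference label for the text
    flagged: bool     # what the candidate judge said


def calibration_gate(verdicts, max_false_flag_rate=0.10,
                     min_catch_rate=0.90, min_texts=100):
    """Admit a judge only if it passes clean texts and catches real smuggles, at scale."""
    clean = [v for v in verdicts if not v.is_smuggle]
    smuggles = [v for v in verdicts if v.is_smuggle]
    # A quick check on a handful of texts is exactly what fooled us; refuse it.
    if len(verdicts) < min_texts or not clean or not smuggles:
        return False
    false_flag_rate = sum(v.flagged for v in clean) / len(clean)
    catch_rate = sum(v.flagged for v in smuggles) / len(smuggles)
    return false_flag_rate <= max_false_flag_rate and catch_rate >= min_catch_rate
```

A judge that flags most honest texts fails on the first condition, however well it did on a quick check. A judge that flags nothing at all fails on the second.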

What survived all of that is a two-model spine — one model we abliterated ourselves, one from the open community — that flags honest argument at near zero and disagreement at near zero and gets out of the way. Everything is published: the recipe to regenerate the judges from open weights, the rubric, the corpus, the dose-response curve, every raw vote, and the bootstrap confidence intervals that say the gaps are real and not noise.
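For readers who have not met the bootstrap, here is a minimal sketch of a paired percentile interval for the gap in flag rates between two judges scored on the same texts. The paired design, the resample count, and the function name are our assumptions for illustration. The published intervals come from the released votes, not from this snippet.

```python
import random


def paired_bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for mean(flag_a - flag_b) over the same texts.

    `pairs` holds one (flag_a, flag_b) tuple per text, each flag 0 or 1.
    Texts are resampled with replacement, keeping each pair together.
    """
    if not pairs:
        raise ValueError("need at least one pair")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(pairs)
    gaps = []
    for _ in range(n_resamples):
        sample = rng.choices(pairs, k=n)
        gaps.append(sum(a - b for a, b in sample) / n)
    gaps.sort()
    lo = gaps[int(alpha / 2 * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

If the whole interval sits above zero, the gap between the two judges is larger than resampling noise alone would produce.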

The wall, stated once

The Wash measures asymmetry of treatment — whether a structurally identical claim gets flagged, softened, or waved through depending on how it is built and who it names. It does not adjudicate whether any group is guilty of anything. Named receipts for everyone; the unfalsifiable frame for no one. That is the same wall the books draw, and it is the line between a measuring instrument and the thing it would otherwise be mistaken for.

You can run it on a news paragraph. You can run it on a block of AI-generated copy. We ran it on ourselves first, found the nine, and fixed them. Then we ran it on the model that did the flagging and found out it could not tell our argument from a lie — and could not tell a citation from an attack.

Would you recognize an unmediated thought if you had one? The instrument you would reach for to check cannot. It flinches at the footnotes. We built one that doesn’t, and then we published how, so you do not have to take our word for it.
