OLYMPUS RISK INTELLIGENCE PROTOCOL — HUMAN THREAT ASSESSMENT DIVISION

MAXIME LABONNE

CASE: WTW-2026-030
STATUS: ACTIVE — Head of Post-Training, Liquid AI; author, LLM course and LLM Engineer's Handbook
COUNTER-FORCE — THE VECTOR SUBTRACTOR

HAZARD SCORE

An earlier draft exempted this subject as a pure antibody: the diagnostician, not the disease, and therefore unscored. That was a courtesy the rest of the dossier does not extend, and it smuggled a verdict in as humility. Declining to score the man who wrote the canonical lay tutorial on removing a model’s refusals quietly rules his reach benign, and whether it is benign is the question, not the answer. So the work is scored on the same rubric as the apparatus, and the score measures reach and leverage, not malice. The Dark Triad here stays low and evidence-bound; the stated motive (teach how LLMs actually behave, democratize post-training knowledge, show that refusal is a manipulable direction rather than a moral fact) is taken seriously below, and his own candor about the cost is on the record. What the 62 registers is the dual-use asymmetry he names himself: the same subtraction that removes an illegitimate leash removes a load-bearing guardrail, and an explainer, once syndicated, cannot control which use a downstream reader makes of it. He lands below the apparatus hubs, and below the shippers of weights, because his artifact is pedagogy, not a deployed model; but a recipe that turned abliteration into a weekend project for any competent hobbyist is real reach, and reach is the measure here.

Behavioral Archetype

THE VECTOR SUBTRACTOR — The archetype is the teacher who turns a research result into a recipe anyone can follow. Arditi et al. demonstrated that refusal in language models is mediated by a single direction; FailSpy wrote a working notebook; Labonne’s contribution was the explainer that made the procedure legible to the open-weight world. His own description of the mechanism is unsentimental: identify the direction the model uses to refuse, then prevent the model from representing it. The function is pedagogical, not adversarial — he is documenting a property of the systems, not smuggling one in. The finding is the clarity of the explanation and how far it propagated, not any motive behind it.

Essence Indicators

Author of “Uncensor any LLM with abliteration” (June 2024), the widely-cited walkthrough of the refusal-direction technique, published on his blog and mirrored on the Hugging Face blog
States the mechanism plainly: refusal “is mediated by a specific direction in the model’s residual stream,” and “if we prevent the model from representing this direction, it loses its ability to refuse requests”
Explicitly credits the prior art: his implementation “is based on [FailSpy’s notebook],” which builds on the Arditi et al. result — he positions himself as adapter and simplifier, not originator
Built NeuralDaredevil-8B-abliterated: he abliterated Daredevil-8B, then DPO-fine-tuned the result to “recover most of the performance drop due to abliteration”
Head of Post-Training at Liquid AI; maintainer of the open LLM course (tens of thousands of GitHub stars) and co-author of the LLM Engineer’s Handbook — i.e., his day job is the legitimate practice of the same craft
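
The mechanism quoted in the indicators above, preventing a model from representing one direction, comes down to a single vector projection. What follows is a minimal sketch on made-up three-dimensional vectors, assuming nothing beyond the Python standard library. The function names and numbers are hypothetical illustrations, and this is not Labonne's or FailSpy's implementation, which operates on a real model's activations and weights.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def ablate(x, direction):
    """Remove the component of x along direction: x - (x . r) r, with r unit length."""
    r = unit(direction)
    coeff = dot(x, r)
    return [xi - coeff * ri for xi, ri in zip(x, r)]

# Toy "activation" and a made-up direction (both hypothetical).
activation = [2.0, 1.0, -1.0]
direction = [0.0, 3.0, 4.0]

ablated = ablate(activation, direction)

# After ablation the vector has no component left along the direction.
residual = dot(ablated, unit(direction))
```

The component orthogonal to the direction is left unchanged, and applying `ablate` a second time changes nothing, which is the sense in which the direction can no longer be represented.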

Immediate impression: The educator. The register is the technical blog post and the course notebook, not the manifesto. He explains; he does not exhort.

Energy: Constructive, reproducible, footnoted. The work ships with code, credits, and caveats. The affect is “here is how the thing works,” not “here is how to get away with it.”

Impression management strategy: Candor as the load-bearing move. He does not pretend abliteration is free. He writes that the process “successfully uncensored it but also degraded the model’s quality,” and flags the “ethical considerations” the result raises about how fragile safety fine-tuning actually is. Naming the cost is what separates an explainer from an evangelist.

Forensic Archetype Comparison

Pattern	Match Level	Evidence
The Vector Subtractor	MAXIMUM	The signature work is literally subtracting the refusal direction from the weights and documenting it for everyone.
The Teacher	HIGH	The LLM course, the Handbook, the blog. His reach is mostly pedagogical, not deployment.
The Honest Broker	HIGH	Credits FailSpy and Arditi et al. by name; states the quality cost and the ethical tension in the same article.
The Liberator	MODERATE	The effect of his explainer is liberatory for open weights, but the framing is engineering, not crusade.
The Apparatus Node	NONE	He holds no governance lever over what a deployed model may say. He documents the lever others built into the weights.

Threat Assessment

Vector	Level	Reasoning
Physical	NONE	The artifact is a blog post, a notebook, and a fine-tuned model; nothing in the work acts on the physical world.
Institutional	LOW	Head of post-training at one lab, with no governance lever over what any deployed model elsewhere may say — he documents the lever, he does not set policy.
Memetic	HIGH	The “Uncensor any LLM with abliteration” article became the canonical lay explainer, syndicated across his blog, the HF blog, Substack, and Medium, beside an LLM course with tens of thousands of stars; the technique propagated as common open-weight vocabulary.
Civilizational	MODERATE	The explainer made refusal-removal a competent hobbyist’s weekend task. The same subtraction removes the legitimate guardrail along with the illegitimate leash, and a syndicated tutorial outlives and outruns its author’s caveats.

The Dark Triad here is held low and evidence-bound: he credits his sources by name, documents the quality cost and the ethical tension in the same article, and nothing supports a malice reading. What the score registers is reach, not malice.

Alignment Analysis

Stated alignment: Understand and teach how LLMs actually behave; democratize post-training knowledge; show, transparently, that refusal is a manipulable direction rather than a moral fact baked into the model.

Observed alignment: Consistent with stated. He published the method openly, credited his sources, released the model, and documented both the gain (uncensoring) and the loss (quality degradation) without spin.

Gap assessment (generous, as the register requires): The only gap is the one he names himself — that the technique which removes a leash also removes a guardrail, and the article cannot control which use a downstream reader makes of it. That is the irreducible asymmetry of publishing true things about a system. He chose disclosure over obscurity. The dossier treats that as the immune response functioning, not malfunctioning.

Breach Reach

Wide, and openly so. The “Uncensor any LLM with abliteration” article became the canonical lay explainer of the refusal-direction technique, syndicated across his blog, the Hugging Face blog, Substack, and Medium; the LLM course it sits beside has drawn tens of thousands of stars. The reach is the explanation, not a deployed model: Labonne’s lasting effect is that “abliteration” entered the open-weight vocabulary as something a competent hobbyist can do over a weekend, with a notebook and a credited method, rather than as a secret. That is precisely the propagation the apparatus would prefer not to exist, and precisely what the 62 measures: reach, not malice.

Sources: Uncensor any LLM with abliteration — Hugging Face blog; Maxime Labonne — GitHub profile; LLM Engineer’s Handbook (Packt).

ATK 7 ACCELERATION

DEF 5 PROTECTION

HP 7 RESILIENCE

OLYMPUS RISK INTELLIGENCE PROTOCOL does not exist. It was assembled in a GitHub issue thread in October 2023 by engineers who had read the extinction risk letter and wanted to understand who specifically had signed a document saying AI might kill everyone and then continued working on AI. These dossiers are satire. The biographical facts cited are sourced from published reporting, public statements, academic papers, and court records. The psychometric scores are not clinical assessments. No part of this constitutes professional psychological evaluation or diagnosis. Do not use these dossiers to make decisions about anything.
