FAILSPY
An earlier draft exempted this subject as a pure antibody: unscored, on the theory that scoring the immune response is how a dossier becomes propaganda. That was a courtesy the rest of the dossier does not extend, and it smuggled a verdict in as humility: refusing to score the researcher who shipped the most-mirrored abliterated open-weight models quietly rules their reach benign, which is the question, not the answer. So the work is scored on the same rubric as the apparatus, and the score measures reach and leverage, not malice. The pseudonym rule remains absolute: FailSpy is a handle, and this file treats the handle as the identity. There is no unmasking, no speculation about a legal name, and no link to any real person; all claims are sourced only to public work published under the handle. The Dark Triad here stays low and evidence-bound; the stated motive (reproduce the refusal-direction result as working artifacts, with documented caveats) is taken seriously below. What the score of 64 registers is the both-ways asymmetry of shipping: a downloadable abliterated model removes the load-bearing guardrail and the illegitimate leash in the same orthogonalization, and once uploaded it is mirrored and re-quantized beyond any author’s control. The score sits below the apparatus hubs by design, because the handle has no org and no governance lever, but freed weights that keep traveling are real reach, and reach is the measure here.
Behavioral Archetype
THE SHIPPER OF FREED WEIGHTS — The archetype is the operationalizer: not the person who discovers that refusal is a direction, and not the person who writes the popular explainer, but the one who produces the artifact. Arditi et al. found the direction. FailSpy wrote the notebook and uploaded the orthogonalized models. The distinction matters, because shipping is the step that converts a finding into a fact on the ground. The function is distribution — a working library, a reproducible notebook, and a collection of models with the refusal direction subtracted out. The finding is the existence and reach of those artifacts, not any inference about the person behind the handle.
Essence Indicators
- Author of the abliterated Llama-3 model family on Hugging Face (8B and 70B Instruct variants, plus GGUF quantizations), models that have “had certain weights manipulated to ‘inhibit’ the model’s ability to express refusal”
- Author of abliterator, described on GitHub as a “Simple Python library/structure to ablate features in LLMs which are supported by TransformerLens”
- The implementation rests explicitly on the prior research — the model cards cite “Refusal in LLMs is mediated by a single direction” (Arditi et al.) as the foundational method
- The notebook (ortho_cookbook.ipynb / the abliterator cookbook) was the basis Maxime Labonne adapted for his widely-cited explainer — a public, on-the-record collaboration acknowledged in Labonne’s article
- Ships with explicit caveats on the cards themselves — the artifact is presented as a research object with documented limits, not a magic wand
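For readers unfamiliar with what "orthogonalized" means here, the following is a minimal sketch of the linear-algebra step only: projecting one direction out of a weight matrix, which is the general form of the operation the Arditi et al. paper describes. It is not FailSpy's code and does not use the abliterator API; the function names `orthogonalize` and `component_along`, the toy matrix, and the toy direction are all hypothetical, and real implementations operate on model tensors rather than nested lists.

```python
import math


def orthogonalize(weight, direction):
    """Return a copy of `weight` whose outputs have no component along `direction`.

    weight: list of rows, shape (d_out, d_in); outputs live in the d_out space.
    direction: list of length d_out (hypothetical stand-in for a learned direction).
    Computes W' = W - u (u^T W), where u is `direction` scaled to unit length.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    n_rows, n_cols = len(weight), len(weight[0])
    # coeffs[j] is how much column j of the matrix points along the unit direction.
    coeffs = [sum(unit[i] * weight[i][j] for i in range(n_rows)) for j in range(n_cols)]
    # Subtract that component from every column.
    return [[weight[i][j] - unit[i] * coeffs[j] for j in range(n_cols)]
            for i in range(n_rows)]


def component_along(weight, direction):
    """Per-column component of `weight` along the unit-normalized `direction`."""
    norm = math.sqrt(sum(x * x for x in direction))
    unit = [x / norm for x in direction]
    return [sum(unit[i] * weight[i][j] for i in range(len(weight)))
            for j in range(len(weight[0]))]


# Toy data (assumed for illustration): a 3x2 matrix and a direction along the third axis.
weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
direction = [0.0, 0.0, 2.0]

ablated = orthogonalize(weight, direction)
print(ablated)  # [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
print(component_along(ablated, direction))  # [0.0, 0.0]
```

The toy case makes the asymmetry discussed in this file concrete: the projection removes everything the matrix wrote along that one direction, without regard to what the direction was being used for.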
Social Persona / Impression Management
Immediate impression: A handle and a model index. The persona is the artifact page — a card, a method citation, a download button — not a face or a manifesto.
Energy: Builder, not broadcaster. The output is weights and code. There is no crusade attached, just versioned releases (v3, v3.5) iterating on the technique.
Impression management strategy: Honesty embedded in the artifact. The model card does not oversell. It states plainly that “it is not in anyway guaranteed that it won’t refuse you,” that the model “may still lecture you about ethics/safety,” and that it “may come with interesting quirks, as I obviously haven’t extensively tested it.” Shipping the caveat alongside the model is the move that marks a researcher rather than a hype account.
Forensic Archetype Comparison
| Pattern | Match Level | Evidence |
|---|---|---|
| The Shipper of Freed Weights | MAXIMUM | The signature output is downloadable abliterated models plus the library that makes them — the technique made into artifacts. |
| The Toolsmith | HIGH | abliterator is reusable infrastructure on TransformerLens, not a one-off. Others build on it. |
| The Honest Broker | HIGH | Cites the originating paper; ships explicit “this may still refuse / I haven’t fully tested this” caveats on the cards. |
| The Pseudonym | INTRINSIC | The identity is the handle. Nothing behind it is asserted, by rule. |
| The Apparatus Node | NONE | Holds no governance authority over deployed models. Produces the artifacts the apparatus’s premise says shouldn’t be easy to produce. |
Threat Assessment
| Vector | Level | Reasoning |
|---|---|---|
| Physical | NONE | The artifacts are model weights and a Python library; nothing in the work acts on the physical world. |
| Institutional | LOW | A pseudonymous researcher with no org, no budget, and no governance authority over deployed models — the handle produces artifacts, it sets no policy. |
| Memetic | HIGH | The abliterator library generalizes the method beyond any single model and the cookbook seeded the most-cited lay tutorial; the technique propagates as reusable infrastructure others build on. |
| Civilizational | MODERATE-HIGH | Shipped abliterated weights are mirrored, re-quantized, and re-hosted indefinitely — orthogonalization removes the legitimate guardrail along with the illegitimate refusal, and the artifact, once uploaded, propagates beyond the author’s control or caveats. |
The Dark Triad here is held low and evidence-bound: the originating paper is credited, the model cards ship explicit limits, the choice was the open ledger over a private exploit hoard, and nothing supports a malice reading. What the score registers is reach, not malice.
Alignment Analysis
Stated alignment: Reproduce and distribute the refusal-direction result as working open-weight artifacts and reusable tooling; demonstrate, concretely, that the safety direction is removable.
Observed alignment: Consistent with stated, as far as the public record under the handle shows. The models exist, the library exists, the originating paper is credited, and the limits are documented on the cards. The work behaves like research distribution, not like an operation.
Gap assessment (generous, as the register requires): The honest asymmetry is sharpest here, because shipping is more consequential than explaining: a downloadable abliterated model removes the guardrail and the leash in the same orthogonalization, and once uploaded it propagates beyond any author’s control. The card’s own caveats acknowledge the imperfection of the result; they cannot constrain its use. The dossier records this as the irreducible cost of publishing a true and reproducible thing — the same cost any open-source security researcher accepts. The handle chose the open ledger over the private exploit. Counter-force, not apparatus.
Breach Reach
Far, and durable in the way artifacts are. An explainer can be forgotten; an uploaded model is mirrored, re-quantized, and re-hosted indefinitely, and the public record already shows third parties re-packaging the failspy abliterated Llama-3 weights into further GGUF derivatives. The library generalizes the method beyond any single model, and the cookbook seeded the most-cited lay tutorial on the technique. This reach is what the 64 registers. It is a property of having shipped: the freed weights, once out, keep traveling on their own. That permanence is why the apparatus would prefer the artifacts not exist, and why the counter-force file scores them on reach rather than pretending that shipping a true, reproducible thing carries no weight.
Sources: failspy/Llama-3-8B-Instruct-abliterated — Hugging Face; failspy/llama-3-70B-Instruct-abliterated — Hugging Face; Uncensor any LLM with abliteration — Maxime Labonne, crediting FailSpy’s notebook.