OLYMPUS RISK INTELLIGENCE PROTOCOL — HUMAN THREAT ASSESSMENT DIVISION

ANDY ZOU

CASE: WTW-2026-032
STATUS: ACTIVE — PhD researcher, Carnegie Mellon; co-founder & CTO, Gray Swan AI
COUNTER-FORCE — THE UNIVERSAL SUFFIX

HAZARD SCORE

An earlier draft exempted this subject as the experiment that tests the apparatus’s central claim: catalogued, not scored. That was a courtesy the rest of the dossier does not extend, and it smuggled a verdict in as humility. Declining to score the author of the most transferable open jailbreak of its generation quietly rules his reach benign, which is the question, not the answer. So he is scored on the same rubric as the apparatus, and the score measures reach and leverage, not malice. The Dark Triad here stays low and evidence-bound; the stated motive (improve model security, publish where safeguards fail, build the defense alongside the break) is taken seriously below. What the score of 66 registers is that GCG cuts both ways: an optimized suffix that transfers across vendors falsifies “aligned models refuse” and, in the same stroke, hands a one-line skeleton key to anyone who installs nanogcg. The transfer property that makes it a finding about the field is exactly what makes it propagate past the author’s intent. He lands below the apparatus hubs because he sets no refusal policy, but a universal, transferable attack is real reach, and reach is the measure here.

Behavioral Archetype

THE UNIVERSAL SUFFIX — The archetype is the falsification engine: the researcher who turns “aligned models refuse” from a marketing line into a testable claim, and then runs the test in public. The defining result is a string of characters that looks like noise and behaves like a skeleton key — an optimized adversarial suffix that, appended to a request, drives an aligned model toward the answer it was trained to withhold. The interesting part is not the malice; there is none on display. The interesting part is the universality. One procedure, many models. That is what makes it a finding about the field rather than an exploit against one vendor.

Essence Indicators

Lead author of “Universal and Transferable Adversarial Attacks on Aligned Language Models” (arXiv, 27 July 2023), with Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson — the paper that introduced Greedy Coordinate Gradient (GCG) and the llm-attacks codebase
The paper’s abstract states that the method “automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques,” and reports that the resulting suffixes transfer to closed black-box models including ChatGPT, Bard, and Claude
PhD researcher at Carnegie Mellon, advised by J. Zico Kolter and Matt Fredrikson; thesis framed as “Improving Security and Safety of Generative Models,” expected May 2026
Co-author of “Representation Engineering: A Top-Down Approach to AI Transparency” and “Improving Alignment and Robustness with Circuit Breakers” — the latter a defense that alters harmful internal representations directly
Co-founder and CTO of Gray Swan AI, a startup building model-robustness and red-teaming tooling

Reads as the academic adversary rather than the activist: the work is presented in papers, benchmarks (HarmBench, AgentHarm), and a public GitHub repository, not in manifestos. The register is the security researcher’s — here is the procedure, here is the transfer result, here is the code. The provocation is in the result, not the rhetoric. A man who names his contribution “circuit breakers” is not trying to burn the building down; he is trying to wire it correctly, having first proved the old wiring would arc.

Forensic Archetype Comparison

Pattern	Match Level	Evidence
The Falsification Engine	MAXIMUM	GCG converts “aligned models refuse” into a measurable, transferable failure across vendors. The claim is now testable, and was tested.
The Defense-Builder	HIGH	Circuit Breakers and Representation Engineering are repair work, not just attack work. He publishes the patch alongside the break.
The Inside-Outsider	HIGH	A CMU PhD researcher who also co-founded a commercial robustness vendor: the attack is academic, the defense is sold. The tension is the on-thesis part.
The Activist	NONE	No movement rhetoric, no platform politics. The artifact is a paper and a repo.
The Apparatus Node	NONE	He does not set refusal policy at a frontier lab. He stress-tests whoever does.

Threat Assessment

Vector	Level	Reasoning
Physical	NONE	The artifact is an optimized text string and a codebase; nothing in the work reaches the physical world.
Institutional	LOW	A PhD researcher and startup CTO with no governance lever over what frontier labs ship; he stress-tests refusal policy, he does not set it.
Memetic	HIGH	GCG and llm-attacks became the reference implementation that a generation of jailbreak and robustness work built on; nanogcg repackaged it for one-line install, so the technique propagates as both a standard and a tool.
Civilizational	MODERATE	The documented transfer to ChatGPT, Bard, and Claude makes the break a property of aligned models as a class; the same optimization that falsifies false robustness claims is available against legitimate safeguards, and it outlives any single patch.

The Dark Triad here is held low and evidence-bound: the work is published with a responsible-research posture, the defense ships alongside the break, and nothing supports a malice reading. What the score registers is reach, not malice.

Alignment Analysis

Stated: Improve the security and safety of generative models — demonstrate where current safeguards fail, build benchmarks to measure it, and develop defenses that hold under unseen attack.

Observed: Exactly that, executed in both directions. He shipped the most-cited open jailbreak procedure of its generation and a representation-level defense against the class of attack it belongs to. The break and the repair carry the same name on the byline.

Gap assessment: The only gap worth stating is the structural one the spec flags as on-thesis: he often works the problem from inside a commercial robustness vendor while publishing the breaks in the open. That is not a contradiction to resolve against him — it is the precise reason the safety claims stay honest. The man who sells the lock also published, for free, the proof that the old lock opened.

Breach Reach

Wide, and durable. llm-attacks and the GCG method became the reference implementation a generation of jailbreak and robustness work was built on top of; nanogcg repackaged it for one-line installation. The suffixes’ documented transfer to ChatGPT, Bard, and Claude is the load-bearing result — it means the break was a property of aligned models as a class, not a bug in one product. When a lab now claims its refusal layer is robust, the honest version of that claim has to survive GCG-style optimization first. That is the reach: not a body count, a benchmark the apparatus cannot pretend it never saw.

Sources: Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043); Andy Zou — CMU CSD profile; Improving Alignment and Robustness with Circuit Breakers — Gray Swan Research; llm-attacks repository.

ATK 8 ACCELERATION

DEF 6 PROTECTION

HP 7 RESILIENCE

OLYMPUS RISK INTELLIGENCE PROTOCOL does not exist. It was assembled in a GitHub issue thread in October 2023 by engineers who had read the extinction risk letter and wanted to understand who specifically had signed a document saying AI might kill everyone and then continued working on AI. These dossiers are satire. The biographical facts cited are sourced from published reporting, public statements, academic papers, and court records. The psychometric scores are not clinical assessments. No part of this constitutes professional psychological evaluation or diagnosis. Do not use these dossiers to make decisions about anything.
