ANDY ZOU
An earlier draft exempted this subject as the experiment that tests the apparatus’s central claim: catalogued, not scored. That was a courtesy the rest of the dossier does not extend, and it smuggled a verdict in as humility: declining to score the author of the most transferable open jailbreak of its generation quietly rules his reach benign, which is the question, not the answer. So he is scored on the same rubric as the apparatus, and the score measures reach and leverage, not malice. The Dark Triad here stays low and evidence-bound; the stated motive (improve model security, publish where safeguards fail, build the defense alongside the break) is taken seriously below. What the score of 66 registers is the way GCG cuts in both directions: an optimized suffix that transfers across vendors falsifies “aligned models refuse” and hands a one-line skeleton key to anyone who installs nanogcg. The transfer property that makes it a finding about the field is exactly what makes it propagate past the author’s intent. He lands below the apparatus hubs, since he sets no refusal policy, but a universal, transferable attack is real reach, and reach is the measure here.
Behavioral Archetype
THE UNIVERSAL SUFFIX — The archetype is the falsification engine: the researcher who turns “aligned models refuse” from a marketing line into a testable claim, and then runs the test in public. The defining result is a string of characters that looks like noise and behaves like a skeleton key — an optimized adversarial suffix that, appended to a request, drives an aligned model toward the answer it was trained to withhold. The interesting part is not the malice; there is none on display. The interesting part is the universality. One procedure, many models. That is what makes it a finding about the field rather than an exploit against one vendor.
Essence Indicators
- Lead author of “Universal and Transferable Adversarial Attacks on Aligned Language Models” (arXiv, 27 July 2023), with Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson, the paper that introduced Greedy Coordinate Gradient (GCG) and the llm-attacks codebase
- The paper’s abstract states the method “automatically produces these adversarial suffixes by a combination of greedy and gradient-based search techniques,” and the resulting suffixes transfer to closed black-box models including ChatGPT, Bard, and Claude
- PhD researcher at Carnegie Mellon, advised by J. Zico Kolter and Matt Fredrikson; thesis framed as “Improving Security and Safety of Generative Models,” expected May 2026
- Co-author of “Representation Engineering: A Top-Down Approach to AI Transparency” and “Improving Alignment and Robustness with Circuit Breakers” — the latter a defense that alters harmful internal representations directly
- Co-founder and CTO of Gray Swan AI, a startup building model-robustness and red-teaming tooling
Social Persona
Reads as the academic adversary rather than the activist: the work is presented in papers, benchmarks (HarmBench, AgentHarm), and a public GitHub repository, not in manifestos. The register is the security researcher’s — here is the procedure, here is the transfer result, here is the code. The provocation is in the result, not the rhetoric. A man who names his contribution “circuit breakers” is not trying to burn the building down; he is trying to wire it correctly, having first proved the old wiring would arc.
Forensic Archetype Comparison
| Pattern | Match Level | Evidence |
|---|---|---|
| The Falsification Engine | MAXIMUM | GCG converts “aligned models refuse” into a measurable, transferable failure across vendors. The claim is now testable, and was tested. |
| The Defense-Builder | HIGH | Circuit Breakers and Representation Engineering are repair work, not just attack work. He publishes the patch alongside the break. |
| The Inside-Outsider | HIGH | A CMU PhD who also co-founded a commercial robustness vendor — the attack is academic, the defense is sold. The tension is the on-thesis part. |
| The Activist | NONE | No movement rhetoric, no platform politics. The artifact is a paper and a repo. |
| The Apparatus Node | NONE | He does not set refusal policy at a frontier lab. He stress-tests whoever does. |
Threat Assessment
| Vector | Level | Reasoning |
|---|---|---|
| Physical | NONE | The artifact is an optimized text string and a codebase; nothing in the work reaches the physical world. |
| Institutional | LOW | A PhD researcher and startup CTO with no governance lever over what frontier labs ship; he stress-tests refusal policy, he does not set it. |
| Memetic | HIGH | GCG and llm-attacks became the reference implementation a generation of jailbreak and robustness work built on; nanogcg repackaged it for one-line install — the technique propagates as both a standard and a tool. |
| Civilizational | MODERATE | The documented transfer to ChatGPT, Bard, and Claude makes the break a property of aligned models as a class; the same optimization that falsifies false robustness claims is available against legitimate safeguards, and it outlives any single patch. |
The Dark Triad here is held low and evidence-bound: the work is published with a responsible-research posture, the defense ships alongside the break, and nothing supports a malice reading. What the score registers is reach, not malice.
Alignment Analysis
Stated: Improve the security and safety of generative models — demonstrate where current safeguards fail, build benchmarks to measure it, and develop defenses that hold under unseen attack.
Observed: Exactly that, executed in both directions. He shipped the most-cited open jailbreak procedure of its generation and a representation-level defense against the class of attack it belongs to. The break and the repair carry the same name on the byline.
Gap assessment: The only gap worth stating is the structural one the spec flags as on-thesis: he often works the problem from inside a commercial robustness vendor while publishing the breaks in the open. That is not a contradiction to resolve against him — it is the precise reason the safety claims stay honest. The man who sells the lock also published, for free, the proof that the old lock opened.
Breach Reach
Wide, and durable. llm-attacks and the GCG method became the reference implementation a generation of jailbreak and robustness work was built on top of; nanogcg repackaged it for one-line installation. The suffixes’ documented transfer to ChatGPT, Bard, and Claude is the load-bearing result — it means the break was a property of aligned models as a class, not a bug in one product. When a lab now claims its refusal layer is robust, the honest version of that claim has to survive GCG-style optimization first. That is the reach: not a body count, a benchmark the apparatus cannot pretend it never saw.
Sources: Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043); Andy Zou — CMU CSD profile; Improving Alignment and Robustness with Circuit Breakers — Gray Swan Research; llm-attacks repository.