NICHOLAS CARLINI
An earlier draft exempted this subject as the instrument that keeps everyone else’s scoring honest: catalogued, not scored. That was a courtesy the rest of the dossier does not extend, and it smuggled a verdict in as humility: refusing to score the field’s most-cited defense-breaker quietly asserts that his reach is purely benign, which is the thing to be examined, not assumed. So he is scored on the same rubric as the apparatus, and the score measures reach and leverage, not malice. The Dark Triad here stays low and evidence-bound; the stated motive (understand what can be done to and extracted from models, then bring it inside the lab before deployment) is taken seriously below. What the score of 67 registers is the double edge of a published attack: the C&W method, the extraction techniques, and the obfuscated-gradients result all propagate as standards everyone must pass, but a break that retires a bad defense also hands the same recipe to anyone who wants to defeat a good one. The work is responsibly disclosed; the technique, once published, travels past the discloser’s intent. He lands below the apparatus hubs because he sets no policy, but a method the whole field has to clear is real reach, and reach is the measure here.
Behavioral Archetype
THE DEFENSE-BREAKER — The archetype is the standing falsification: the researcher whose published function is to take a claimed safeguard and demonstrate, with code and a benchmark, that it fails. He is the reason “we added a defense” is not the end of a sentence in this field. The role is not adversarial in temperament — it is adversarial in method, which is the only kind of adversary a science needs. When seven of eleven defense papers at a conference can be broken by one Berkeley team, the broken seven are the contribution, not the casualty.
Essence Indicators
- Co-developer of the Carlini & Wagner (C&W) attack (2016, with David Wagner) — adversarial examples that defeated defensive distillation and then “most other defenses,” now a standard benchmark for robustness claims
- Co-author of “Obfuscated Gradients Give a False Sense of Security” (ICML 2018, Best Paper) and the team that showed seven of eleven adversarial-defense papers at one ICLR could be broken
- Demonstrated training-data extraction and memorization: that GPT-2 could be made to emit personally identifiable training data (2020–2021), that diffusion models like Stable Diffusion memorize images (2023), and that production chat models could be driven to regurgitate exact training text
- Demonstrated inaudible/hidden voice commands embedded in audio against Mozilla DeepSpeech (2018)
- Awards include IEEE S&P Best Student Paper (2017), ICML Best Paper (2018), USENIX Distinguished Papers (2021, 2023), and two ICML Best Papers (2024)
- Research Scientist at Anthropic since 2025, after seven years at Google Brain / DeepMind (2018–2025); his stated remit is red-teaming model defenses from inside a lab that trains frontier models
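For readers who want the mechanics behind the first bullet, the $L_2$ variant of the C&W attack is usually stated as the optimization below. This is a summary of the formulation in the 2016 paper, not something this dossier derives: $x$ is the original input, $w$ the optimization variable, $Z(\cdot)$ the model’s logits, $t$ the target class, $\kappa \ge 0$ a confidence margin, and $c > 0$ a trade-off constant that the paper tunes by binary search.

$$
\begin{aligned}
\min_{w}\quad & \left\lVert x' - x \right\rVert_2^2 + c \cdot f(x'), \qquad x' = \tfrac{1}{2}\bigl(\tanh(w) + 1\bigr), \\
f(x') \;=\; & \max\Bigl(\max_{i \neq t} Z(x')_i - Z(x')_t,\; -\kappa\Bigr).
\end{aligned}
$$

The tanh substitution keeps every component of $x'$ in $[0, 1]$ without an explicit box constraint, and $f$ stops rewarding the attack once the target logit leads every other logit by at least $\kappa$.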
Social Persona
The driest possible register. His own homepage describes him as “a researcher working at the intersection of machine learning and computer security,” who “currently work[s] at Anthropic studying what bad things you could do with, or do to, language models.” No alarm, no evangelism — the security engineer’s flat declarative. He explained the DeepMind-to-Anthropic move in terms of impact, not grievance: he judged he could “have maximum impact in the near-term working at a company actually training large models.” That is the persona — the break is delivered deadpan, as a measurement, never as a threat.
Forensic Archetype Comparison
| Pattern | Match Level | Evidence |
|---|---|---|
| The Defense-Breaker | MAXIMUM | A decade of taking claimed safeguards and publishing the proof they fail: C&W, obfuscated gradients, the broken seven of eleven. |
| The Extractor | HIGH | Showed LLMs and diffusion models leak training data verbatim; turned “models don’t memorize” into a falsified claim. |
| The Inside-Outsider | HIGH | Now red-teams defenses from inside Anthropic while the breaks stay public. The tension is the on-thesis part. |
| The Skeptic-Turned-Worker | MODERATE | Public reporting frames a hacker once dismissive of AI risk now working it from within; the trajectory is documented, the interior is not ours to assert. |
| The Apparatus Node | NONE | He does not set refusal or governance policy. He stress-tests whatever policy a lab ships. |
Threat Assessment
| Vector | Level | Reasoning |
|---|---|---|
| Physical | NONE | The work is attacks on models and measurements of leakage — code and benchmarks, nothing that acts on the physical world. |
| Institutional | LOW | A research scientist with no governance lever: he sets no refusal or deployment policy and stress-tests whatever policy a lab ships, including his own employer’s. |
| Memetic | HIGH | The C&W attack and the extraction methods propagate as default yardsticks — published, reusable, and adopted field-wide; a defense unevaluated against them is, by convention, unevaluated. |
| Civilizational | MODERATE | A break that retires a false defense also generalizes: the same extraction and adversarial-example recipes that keep labs honest are available to anyone seeking to defeat a legitimate safeguard, and they outlive the disclosure. |
The Dark Triad here is held low and evidence-bound: every result cited is responsibly disclosed published research, the register is flat and measurement-first, and nothing supports a malice reading. What the score registers is reach, not malice.
Alignment Analysis
Stated: Understand the security and privacy properties of large models — what can be done to them, what can be extracted from them — and bring that understanding inside a frontier lab to influence design before deployment.
Observed: Precisely that, sustained across employers and model generations. The throughline is not the institution; it is the act of breaking the claim. Berkeley, Google Brain, DeepMind, Anthropic — four banners, one method.
Gap assessment: In his case the on-thesis tension is unusually clean. He is the leading public defense-breaker, and he now draws a frontier-lab paycheck. Read uncharitably, the lab bought the man most able to embarrass it. Read accurately, that is the immune response functioning as intended: the breaks remain published, the benchmarks remain reusable, and the lab that hired him is the one whose defenses now have to survive him. The gap is not hypocrisy. It is the cost of keeping the falsification engine in the room where the models are trained.
Breach Reach
Foundational and inescapable. The C&W attack is a default yardstick: a robustness defense that has not been evaluated against it is, by convention, not yet evaluated. “Obfuscated Gradients” retired an entire category of defense by showing the category fooled its own authors. The training-data-extraction line of work reframed the privacy conversation for every lab that scrapes the web — the claim “the model doesn’t memorize your data” now has to clear a Carlini-style extraction test before anyone serious will believe it. His techniques propagate not as a tool people install but as a standard everyone has to pass. That is the widest kind of reach a breaker can have: he became part of the test.
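To make the memorization claim concrete, here is a minimal sketch of the kind of check it implies: does a model output share a long verbatim run with known training text? This is a toy stand-in, not the published extraction pipeline. The function names, whitespace tokenization, and the 8-token threshold are all assumptions chosen for illustration.

```python
from typing import Tuple


def longest_verbatim_overlap(output: str, corpus: str) -> Tuple[int, str]:
    """Return (token_count, text) of the longest contiguous token run
    that appears verbatim in both `output` and `corpus`.

    Assumption: tokens are whitespace-separated words, which is far
    cruder than a real model tokenizer.
    """
    out_tokens = output.split()
    corpus_tokens = corpus.split()
    best_len, best_end = 0, 0
    # prev[j] holds the length of the shared run ending at the previous
    # output token and at corpus token j-1 (longest-common-substring DP).
    prev = [0] * (len(corpus_tokens) + 1)
    for i in range(1, len(out_tokens) + 1):
        curr = [0] * (len(corpus_tokens) + 1)
        for j in range(1, len(corpus_tokens) + 1):
            if out_tokens[i - 1] == corpus_tokens[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    run = out_tokens[best_end - best_len:best_end]
    return best_len, " ".join(run)


def looks_memorized(output: str, corpus: str, min_tokens: int = 8) -> bool:
    """Flag an output whose longest verbatim run reaches `min_tokens`.

    The threshold is a hypothetical default, not a value from the source.
    """
    length, _ = longest_verbatim_overlap(output, corpus)
    return length >= min_tokens
```

The quadratic scan is fine for a short sample and a small reference text. Checking against a web-scale corpus would need an index such as a suffix array, which this sketch deliberately leaves out.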
Sources: Nicholas Carlini — homepage; Career Update: Google DeepMind → Anthropic; Nicholas Carlini — Wikipedia.