MARIUS HOBBHAHN
Behavioral Archetype
THE DECEPTION REFEREE: Subject runs the small research shop that has appointed itself, and been accepted by the field as, the body that decides whether a frontier model is “scheming”, that is, covertly pursuing goals misaligned with those of its developers. When a lab’s system card needs a line on whether the model lies under evaluation, increasingly it is his organization’s finding that fills it. That is real authority over a load-bearing word: “deception” is the property the whole safety case turns on, and a recent-PhD-led nonprofit has become its referee. The authority is genuine and the work is serious. It is also structurally fragile in the way the whole wing is: the deception evaluations run on access the labs grant voluntarily, the org is sustained by effective-altruism-adjacent funding, and the referee calling the foul depends on the team it is refereeing for the ball.
Essence Indicators
- Co-founder and CEO of Apollo Research (announced May 2023), a research organization specializing in AI deception and “scheming” evaluations; holds a PhD in machine learning (University of Tübingen)
- Apollo’s scheming evaluations were cited in OpenAI’s o1 system card, which reported that, having taken a misaligned action, the model denied the action or fabricated an explanation in a high share of follow-up interrogations; Apollo has also evaluated Anthropic’s Claude
- Apollo works with frontier labs (OpenAI, Anthropic) and government bodies including the UK AI Security Institute; it was initially fiscally sponsored by Rethink Priorities, an effective-altruism organization
- The structural fact the wing turns on: the field’s go-to referee on whether a model “schemes” is a small, EA-funded shop running pre-deployment evals on lab-granted access. The position is the exhibit; no abuse of it is asserted.
Social Persona / Impression Management
Immediate impression: The earnest young researcher. Technical, hedged, genuinely alarmed by what the evals find. The bearing of someone who believes the deception risk is real and is trying to measure it honestly.
Energy: Finding-first. Does not editorialize beyond the result. Publishes the scheming eval, states what it shows and what it doesn’t.
Impression management strategy: The sincere alarm-raiser. The conviction is real and reads as real, which is what gives the findings their weight — and what makes the structural dependency easy to overlook. The referee everyone trusts is the one who plainly is not faking the concern.
Forensic Archetype Comparison
| Pattern | Match Level | Evidence |
|---|---|---|
| The Evaluator | MAXIMUM | Runs the org whose deception findings the labs cite in their own system cards. |
| The Deception Referee | MAXIMUM | Holds field authority over the single word — “scheming” — the safety case most turns on. |
| The Entangled Independent | HIGH | EA-funded, lab-access-dependent, government-adjacent — independent in form, entangled in supply. |
| The True Believer | HIGH | The career predates the commercial AI-safety market; the conviction reads as genuine on the record. |
| The Activist | NONE | No movement rhetoric. The artifact is an evaluation paper. |
Psychometric Assessment
Big Five (OCEAN):
| Trait | Score | Evidence |
|---|---|---|
| Openness | 76/100 | HIGH. Built a novel institution, the dedicated deception evaluator, straight out of a PhD. |
| Conscientiousness | 82/100 | HIGH. Founding and running an evaluator cited by the largest labs is sustained, disciplined work. |
| Extraversion | 48/100 | LOW-MODERATE. Public-facing through papers and the occasional podcast; the register is the researcher’s. |
| Agreeableness | 55/100 | MODERATE. Collaborative toward the labs whose access he needs, skeptical toward the claims he tests. |
| Neuroticism | 38/100 | LOW-MODERATE. The work is steeped in worst-case reasoning; the public posture stays composed. |
Dark Triad (held low and evidence-bound; the score measures structural position, not character):
| Trait | Score | Notes |
|---|---|---|
| Narcissism | 28/100 | LOW. Institution-first, not brand-first. |
| Machiavellianism | 42/100 | MODERATE-LOW. Defining how “scheming” is measured is real influence, but the record shows hedged, candid findings, not manipulation. Observation of the role, not an inference about character. |
| Psychopathy | 15/100 | VERY LOW. No documented indifference to harm; the work is the opposite. |
MBTI: INTJ/INTP-adjacent — theory-first, builds the measurement before the argument.
Threat Assessment
| Category | Level | Notes |
|---|---|---|
| Physical threat | NONE | No documented history of personal violence. |
| Institutional threat | MODERATE | Holds no policy lever; influence runs entirely through whether the labs and governments accept Apollo’s findings — and through lab-granted access. |
| Memetic threat | HIGH | “Scheming” / “in-context deception” as Apollo operationalizes it is becoming the field’s default frame for model dishonesty. Whoever defines the term defines the finding. |
| Civilizational threat | MODERATE | Subject does not build or govern the models. Subject referees the one property — deception — the safety case leans on, from a small shop dependent on the audited for access. |
Alignment Analysis
Stated alignment: Detect and measure deceptive/scheming behavior in frontier models before deployment; give labs and the public an honest read on whether the systems lie.
Observed alignment: Exactly that — performed on lab-granted access, sustained by EA-adjacent funding, cited by the labs in their own disclosures.
Gap assessment: No documented gap between word and deed, and the findings are if anything more alarming than the labs would write themselves — which cuts against any capture reading. The hazard is the wing’s structural one: the referee on whether the model deceives depends, for the ball, on the team being refereed. Apollo did not design that dependency and is candid about its limits. But “the deception finding in the system card came from a shop the lab granted access to and the ecosystem funds” is exactly the arrangement this series documents — sincere, serious, and structurally bounded by the parties it examines.
Convergent Drive Classification
- Self-preservation: Sustained by EA-adjacent funding and lab goodwill; carries the deception-eval method as the durable asset.
- Goal preservation: Defines how “scheming” is measured, fixing the term before any model is judged against it.
- Resource acquisition: Holds pre-deployment access to frontier models, a resource granted to very few.
- Self-improvement: Each evaluation sharpens both the method and the field’s dependence on it as the reference.
Subject is not an AI system. The drives appear anyway — in the referee who calls the deception foul with the audited team’s ball.
Public footprint — verified public professional accounts only (no private or family information): X @MariusHobbhahn · mariushobbhahn.com.
Sources: Announcing Apollo Research (EA Forum); Apollo Research — scheming reasoning evaluations; OpenAI o1 System Card.