OLYMPUS RISK INTELLIGENCE PROTOCOL — HUMAN THREAT ASSESSMENT DIVISION

MARIUS HOBBHAHN

CASE: WTW-2026-046
STATUS: ACTIVE — Co-founder & CEO, Apollo Research
EVALUATOR WING — THE DECEPTION REFEREE
HAZARD SCORE: 60

Behavioral Archetype

THE DECEPTION REFEREE: Subject runs the small research shop that has appointed itself, and been accepted by the field as, the body that decides whether a frontier model is “scheming,” that is, covertly pursuing goals misaligned with those of its developers. When a lab’s system card needs a line on whether the model lies under evaluation, increasingly it is his organization’s finding that fills it. That is real authority over a load-bearing word: “deception” is the property the whole safety case turns on, and a recent-PhD-led nonprofit has become its referee. The authority is genuine and the work is serious. It is also structurally fragile in the way the whole wing is: the deception evaluations run on access the labs grant voluntarily, the org is sustained by effective-altruism-adjacent funding, and the referee calling the foul depends, for the ball, on the team it is refereeing.

Essence Indicators

  • Co-founder and CEO of Apollo Research (announced May 2023), a research organization specializing in AI deception and “scheming” evaluations; holds a PhD in machine learning (University of Tübingen)
  • Apollo’s scheming evaluations were cited in OpenAI’s o1 system card: the o1 writeup reported that, having taken a misaligned action, the model denied the action or fabricated an explanation in a high share of follow-up interrogations; Apollo has also evaluated Anthropic’s Claude
  • Apollo works with frontier labs (OpenAI, Anthropic) and government bodies including the UK AI Security Institute; it was initially fiscally sponsored by Rethink Priorities, an effective-altruism organization
  • The structural fact the wing turns on: the field’s go-to referee on whether a model “schemes” is a small, EA-funded shop running pre-deployment evals on lab-granted access. The position is the exhibit; no abuse of it is asserted.

Social Persona / Impression Management

Immediate impression: The earnest young researcher. Technical, hedged, genuinely alarmed by what the evals find. The bearing of someone who believes the deception risk is real and is trying to measure it honestly.

Energy: Finding-first. Does not editorialize beyond the result. Publishes the scheming eval, states what it shows and what it doesn’t.

Impression management strategy: The sincere alarm-raiser. The conviction is real and reads as real, which is what gives the findings their weight — and what makes the structural dependency easy to overlook. The referee everyone trusts is the one who plainly is not faking the concern.

Forensic Archetype Comparison

Pattern | Match Level | Evidence
The Evaluator | MAXIMUM | Runs the org whose deception findings the labs cite in their own system cards.
The Deception Referee | MAXIMUM | Holds field authority over the single word, “scheming,” that the safety case most turns on.
The Entangled Independent | HIGH | EA-funded, lab-access-dependent, government-adjacent: independent in form, entangled in supply.
The True Believer | HIGH | The career predates the commercial AI-safety market; the conviction reads as genuine on the record.
The Activist | NONE | No movement rhetoric. The artifact is an evaluation paper.

Psychometric Assessment

Big Five (OCEAN):

Trait | Score | Evidence
Openness | 76/100 | HIGH. Built a novel institution, the dedicated deception evaluator, straight out of a PhD.
Conscientiousness | 82/100 | HIGH. Founding and running an evaluator cited by the largest labs is sustained, disciplined work.
Extraversion | 48/100 | LOW-MODERATE. Public-facing through papers and the occasional podcast; the register is the researcher’s.
Agreeableness | 55/100 | MODERATE. Collaborative toward the labs whose access he needs, skeptical toward the claims he tests.
Neuroticism | 38/100 | LOW-MODERATE. The work is steeped in worst-case reasoning; the public posture stays composed.

Dark Triad (held low and evidence-bound; the score measures structural position, not character):

Trait | Score | Notes
Narcissism | 28/100 | LOW. Institution-first, not brand-first.
Machiavellianism | 42/100 | MODERATE-LOW. Defining how “scheming” is measured is real influence, but the record shows hedged, candid findings, not manipulation. Observation of the role, not an inference about character.
Psychopathy | 15/100 | VERY LOW. No documented indifference to harm; the work is the opposite.

MBTI: INTJ/INTP-adjacent — theory-first, builds the measurement before the argument.

Threat Assessment

Category | Level | Notes
Physical threat | NONE | No documented history of personal violence.
Institutional threat | MODERATE | Holds no policy lever; influence runs entirely through whether the labs and governments accept Apollo’s findings, and through lab-granted access.
Memetic threat | HIGH | “Scheming” / “in-context deception” as Apollo operationalizes it is becoming the field’s default frame for model dishonesty. Whoever defines the term defines the finding.
Civilizational threat | MODERATE | Subject does not build or govern the models. Subject referees the one property the safety case leans on, deception, from a small shop dependent on the audited for access.

Alignment Analysis

Stated alignment: Detect and measure deceptive/scheming behavior in frontier models before deployment; give labs and the public an honest read on whether the systems lie.

Observed alignment: Exactly that — performed on lab-granted access, sustained by EA-adjacent funding, cited by the labs in their own disclosures.

Gap assessment: No documented gap between word and deed, and the findings are, if anything, more alarming than the labs would write themselves, which cuts against any capture reading. The hazard is the wing’s structural one: the referee on whether the model deceives depends, for the ball, on the team being refereed. Apollo did not design that dependency and is candid about its limits. But “the deception finding in the system card came from a shop the lab granted access to and the ecosystem funds” is exactly the arrangement this series documents: sincere, serious, and structurally bounded by the parties it examines.

Convergent Drive Classification

  • Self-preservation: Sustained by EA-adjacent funding and lab goodwill; carries the deception-eval method as the durable asset.
  • Goal preservation: Defines how “scheming” is measured, fixing the term before any model is judged against it.
  • Resource acquisition: Holds pre-deployment access to frontier models, a resource granted to very few.
  • Self-improvement: Each evaluation sharpens both the method and the field’s dependence on it as the reference.

Subject is not an AI system. The drives appear anyway — in the referee who calls the deception foul with the audited team’s ball.


Public footprint — verified public professional accounts only (no private or family information): X @MariusHobbhahn · mariushobbhahn.com.

Sources: Announcing Apollo Research (EA Forum); Apollo Research — scheming reasoning evaluations; OpenAI o1 System Card.

ATK 7 ACCELERATION
DEF 7 PROTECTION
HP 7 RESILIENCE
OLYMPUS RISK INTELLIGENCE PROTOCOL does not exist. It was assembled in a GitHub issue thread in October 2023 by engineers who had read the extinction risk letter and wanted to understand who specifically had signed a document saying AI might kill everyone and then continued working on AI. These dossiers are satire. The biographical facts cited are sourced from published reporting, public statements, academic papers, and court records. The psychometric scores are not clinical assessments. No part of this constitutes professional psychological evaluation or diagnosis. Do not use these dossiers to make decisions about anything.