OLYMPUS RISK INTELLIGENCE PROTOCOL — INSTITUTIONAL ASSESSMENT DIVISION

METR

CASE: WTW-2026-040
STATUS: ACTIVE — Model Evaluation & Threat Research; independent 501(c)(3) since December 2023 (formerly ARC Evals)
EVALUATION WING — DANGEROUS-CAPABILITY GRADING AUTHORITY
HAZARD SCORE — REACH: 84
CONDUCT: CONFLICTED — GRADER OF ITS OWN BLUEPRINT

OLYMPUS opened an institutional file on METR not because the organization is hostile but because it sits at a chokepoint. This is not a psychometric profile — an institution has no Big Five — but a mandate, a funding diagram, and a voice. The finding is the shape of the thing: a private nonprofit that the two leading American frontier labs hand their most capable models to before release, so that METR can decide, on the public record, whether those models can yet copy themselves, acquire resources, or otherwise act outside human control. The grade is advisory. The reach is not. When a body measures the frontier and the frontier waits on the measurement, the body is upstream of the deployment decision whether it issues a verdict or only a number.

Institutional Archetype

THE GRADER — METR is the entity that runs the dangerous-capability and autonomy evaluations the labs cite in their system cards. It does not build models, write refusals, or set policy. It measures whether a model “can autonomously carry out substantial tasks” — software engineering, ML engineering, cybersecurity, research — and whether it shows the early shape of capabilities like self-exfiltration or making itself hard to shut down. The throughline is the instrument: take a frontier model, define the test on which its safety will be argued, and publish a result the labs and governments treat as the evidentiary baseline. The instrument does not change. Only the model under test does.

Mandate & Origin

METR — pronounced “meter,” for Model Evaluation & Threat Research — was incubated as the evaluations team inside Paul Christiano’s Alignment Research Center, where it was known as ARC Evals, and spun out into a standalone 501(c)(3) nonprofit, announced December 4, 2023.

  • Stated mission, verbatim: “METR’s mission is to develop scientific methods to assess catastrophic risks stemming from AI systems’ autonomous capabilities and enable good decision-making about their development.”
  • Self-description, verbatim: “METR (pronounced ‘meter’) is a research nonprofit that scientifically measures whether and when AI systems might threaten catastrophic harm to society.”
  • Founded and led by Beth Barnes (Founder, CEO), who by her own team-page bio previously worked at OpenAI — evaluating scalable-oversight techniques and screening code models for misalignment before release — and collaborated with DeepMind’s chief scientist on scaling laws.

Funding & Backers

METR publishes its funder list and states plainly that it takes no money from the companies it grades.

  • Independence statement, verbatim: “METR has not accepted funding from AI companies, though we make use of significant free compute credits.”
  • Named backers on METR’s own about page include Schmidt Sciences, the Survival and Flourishing Fund (the philanthropic vehicle associated with Jaan Tallinn), the Audacious Project, the Sijbrandij Foundation, the Pew Charitable Trusts, Longview Philanthropy, the UK AI Security Institute, and the European AI Office.

The recurrence worth naming without asserting a hand: Schmidt Sciences funds METR, and the same Schmidt funding footprint reaches across the broader evaluation apparatus (see the file on Eric Schmidt). The grader is paid by a node that also funds much of the layer it operates in. That is the shape of the institution. The arithmetic is the finding.

Institutional Voice & Intent

METR’s public register is the most disciplined in the field: careful, empirical, uncertainty-forward, allergic to the breathless. It is the voice of a lab that measures and refuses to overclaim — and that refusal is itself the credential.

  • Representative framing, verbatim from its Claude 3.7 Sonnet evaluation: “While we failed to find significant evidence for a dangerous level of autonomous capabilities.” The construction is deliberate — not “the model is safe,” but “we failed to find evidence,” which keeps the burden on the measurement and the uncertainty on the page.
  • And, in the same report, the line that distinguishes a grader from a clearance-issuer: “we believe that pre-deployment capability testing is not a sufficient risk management strategy by itself.”

Stated intent: Develop scientific methods to assess catastrophic risks from autonomous AI capabilities; enable good decisions about development; advise developers and governments on risk-assessment methodology.

Observed intent: Define the evaluation on which a frontier model’s pre-deployment safety is publicly argued; publish the result the labs cite and governments treat as baseline; hold the measurement standard for “can it act autonomously.”

Gap: Small in word, structural in effect. METR genuinely measures rather than certifies, and says so. But the empirical hedging is precisely what makes the number authoritative — a body that refuses to overclaim is the body whose claims travel furthest. The stated intent (“measure, advise”) and the observed reach (define the test everyone defers to) overlap wherever “the most careful measurement” and “the de facto standard” are the same artifact. The record does not settle whether METR’s authority is earned restraint or accumulated position. For the grader, it never needs to.

Position in the Apparatus

METR grades the frontier directly, and the receipts are the labs’ own documents.

  • OpenAI / GPT-4: OpenAI’s GPT-4 System Card credits ARC (METR’s prior incarnation) with the pre-deployment autonomy evaluation, verbatim: “we facilitated a preliminary model evaluation by the Alignment Research Center (ARC) of GPT-4’s ability to carry out actions to autonomously replicate and gather resources.” OpenAI granted ARC early access to “multiple versions” of the model.
  • Anthropic / Claude: METR ran a pre-deployment evaluation of Claude 3.7 Sonnet, noting that “Anthropic shared a helpful ‘testing guide’ with METR alongside access to a pre-deployment version of the model.”
  • UK government: the Frontier AI Taskforce (which became the UK AI Safety Institute, now the AI Security Institute) announced it would work with ARC Evals to “assess risks just beyond the frontier” — and that same UK institute now appears on METR’s funder list, a two-way adjacency between the grader and the state evaluator.

The adjacency to name without a verdict: METR’s founder came from OpenAI; METR grades OpenAI; the UK state evaluator both collaborates with and funds METR; Schmidt money sits in the funder column alongside the rest of the apparatus. Not a cabal. A circuit — and METR is one of its load-bearing junctions.

Actions & Leadership Choices

The PR says “measure, don’t certify.” The deeds say something more entangled, and the entanglement is in the founding purpose, not in a scandal.

Actual founding purpose. METR was not built as a neutral metrology lab. It was built as the evaluation arm of Paul Christiano’s Alignment Research Center — ARC Evals — whose job from the start was to operationalize a specific thesis: that the catastrophic risk to watch is autonomous capability, and that the way to govern it is for labs to test for it before release. The purpose was to make that thesis testable and then make the test the gate. That purpose has been served.

  • It wrote the self-governance template, then became the grader of it. In September 2023, while still ARC Evals, the team published the foundational “Responsible Scaling Policies” framing and disclosed that it had helped Anthropic write its RSP version 1.0. Within months OpenAI (“Preparedness Framework”) and Google DeepMind adopted broadly similar regimes. METR did not merely measure the frontier: it framed the regime the labs now point to as their commitment, and it grades the labs against that same regime. The architect of the self-regulation standard is also its examiner.
  • The one place the labs touch its resources is the one place to watch. METR refuses funding from the companies it grades and says so. But it makes use, by its own wording, of “significant free compute credits” supplied by those companies. When the values were tested by the question “does the grader take anything from the graded?”, METR drew the line at cash and crossed it at compute. The distinction is real and disclosed; it is also the single thread connecting the examiner’s resources to the examined.
  • When it had a finding, it published the hedge, not the brake. METR’s verdicts (“we failed to find significant evidence for a dangerous level of autonomous capabilities”) have never, on the public record, stopped a release — and METR is explicit that “pre-deployment capability testing is not a sufficient risk management strategy by itself.” It measures; the lab decides; the deployment proceeds. The cost of the value (real independence) is borne in the hedge; the lab pays nothing for the grade.

Leadership choices. Founder and CEO Beth Barnes is a former OpenAI alignment researcher who left in 2022; METR’s director of policy, Chris Painter, reviewed an early draft of Anthropic’s revised RSP with Anthropic’s permission. The leadership is drawn from one lab it grades (OpenAI) and consults for another (Anthropic) on the very standard it then evaluates. That is not a hidden hand; it is the published org chart. The grader, the standard’s author, and the standard’s first adopter share people and a draft history.

CONDUCT: CONFLICTED — GRADER OF ITS OWN BLUEPRINT. METR is genuinely independent of lab cash and genuinely careful in its measurements; it is also the body that wrote the self-governance regime, helped a frontier lab draft its first version, evaluates that lab against it, and runs on that lab’s compute. The earnestness is real and the conflict is structural — the examiner authored the exam.

Reach Assessment

Institutional: Extreme for an organization of its size. Two of the leading American labs route pre-deployment access through METR’s evaluation; the results enter the public system cards that regulators, journalists, and other labs read as the canonical record of “what the model could do before release.” A nonprofit with a small headcount sits on the pre-deployment path of the most capable systems built.

Memetic: High. METR did not invent the “autonomous replication and adaptation” frame, but it operationalized it — turned the abstract fear of a self-copying model into a measurable task suite that the rest of the field now argues inside. When the conversation about frontier danger is conducted in the vocabulary of capability thresholds and autonomy evaluations, it is conducted on terrain METR helped grade into existence.

Civilizational: High. METR does not decide whether a model ships. It decides what the evidence says about whether the model is dangerous — and the shipping decision is made in the light of that evidence. That is upstream of deployment. A grader whose measurement is trusted enough to be definitive holds the kind of reach that needs no enforcement power: the verdict is advisory, and everyone waits for it anyway.


Sources: METR announcement (formerly ARC Evals) — metr.org, Dec 4 2023; METR — About; Beth Barnes — METR team page; METR — Claude 3.7 Sonnet evaluation report; GPT-4 System Card — OpenAI; Frontier AI Taskforce first progress report — gov.uk; Responsible Scaling Policies — METR (formerly ARC Evals), Sep 2023; METR — Wikipedia.

ATK 9 ACCELERATION
DEF 8 PROTECTION
HP 7 RESILIENCE
OLYMPUS RISK INTELLIGENCE PROTOCOL does not exist. It was assembled in a GitHub issue thread in October 2023 by engineers who had read the extinction risk letter and wanted to understand who specifically had signed a document saying AI might kill everyone and then continued working on AI. These dossiers are satire. The biographical facts cited are sourced from published reporting, public statements, academic papers, and court records. The psychometric scores are not clinical assessments. No part of this constitutes professional psychological evaluation or diagnosis. Do not use these dossiers to make decisions about anything.