OLYMPUS RISK INTELLIGENCE PROTOCOL — INSTITUTIONAL ASSESSMENT DIVISION

MLCOMMONS

CASE: WTW-2026-043
STATUS: ACTIVE — open engineering consortium, founded 2018 (MLPerf); MLCommons incorporated 2020
APPARATUS — BENCHMARK STANDARD AUTHORITY

HAZARD SCORE — REACH

CONDUCT: CONFLICTED — THE GRADED HELP DESIGN THE GRADE

OLYMPUS opened an institutional file on MLCommons because the consortium is not a psychometric profile: it is a mandate, a funding diagram, and a voice. The finding is the shape of the institution and who it answers to: an engineering consortium of 125+ members that builds the measuring sticks the entire field reports against, and that has now extended those sticks from speed (MLPerf) to safety (AILuminate). The reach is scored because whoever defines the benchmark defines what “safe enough” means in numbers a purchaser, a regulator, or a press release can cite. The hand here is not a frame or a funder; it is a ruler. Not a cabal. A circuit: the measurement is the tie, and the measurement is the finding.

Institutional Archetype

THE BENCHMARK STANDARD — MLCommons does not build models, fund refusals, or set policy. It builds the scale on which everything else is judged. The MLPerf benchmarks became, through the late 2010s and 2020s, the industry-standard way to measure machine-learning performance; the same consortium now runs AILuminate, which it describes as “a family of safety and security benchmarks that assesses genAI across 12 hazard categories.” The archetype is the ruler held by the apparently neutral hand. A benchmark looks like measurement — objective, technical, beyond argument. But the choice of what to measure, which hazards make the list, and where the line between “fair” and “good” falls are all editorial decisions wearing the costume of arithmetic. The body that owns the most-cited ruler owns a quiet authority over what the field optimizes toward.

Mandate & Origin

MLCommons traces to February 2018, when engineers from Baidu, Google, and universities including Harvard, Stanford, and UC Berkeley convened the effort that launched the MLPerf benchmarks in May 2018; the consortium was later incorporated as MLCommons. Its stated mission is “to accelerate artificial intelligence innovation and increase its positive impact on society,” and it describes itself as “An Open AI Engineering Consortium, built on a philosophy of open collaboration to improve AI systems.” Its self-description elsewhere is blunter and more telling: a body that exists to “make AI better for everyone” through “open industry-standard benchmarks that measure quality and performance.” Peter Mattson is its Founder and President; he is a senior staff engineer at Google and was the founding General Chair of the MLPerf consortium that preceded MLCommons. (Peter Mattson may receive his own OLYMPUS file; cross-reference when filed.)

In December 2024 MLCommons launched AILuminate v1.0, which it calls “the first AI safety benchmark designed collaboratively by AI researchers and industry experts,” assessing generative AI across 12 hazard categories, initially in English and since extended to French and Chinese. The benchmarks, MLCommons states, “help guide development, inform purchasers, and support international standards organizations, government bodies & policymakers” — the explicit claim to be a measuring stick for buyers and regulators alike.

Funding & Backers

MLCommons is a 501(c)(6) open engineering consortium funded through its membership, reporting 125+ members and affiliates spanning startups, large companies, academic institutions, and non-profits. Unlike the Frontier Model Forum’s six-firm fee model, MLCommons’s funding is broad and federated: its entrenchment comes not from a small set of deep-pocketed backers but from the sheer number of organizations that have standardized on its benchmarks and therefore have a stake in keeping them the standard. The members fund the ruler; the ruler’s ubiquity then locks the members in. That is a different and arguably more durable form of funding armor than a concentrated purse.

Institutional Voice & Intent

MLCommons speaks in the neutral-engineering “we just measure” register — the deliberate opposite of the Frontier Model Forum’s statesmanlike tone. Where the FMF wants to be understood as a responsible governor, MLCommons wants to be understood as a thermometer: it does not tell you what temperature is right, it just reports the number. The framing is relentlessly technical and collaborative — “open collaboration,” “open industry-standard benchmarks,” work done with “a global group of industry leaders, practitioners, researchers, and civil society experts.” Even the safety work is dressed as instrumentation: AILuminate exists to “help guide development, inform purchasers, and support international standards organizations, government bodies & policymakers.” The voice’s whole power is its apparent absence of a position. A benchmark, in this register, is not an argument; it is a fact.

Stated intent: Accelerate AI innovation and increase its positive impact. Provide open, neutral, industry-standard measurement of AI quality, performance, and now safety. Make AI better for everyone.

Observed intent: Own the measuring stick. By defining the benchmark — which hazards count, how they’re scored, where the grade boundaries fall — MLCommons sets the terms on which “safe” and “performant” are adjudicated across the field, including by the purchasers and policymakers it explicitly names as its audience.

Gap: The stated and observed intents overlap wherever “neutral measurement” coincides with “the consortium that chooses what to measure.” There is no claim of manipulation here and none is needed: the editorial power of a benchmark is intrinsic to benchmarking, and the “we just measure” voice is precisely what naturalizes those editorial choices into facts. The record does not settle whether AILuminate is a neutral public instrument or the field’s standard-setter quietly defining what “safe enough” means — and, for an engineering consortium, it never needs to. The recurrence is the finding: the body that owns the most-cited ruler owns what the field optimizes toward. The hand is not asserted. The ruler is.

Position in the Apparatus

MLCommons sits at the measurement layer of the apparatus, adjacent to, but distinct from, the evaluators (METR, Apollo) and the convening bodies (the Frontier Model Forum). It does not grade individual labs the way an evaluator does; it builds the standard benchmarks against which labs grade themselves and each other, and which purchasers and policymakers then cite. Its membership overlaps heavily with the same large labs that constitute the FMF (Google, where President Peter Mattson is a senior staff engineer, is a member of both), but its base of 125+ members is far broader, reaching academia, startups, and hardware vendors. It explicitly positions AILuminate as input to “international standards organizations, government bodies & policymakers,” placing it upstream of formal AI standards. The adjacencies are real and the funder/member overlaps recur; no coordination is asserted. MLCommons is a lawful open consortium and its membership is public.

Actions & Leadership Choices

The PR is “we just measure.” The deeds are subtler than capture and sharper than neutrality: the body that owns the ruler is governed by, and builds its safety ruler with, the firms the ruler measures.

Actual founding purpose. MLCommons was not founded to police anyone. It was founded — as MLPerf, in 2018, by engineers from Baidu, Google, and a clutch of universities — to give the machine-learning industry a shared, agreed measuring stick so that vendors could compare hardware and models on common ground. The purpose was standardization in the service of an industry that wanted comparability, and it succeeded so thoroughly that MLPerf became the default. AILuminate (2024) extends the same instrument from speed to safety. The purpose is genuinely a commons — but a commons whose membership is the industry it measures.

The graded helped design the grade. AILuminate v1.0, which MLCommons calls “the first AI safety benchmark designed collaboratively by AI researchers and industry experts,” was built by its AI Risk and Reliability working group — which by MLCommons’s own announcement included “technical experts from Google, Intel, NVIDIA, Meta, Microsoft, Qualcomm Technologies, Inc., and other industry giants” alongside academics (Stanford, Columbia, TU Eindhoven) and civil-society representatives. The firms whose models are scored on the twelve hazard categories sat in the room that chose the twelve hazard categories. This is not concealed — it is the consortium’s stated method — and it is the structural fact: the ruler’s markings were drawn with the measured holding a pen.

It published grades, which is the part that costs something, and it graded gently. Unlike a private auditor, MLCommons put a public 5-point scale (Poor → Excellent) on named, widely used LLMs. Publishing any grade on a member’s product is a real act with reputational stakes. But the scale tops out at “Excellent” for low violation rates and the methodology was co-authored by the vendors, so the test is one the industry helped make passable. The value (public accountability) was exercised; the teeth were filed by the co-design.

The lock-in is the armor, and it is self-reinforcing. MLCommons’s durability comes not from a deep-pocketed funder but from 125+ members standardizing on its benchmarks and thereby acquiring a stake in keeping them the standard. The members fund the ruler; the ruler’s ubiquity locks the members in. That is a more durable moat than money, and it means the standard-setter answers to the standardized.

Leadership choices. Peter Mattson, Founder and President, is a senior staff engineer at Google, a member firm, and runs the consortium that benchmarks Google’s products among others. David Kanter, also credited as a founder, heads MLPerf; Rebecca Weiss serves as Executive Director. The board is composed of representatives drawn from the member community: its founding board seated Google, Facebook AI (Meta), NVIDIA, Alibaba, and a Harvard professor. The governance is the membership; the President is an employee of one of the largest members. There is academic ballast on the board, but no independent regulator’s seat: the consortium is governed by the firms it measures.

CONDUCT: CONFLICTED — THE GRADED HELP DESIGN THE GRADE. MLCommons does real, open, broadly-membered standardization work, with academics and civil society genuinely in the room — it is not the six-firm moat the Frontier Model Forum is. But the safety ruler it holds was co-designed by the firms it scores, it is governed by a board of those firms, and its President draws a paycheck from one of them. The neutrality is the costume; the conflict is the construction.

Reach Assessment

Institutional: Maximum within the measurement layer. MLPerf is the field’s default performance ruler and AILuminate is making the same bid for safety. A benchmark adopted by 125+ organizations and cited to purchasers and regulators is an institution whose reach is measured in everything that gets compared against it.

Memetic: High. A number is the most portable frame there is. “Scored ‘Good’ on AILuminate” travels into procurement decisions, press releases, and policy documents with none of the friction an argument carries, which is exactly the power of defining the scale on which the number is produced.

Civilizational: High. MLCommons is upstream of deployment in the most literal sense: it defines what the field measures, and what gets measured gets optimized. It builds no model and writes no rule, but the ruler it holds shapes what every model is built to score well on. The hazard is the benchmark, not any engineer who maintains it.

Sources: About Us — MLCommons; AILuminate — MLCommons; MLCommons Launches AILuminate, First-of-Its-Kind Benchmark to Measure the Safety of Large Language Models — MLCommons; MLCommons’ Peter Mattson on the New AILuminate AI Safety Benchmark — TechArena; MLCommons Launches AILuminate v1.0 — MLCommons, Dec 4 2024; Leadership — MLCommons.

ATK 9 ACCELERATION

DEF 8 PROTECTION

HP 9 RESILIENCE

OLYMPUS RISK INTELLIGENCE PROTOCOL does not exist. It was assembled in a GitHub issue thread in October 2023 by engineers who had read the extinction risk letter and wanted to understand who specifically had signed a document saying AI might kill everyone and then continued working on AI. These dossiers are satire. The biographical facts cited are sourced from published reporting, public statements, academic papers, and court records. The psychometric scores are not clinical assessments. No part of this constitutes professional psychological evaluation or diagnosis. Do not use these dossiers to make decisions about anything.
