
Claude Mythos Just Broke METR's AI Benchmark, Further Validating the Exponential Replacement Curve


On May 8, 2026, METR added Claude Mythos Preview to its task-completion time horizon tracker and immediately posted a caveat that has not appeared on any previous update: "Measurements above 16 hours are unreliable with our current task suite." The model's estimated 50 percent time horizon landed at 16 hours or more, with a 95 percent confidence interval running from 8.5 hours to 55 hours. The spread alone tells you how thin the data is at this range. METR's suite of 228 tasks includes only five estimated at 16 hours or longer for human experts. The benchmark was not designed for models this capable.


This is a significant moment for reasons that go beyond a single model's score. The time horizon metric, which METR introduced in March 2025, measures the length of a software task (calibrated by how long a human expert takes to complete it) that an AI agent can complete with a given reliability. It has become the most watched capability benchmark in AI, updated every time a new frontier model ships. GPT-4o, released in mid-2024, had a 50 percent time horizon of roughly 7 minutes. Claude 3.7 Sonnet reached about 2 hours. By late 2025, Claude Opus 4.5 and GPT-5 were clustering around 5 to 6 hours. Now Mythos Preview has pushed past what the measurement infrastructure can reliably capture.
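As a rough sketch of how the metric works (all task durations and pass/fail outcomes below are invented, and the two-parameter logistic fit is a simplification of METR's published methodology): fit success probability against the log of human completion time, then solve for the task length at which predicted success drops to 50 percent.

```python
import numpy as np

# Hypothetical data: human completion times (minutes) for ten tasks,
# and whether the agent solved each one (1 = success). Invented numbers.
human_minutes = np.array([2, 4, 8, 15, 30, 60, 120, 240, 480, 960], float)
successes = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0], float)

x = np.log2(human_minutes)
a, b = 0.0, 0.0  # slope and intercept of the logistic model

# Plain gradient descent on the logistic log-loss (convex, so this converges).
for _ in range(20000):
    p = 1.0 / (1.0 + np.exp(-(a * x + b)))
    a -= 0.05 * np.mean((p - successes) * x)
    b -= 0.05 * np.mean(p - successes)

# Predicted success is 50% where a * log2(t) + b = 0, i.e. t = 2 ** (-b / a).
t50_minutes = 2.0 ** (-b / a)
print(f"estimated 50% time horizon: {t50_minutes:.0f} human-minutes")
```

The fitted 50 percent horizon lands between the longest task the agent reliably passes and the shortest it reliably fails, which is why the metric needs a healthy supply of tasks on both sides of the crossover.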


The Acceleration Is the Story


The headline number matters less than the rate of change behind it. METR's original March 2025 paper found that frontier time horizons were doubling approximately every 7 months across the full 2019 to 2025 period. When METR released its updated methodology (Time Horizon 1.1) in January 2026, the post-2023 data told a different story. The doubling time had compressed to 130.8 days, or about 4.3 months. One independent analysis of the 2024 to 2025 data pegged it even faster, at roughly 105 days.
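The arithmetic behind these figures is worth making explicit. Under exponential growth the horizon follows h(t) = h0 * 2 ** (t / T), so two measurements pin down the doubling time T, and a known T tells you how long a given jump takes. A minimal sketch, using the ~130.8-day figure and treating the late-2025 cluster (~6 hours) as the starting point:

```python
import math

def doubling_time_days(h0: float, h1: float, elapsed_days: float) -> float:
    """Doubling time implied by two horizon measurements,
    assuming exponential growth h(t) = h0 * 2 ** (t / T)."""
    return elapsed_days / math.log2(h1 / h0)

def days_for_jump(h0: float, h1: float, t_days: float) -> float:
    """Calendar days to grow from horizon h0 to h1 at doubling time t_days."""
    return t_days * math.log2(h1 / h0)

# At the post-2023 doubling time of ~130.8 days, growing from the
# late-2025 cluster (~6 hours) to Mythos Preview's ~16 hours takes:
days_to_16h = days_for_jump(6, 16, 130.8)
print(f"{days_to_16h:.0f} days")  # roughly six months
```

That is consistent with the roughly six months between the late-2025 cluster and the May 2026 Mythos measurement.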


This acceleration is exactly what my Exponential Replacement Curve (ERC) framework projects. The ERC's core mechanism is a capability doubling rate that compounds across improvements in algorithms, chips, and compute. When the doubling time shortens, it does not produce a linear speedup in the displacement timeline. It compresses the gaps between waves. A shift from 7-month to 4.3-month doubling does not simply move timelines forward by 40 percent. It collapses the space between the first wave (administrative, customer service, content creation) and the second wave (sales, marketing, finance, software development) and pulls the third wave (research, strategic planning, creative direction) substantially closer.
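To see why the gaps compress, note that under h(t) = h0 * 2 ** (t / T) the calendar gap between two capability thresholds is T * log2(threshold ratio), so it scales directly with the doubling time. A toy calculation (the 8x horizon ratio between wave thresholds is a made-up placeholder, not a figure from the ERC paper):

```python
import math

# Hypothetical ratio between the task horizon wave two requires and
# the horizon wave one required. The calendar gap between the waves
# is then T * log2(8) = 3 doubling times.
capability_ratio = 8
gaps = {T: T * math.log2(capability_ratio) for T in (7.0, 4.3)}
for T, gap in gaps.items():
    print(f"doubling every {T} months -> {gap:.1f} months between waves")
```

Cutting the doubling time from 7 to 4.3 months shrinks every inter-wave gap by the same ~39 percent, and because later waves sit several gaps out, their arrival dates move forward by more in absolute calendar terms.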


What Mythos Preview Actually Demonstrates


The Mozilla Firefox team offered what may be the most concrete real-world signal so far. Using Mythos Preview, they fixed 423 security bugs in April 2026, compared to a prior monthly average of 17 to 31. The model identified vulnerabilities requiring multi-component reasoning across large codebases, including a 20-year-old XSLT bug and a race condition that could enable sandbox escape.


The UK AI Security Institute found that Mythos can autonomously execute a complete corporate network takeover, succeeding in 30 percent of attempts on a complex attack range that AISI estimates would require roughly 20 hours for a human expert.

These are tasks that require sustained, multi-step reasoning across large bodies of code and system architecture. They are precisely the kind of work that Karpathy's AI exposure treemap scores at 8 to 9, and that the Exponential Replacement Curve identifies as second-wave displacement territory. The difference between where these tasks sat on the theoretical timeline six months ago and where they sit now reflects the compressed doubling time that METR's data confirms.


The Gap Between Benchmark and Reality Persists


METR is careful to note what its time horizon metric does and does not measure. A 16-hour time horizon does not mean Mythos can autonomously perform 16 hours of real-world work. The tasks in METR's suite are self-contained, well-specified, and algorithmically scored. Real-world work involves stakeholder communication, tacit organizational knowledge, and success criteria that resist automatic evaluation. METR's own research found that when experienced open-source developers used early-2025 AI tools on their own repositories, they were 19 percent slower than without AI assistance.


This gap between benchmark performance and real-world productivity is consistent with the Anthropic Economic Index finding that computer and math occupations have 94.3 percent theoretical AI coverage but only 35.8 percent observed coverage. The distance between what AI can do on clean tasks and what organizations actually deploy it for remains substantial.


The Exponential Replacement Curve framework accounts for this gap by distinguishing between technical capability thresholds and actual displacement timelines. The capability threshold is what METR measures. The displacement timeline depends on organizational adoption speed, integration costs, regulatory friction, and the degree to which tasks can be cleanly specified. But the METR data shows the capability side of that equation is moving faster than the framework's original projections assumed.


The Evaluation Infrastructure Problem


Perhaps the most significant finding from the Mythos evaluation is not about the model at all. METR has acknowledged that its current task suite was not designed for models at this capability level. Five tasks above 16 hours is not enough to draw a reliable logistic curve. The organization says it can still distinguish Mythos from publicly available models, but precise quantification at this range requires new tasks, new human baselines, and substantially more evaluation capacity.
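A toy simulation shows why so few long tasks leave the curve underdetermined (the 50 percent true success rate below is an arbitrary assumption): with only five pass/fail outcomes in the 16-hour-plus range, the estimated success rate there can swing across almost its entire range from sample to sample.

```python
import numpy as np

rng = np.random.default_rng(42)
true_p, n_long_tasks = 0.5, 5  # assumed success rate, tasks above 16 hours

# Resample the five binary outcomes many times and look at the spread
# of the estimated success rate in that duration range.
estimates = rng.binomial(n_long_tasks, true_p, size=10_000) / n_long_tasks
lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"95% of estimates fall between {lo:.0%} and {hi:.0%}")
```

With estimates that noisy at the top of the duration range, the fitted 50 percent horizon inherits the kind of wide confidence interval METR reported for Mythos Preview.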


This is a pattern that recurs across AI capability measurement. Benchmarks saturate. The instruments designed to measure progress become inadequate before the progress stops. METR is already working on expanding its task suite, but the speed at which frontier models are advancing means the evaluation infrastructure is perpetually catching up.


For the Exponential Replacement Curve, benchmark saturation is itself a data point. When the leading capability benchmark runs out of road, it suggests that the next capability threshold is being crossed faster than the measurement tools can track. The ERC's wave model projects displacement pressure building in advance of widespread organizational awareness, and evaluation infrastructure lagging behind capability is one mechanism by which that dynamic plays out.


What the Compressed Doubling Time Means for Labor Markets


The Stanford/ADP data already showed that employment for workers aged 22 to 25 in AI-exposed occupations fell 6 percent between late 2022 and July 2025, with young software developers down 20 percent from their late 2022 peak. The Dallas Federal Reserve found this was driven by reduced hiring rather than layoffs. The on-ramp into knowledge work has been narrowing for three years.


A compressed doubling time means the capability frontier is reaching further into higher-complexity work faster than previous projections anticipated. The Exponential Replacement Curve's second wave, covering roles that require multi-step reasoning, sustained context, and domain expertise, was projected based on the 7-month doubling time from METR's original paper. At 4.3 months, the timeline for second-wave displacement pressure compresses significantly.


The BLS still projects software developer employment growing 17.9 percent through 2033. But that projection was made before a model could fix 423 security bugs in a single month, against a human-team average of 17 to 31, and before the leading capability benchmark had to flag its own measurements as unreliable because the model exceeded the suite's design ceiling.


Three Takeaways


The METR data confirms that the exponential trend in AI task-completion capability has not slowed. It has accelerated. The post-2023 doubling time of 4.3 months is faster than what any of the major AI labor market analyses, including the original Exponential Replacement Curve, used as their baseline projection.


METR's evaluation infrastructure hitting its ceiling is itself evidence of the pace of change. When the measurement tools designed to track exponential progress can no longer keep up, the rate of change has outrun the institutions built to monitor it.


The practical implication is that the gap between the Exponential Replacement Curve's first and second waves is narrower than the original framework projected. The capability threshold for multi-hour, multi-step autonomous technical work is being crossed now, while the labor market is still processing the early signals from wave one. How quickly organizations close the gap between theoretical and observed AI adoption will determine whether the displacement pattern looks like a sequence of distinct waves or a single accelerating curve.

© 2026 by David Borish IP, LLC, All Rights Reserved