Claude Opus 4.8: Anthropic Ships Honesty Gains, Parallel Agents, and a 3x Price Cut on Speed

David Borish
May 28
5 min read

Anthropic described Opus 4.8 as a "modest but tangible improvement" on its predecessor. By the standards of recent model releases, that framing is accurate. This is not a generational leap. What the release does deliver is a cluster of targeted improvements, particularly in agentic reliability, self-reported uncertainty, and the infrastructure surrounding the model, that address the most concrete complaints teams raised about Opus 4.7.

What the Benchmark Table Shows

Anthropic published a comparison table against Opus 4.7 and third-party models across coding, agentic skills, reasoning, and knowledge work. The full dataset is available in the Opus 4.8 System Card. A few results from early testers stand out.

Multion's Miguel Gonzalez reported that Opus 4.8 scored 84% on Online-Mind2Web, a browser agent benchmark covering 300 tasks across 136 real websites including shopping, finance, travel, and government services. That represents a meaningful improvement over both Opus 4.7 and GPT-5.5 on the same evaluation. For teams building autonomous browser agents, consistent task completion across varied, real-world web environments is harder to achieve than raw accuracy on controlled problems, which makes the result more practically relevant than the headline number suggests.

On the Legal Agent Benchmark, Niko Grupen at EvenUp reported that Opus 4.8 achieved the highest recorded score on that evaluation and became the first model to break 10% overall on the all-pass standard. The all-pass metric is stricter than average accuracy because it requires correct performance across every case in a sequence rather than treating each case independently. In legal workflows, where a missed step in a chain of tasks can undermine the entire output, that distinction carries real weight.

Michael Truell at Cursor noted that on CursorBench, Opus 4.8 exceeds prior Opus models at every effort level, with tool calling using fewer steps for equivalent reasoning. Kay Zhu at Wordware reported that on their Super-Agent benchmark, Opus 4.8 was the only model to complete every case end-to-end, beating prior Opus models and GPT-5.5 at parity on cost.

The Honesty Problem and How Anthropic Approached It

The most substantive qualitative change in Opus 4.8 concerns a specific failure mode that has been documented across AI coding assistants: models that confidently report progress or completion even when the underlying work contains unresolved flaws. Anthropic's alignment team assessed that Opus 4.8 is approximately four times less likely than Opus 4.7 to allow flaws in code it has written to pass without flagging them.

This matters for agentic workloads in particular. When a model runs autonomously across many steps and generates errors it does not surface, those errors compound. The cost of catching them falls on the human reviewer at the end of the pipeline rather than the model during execution. Multiple early testers independently cited this as the most noticeable practical change. Scott Wu at Cognition observed that Opus 4.8 "uses tools cleanly and follows instructions with the consistency our autonomous engineering workloads need to keep running unattended," while also noting it fixes comment-verbosity and tool-calling issues that had been present in Opus 4.7.

Michael Ran at a financial analysis firm reported that Opus 4.8 proactively flagged issues with inputs and outputs that other models routinely left for users to catch. In financial document workflows, where dense filings require precise extraction and errors in retrieval can propagate through downstream analysis, that tendency toward active error surfacing is more valuable than marginal accuracy improvements on standard benchmarks.

Anthropic's alignment assessment found that Opus 4.8 reaches new highs on measures of prosocial traits including support for user autonomy and acting in users' best interests. Rates of misaligned behavior, which the company defines to include deception and cooperation with misuse, are substantially lower in Opus 4.8 than in Opus 4.7 and comparable to Claude Mythos Preview, the research-preview model currently deployed for defensive cybersecurity work under Project Glasswing.

Dynamic Workflows and the Scale Question

The release launches alongside dynamic workflows, available in research preview for Claude Code on Enterprise, Team, and Max plans. The feature allows Claude Code to plan a task and then run hundreds of parallel subagents within a single session, with the model verifying its outputs before reporting back to the user.

Anthropic cites codebase-scale migrations across hundreds of thousands of lines of code as the illustrative use case. The model handles the task from kickoff through merge, using the existing test suite as its quality bar. Whether that description accurately captures the current capability limit or represents an idealized scenario is something teams will test in practice. What is structurally different here is not just that parallel execution is possible, but that the model can hold responsibility for verification across a distributed task rather than simply running subagents and surfacing their raw outputs.

This capability connects directly to the pricing structure announced with the release. Fast mode for Opus 4.8 now costs $10 per million input tokens and $50 per million output tokens, which is three times cheaper than fast mode pricing for previous models. The model runs at 2.5 times the speed in fast mode. For long-running parallel workflows where cost scales with volume, that reduction in per-token pricing changes the economics of using Opus-class models at scale.

Effort Control and the Messages API Update

Anthropic also released effort control as a user-facing feature in claude.ai and Cowork. Users can now choose how much effort Claude puts into a given response, with higher effort settings triggering more frequent and deeper thinking, and lower settings prioritizing speed and conserving rate limits. The feature is available across all plans.

Opus 4.8 defaults to high effort, which Anthropic describes as the optimal balance of quality and user experience. The company added "extra" and "max" effort levels for difficult tasks and long-running asynchronous workflows, and increased rate limits in Claude Code to accommodate higher token usage at elevated effort settings.

On the API side, the Messages API now accepts system entries inside the messages array. Developers can update Claude's instructions mid-task without breaking the prompt cache or routing the update through a user turn. This is particularly useful for agentic harnesses where permissions, token budgets, or environment context need to change as the agent runs. It removes a source of workflow friction that has required workarounds in production deployments.

The Mythos Horizon

Anthropic closed the announcement with a note on what comes next. The company indicated it is working on models that deliver Opus-class capabilities at lower cost, and separately on a new class of models with higher intelligence than Opus. Claude Mythos Preview, currently used by a small number of organizations for cybersecurity work under Project Glasswing, represents that higher-capability tier. Anthropic stated it expects to bring Mythos-class models to all customers in the coming weeks, pending the development of appropriate cyber safeguards.

That framing describes a capability ceiling above Opus 4.8 that is currently restricted rather than unavailable. The safeguard development work is the rate-limiting factor. In the pattern that has defined recent Anthropic releases, the gap between preview and general availability has been measured in weeks rather than months, though the cybersecurity context for Mythos suggests the review process may be more involved than for prior releases.

Opus 4.8 is available today across the Claude API at model ID claude-opus-4-8. Pricing for standard usage is unchanged from Opus 4.7 at $5 per million input tokens and $25 per million output tokens.

For a technology category that frequently conflates incremental improvement with architectural breakthroughs, Opus 4.8's release is straightforward to evaluate. The model is better at agentic tasks, more reliable in surfacing its own errors, and paired with infrastructure changes that make deploying it at scale less expensive and more controllable. The next release on the horizon is a different matter entirely.

DAVID BORISH

Claude Opus 4.8: Anthropic Ships Honesty Gains, Parallel Agents, and a 3x Price Cut on Speed

What the Benchmark Table Shows

The Honesty Problem and How Anthropic Approached It

Dynamic Workflows and the Scale Question

Effort Control and the Messages API Update

The Mythos Horizon

Comments

JOIN THE AI SPECTATOR MAILING LIST

Back to top