How a Tokyo AI Lab Reached Frontier Performance Without Building a Frontier Model

David Borish
3 days ago
6 min read

sakana ai launches FUGU — How a Tokyo AI Lab Reached Frontier Performance Without Building a Frontier Model

Sakana AI has spent two years arguing that the most powerful AI systems will come from collaboration between models rather than from scaling any single one. On June 22, 2026, that argument became a product.

Fugu and Fugu Ultra, released today out of Sakana's Tokyo headquarters, present a multi-agent orchestration system through a single OpenAI-compatible API endpoint. Send a request, and Fugu decides how to handle it: solving directly when that is sufficient, or assembling and coordinating a team of specialized models when the task demands more. The routing, delegation, verification, and synthesis happen internally. To the developer, it looks like one model.

The benchmark headline is that Fugu Ultra performs at or near the level of Anthropic's Fable 5 and Mythos Preview across a suite of coding, reasoning, and scientific benchmarks. Those two Anthropic models are not in Fugu's agent pool. They are not publicly accessible. Fugu reaches that performance level using only models it can actually reach.

What Fugu Is, and What It Is Not

Fugu is not a wrapper or a fixed routing table built on if/else logic. Sakana describes it as a language model trained specifically to orchestrate other models: learning when to delegate, how agents should communicate with each other, and how to combine their outputs into a coherent final answer. The distinction matters because fixed workflows break when the task is unfamiliar. A trained orchestrator can adapt.

The technical foundation comes from two papers Sakana published at ICLR 2026. The first, TRINITY, describes a coordinator that assigns models to roles across a multi-turn task: Thinker, Worker, and Verifier. The second, Conductor, uses reinforcement learning to discover natural-language coordination strategies, training the system to figure out how to prompt and route agents for maximum performance rather than having engineers design those workflows by hand. The research framing is important because orchestration is a word that has been applied to a lot of things, many of them just prompt chaining.

Fugu also supports recursive self-calls. Fugu can call instances of itself as part of its agent pool, which extends the coordination depth for long, multi-step tasks without requiring a separate scaffolding layer.

The two product variants target different workloads. Fugu balances performance with low latency and fits naturally into interactive tools like coding assistants, code review pipelines, and chatbots. Teams with privacy or compliance constraints can exclude specific agents from the pool. Fugu Ultra is optimized for maximum answer quality on hard, multi-step problems: AI research, paper reproduction, cybersecurity analysis, patent and literature investigation. Early users running Fugu Ultra in an almost fully automated research mode saw it sustain meaningful progress across open-ended problems with little human intervention.

The Benchmark Results, Read Carefully

Sakana published performance figures across eleven benchmarks covering coding, reasoning, science, and agentic tasks, comparing Fugu and Fugu Ultra against publicly accessible frontier models — Opus 4.8, Gemini 3.1 Pro, and GPT-5.5 — plus Anthropic's limited-access Fable 5 and Mythos Preview.

Before any number: these are Sakana's own results. Baseline scores for competing models use provider-reported figures, which means the comparison is not apples-to-apples on harness settings or scaffolding choices. The SWE-Bench Pro scores used mini-swe-agent as scaffolding. Read these as a vendor self-report pending independent replication.

With that in mind, the specific figures tell a more nuanced story than the "shoulder-to-shoulder" framing in Sakana's announcement suggests. On SWE-Bench Pro, Fugu Ultra scores 73.7, ahead of Opus 4.8 at 69.2, GPT-5.5 at 58.6, and Gemini 3.1 Pro at 54.2. Fugu Ultra trails Fable 5 on that same benchmark, the very model it cannot include in its pool. On LiveCodeBench, Fugu scores 92.9 and Fugu Ultra 93.2, both ahead of Gemini 3.1 Pro's 88.5. On GPQA-Diamond, Fugu Ultra reaches 95.1.

The practical read: Fugu Ultra leads the publicly accessible frontier on most of the benchmarks Sakana published, and sits close to but sometimes below Anthropic's restricted models. For organizations that cannot access Fable 5 or Mythos Preview due to export controls or procurement constraints, that gap is academic. The meaningful comparison is against what they can actually use.

Sakana's application benchmarks add a different kind of evidence. Comparing Fugu Ultra against Gemini 3.1 Pro (high), Opus 4.8 (max), and GPT-5.5 (xhigh) across tasks including automated research, mechanical design, Japanese handwriting analysis, financial time series prediction, and one-shot chess, Fugu models consistently outperformed all three baselines. These are Sakana-run comparisons, not independent evaluations.

What Beta Users Found

Close to 500 early users put Fugu through real workflows during the beta program. Three patterns came up repeatedly: code review, security assessment, and automated research.

One software engineer reported that Fugu Ultra surfaced more than twenty issues during code review where other tools flagged about three.

A cybersecurity engineer ran a complete security assessment from a single scoped instruction, covering reconnaissance, XSS and SQL injection checks, authentication review, and a final report with evidence and retest steps, staying within the defined scope throughout. An executive at an enterprise platform company noted that Fugu showed unusually strong persona stability across long sessions, holding its character in places where other models drift, which they flagged as potentially more important for agent products than raw benchmark scores.

The beta feedback reinforced the product's intended use case. Multi-agent orchestration delivers the most value on tasks that are long, messy, and resistant to single-model calls: problems where progress means reading, implementing, testing, comparing evidence, identifying gaps, and producing a synthesized result across many steps.

Pricing and Availability

Fugu Ultra is priced at $5 per million input tokens and $30 per million output tokens, with rates doubling for contexts above 272,000 tokens. Subscription tiers are available at $20, $100, and $200 per month, covering both Fugu and Fugu Ultra. Subscribers who sign up before the end of July 2026 receive a free second month at their initial tier. Pricing for multi-agent calls is charged at the rate of the top-tier model involved, not stacked per agent, which addresses one of the common objections to orchestration-based systems.

One notable constraint: the API is not currently available in the EU or EEA. Sakana is working toward GDPR compliance, but the gap immediately limits the European rollout. For organizations operating under EU AI Act compliance pressure, and particularly for those whose interest in self-hosted or sovereign AI has grown with that regulatory context, the timing is inconvenient.

The Geopolitical Argument

Sakana uses the Fugu launch to make a claim that goes beyond benchmarks. The company argues that single-vendor dependency for critical AI infrastructure is a material operational risk, not a theoretical one. As evidence, it points directly to the export controls recently imposed on Anthropic's Fable and Mythos models, which restricted access overnight for some international organizations.

The argument is that orchestration systems with swappable agent pools provide a practical hedge against this concentration. If one provider restricts access, the system routes around it. Because the underlying pool can be updated as new models become available, the orchestration layer maintains consistent performance even as individual models enter or leave the pool.

This framing connects to a broader pattern in enterprise AI adoption. Organizations that have built workflows on single-model APIs have discovered that capability updates, pricing changes, and now access restrictions can disrupt those workflows without warning. An orchestration layer that abstracts the underlying model pool offers a form of continuity that no individual model API can guarantee. Whether that abstraction justifies the additional cost and complexity is a calculation each organization has to make for itself, and Sakana's own API is not yet available in key regulatory jurisdictions, which creates its own dependency questions.

Sakana plans to expand the agent pool to include open-weight models and its own models in the coming months, which would extend the sovereign AI narrative further. Open models that run on-premises reduce the international access risk considerably, though they introduce their own infrastructure requirements.

The Underlying Bet

The Fugu launch is a concrete bet that the next meaningful performance gains in AI come from systems that know how to coordinate models rather than from models trained to do everything themselves. Sakana's two ICLR 2026 papers suggest the coordination can be learned systematically rather than hand-engineered, which is the part of the claim most worth watching as independent evaluations accumulate.

The benchmarks are promising and should be treated as a starting point. Vendor-run results under ideal conditions tell you the ceiling, not the median. The more useful evidence will come from independent evaluations and from organizations running Fugu on representative slices of their own workloads. What the beta data and the technical foundation both suggest is that the ceiling is high enough to take seriously.

DAVID BORISH