Track
The AI landscape in mid-2026 is crowded with monolithic frontier models competing on raw benchmark scores. Sakana AI is taking a different bet: instead of training a bigger base model, it has shipped Fugu, a system that orchestrates a pool of existing frontier models behind a single OpenAI-compatible API.
The headline claim is that Sakana Fugu Ultra matches Anthropic's Fable 5 and Mythos Preview on engineering, scientific, and reasoning benchmarks, without requiring access to either of those export-controlled models.
Sakana Fugu launched on June 22, 2026, in two variants: Fugu for everyday latency-sensitive work, and Fugu Ultra for complex, multi-step tasks. On SWE-Bench Pro, Fugu Ultra scores 73.7%, ahead of Claude Opus 4.8 (69.2%), GPT-5.5 (58.6%), and Gemini 3.1 Pro (54.2%).
All benchmark numbers are Sakana-reported and have not yet been independently reproduced by third-party labs, which is worth keeping in mind as you read.
In this article, I'll cover what Fugu actually is, walk through its key features, look at the full benchmark table, and look at some use cases. You can also see our comparison of GPT-5.5 vs Gemini 3.1 Pro for context on the models Fugu is competing against.
What is Sakana Fugu?
Sakana Fugu is a language model trained to act as an orchestrator: it receives a request, decides whether to handle it directly or delegate to specialist models in its agent pool, manages verification and synthesis, and returns a single response. From the outside, you call one endpoint. On the inside, a coordinated set of models does the work.
Sakana AI has been building toward this architecture since its founding. The lab, started by Llion Jones (a co-author of the original "Attention Is All You Need" paper) and David Ha, has long argued that coordinated model ecosystems outperform isolated monoliths on hard, long-running tasks.
Fugu is the productized version of that thesis, grounded in two ICLR 2026 papers: TRINITY (an evolved LLM coordinator) and Conductor (learning to orchestrate agents in natural language).
At the time of writing, Anthropic's Fable 5 and Mythos Preview are not publicly accessible due to export controls, which means they cannot be included in Fugu's agent pool.
Sakana's argument is that routing across a swappable pool of models provides a practical hedge against exactly this kind of access disruption. Whether that constitutes genuine "AI sovereignty" is debatable, but the operational logic is sound: if one provider restricts access, Fugu routes around it.
What's New With Sakana Fugu?
Fugu introduces several ideas that distinguish it from both conventional frontier models and traditional multi-agent frameworks. I've covered the key ones below.
Learned orchestration, not fixed pipelines
Most multi-agent systems require developers to define which model handles which task. Fugu learns this. The orchestrator model is trained to decide when to delegate, how agents should communicate, and how to combine their outputs into a single answer. This is the core contribution from the TRINITY and Conductor research I linked to above.
In practice, this means the complexity of a multi-agent system never reaches your code. You send a request to one endpoint, and Fugu handles model selection, delegation, verification, and synthesis internally. For teams that have tried to build their own orchestration layers, the appeal is obvious: you get the coordination gains without maintaining the harness.
One caveat worth flagging: the routing decisions are proprietary and not exposed to users. You cannot see which underlying model handled a given request. For compliance-sensitive work where you need to audit the reasoning chain, this opacity is a real limitation.
Swappable agent pool
The models in Fugu's pool are not fixed. When a new frontier model becomes publicly available, Sakana expects to spend roughly two weeks training and evaluating updated Fugu models before rolling them out. This means Fugu's performance should improve as the underlying ecosystem improves, without requiring users to change their integration.
For the standard Fugu model, users can also opt specific agents out of the pool from the console settings, which matters for teams with data privacy or compliance requirements. Fugu Ultra's pool is fixed because it relies on the full set of agents to deliver its benchmark performance.
Two-tier model design
Fugu and Fugu Ultra serve different workloads. Fugu balances quality with low latency, making it a reasonable default for coding, code review, and interactive services. Fugu Ultra coordinates a deeper pool of expert agents and is tuned for maximum answer quality on hard, multi-step problems.
Early users reached for Fugu Ultra on tasks like paper reproduction, cybersecurity analysis, Kaggle-style data science, and patent investigations.
One software engineer in the beta reported that Fugu Ultra surfaced more than twenty code review issues, whereas other tools flagged about three.
A cybersecurity engineer reported that a single scoped instruction drove a full security assessment end-to-end, including a clean report with evidence and retest steps.
I think it's worth noting here that the latency trade-off is real. The multi-agent routing and synthesis add overhead. For simple queries or tight latency requirements, a direct call to a single frontier model will likely be faster and cheaper.
OpenAI-compatible API
Both Fugu and Fugu Ultra are accessible through a single OpenAI-compatible API. No SDK migration is required.
You point your existing client at the Fugu endpoint with your API key and start sending requests. This is a deliberate choice to lower the switching cost for teams already using GPT-5.5 or Claude Opus 4.8 via the OpenAI client library.
Sakana Fugu Benchmarks
According to Sakana's comparison table, Fugu Ultra tops 10 of 11 benchmarks, with GPT-5.5 winning only MRCRv2. All scores, except Fugu's, are provider-reported. Fable 5 and Mythos Preview are excluded from the comparison because they are not publicly accessible and therefore cannot be in Fugu's agent pool.
Independent reproductions of these numbers have not yet appeared on third-party leaderboards as far as I've seen at the time of writing.
| Benchmark | Fugu | Fugu Ultra | Opus 4.8 † | Gemini 3.1 Pro † | GPT 5.5 † |
|---|---|---|---|---|---|
| SWE Bench Pro * | 59.0 | 73.7 | 69.2 | 54.2 | 58.6 |
| TerminalBench 2.1 | 80.2 | 82.1 | 74.6 | 70.3 | 78.2 |
| LiveCodeBench | 92.9 | 93.2 | 87.8 | 88.5 | 85.3 |
| LiveCodeBench Pro | 87.8 | 90.8 | 84.8 | 82.9 | 88.4 |
| Humanity's Last Exam | 47.2 | 50.0 | 49.8 | 44.4 | 41.4 |
| CharXiv Reasoning | 85.1 | 86.6 | 84.2 | 83.3 | 84.1 |
| GPQA-D | 95.5 | 95.5 | 92.0 | 94.3 | 93.6 |
| SciCode | 60.1 | 58.7 | 53.5 | 58.9 | 56.1 |
| τ3 Banking | 21.7 | 20.6 | 20.6 | 8.4 | 20.6 |
| Long Context Reasoning | 74.7 | 73.3 | 67.7 | 72.7 | 74.3 |
| MRCRv2 | 86.6 | 93.6 | 87.9 | 84.9 | 94.8 |
SWE-Bench Pro
Fugu Ultra scores 73.7% on SWE-Bench Pro, ahead of Claude Opus 4.8 at 69.2%, GPT-5.5 at 58.6%, and Gemini 3.1 Pro at 54.2%. In our coverage of GPT-5.5, we noted it scored 58.6% on this benchmark, which Fugu Ultra clears by 15 points.
SWE-Bench Pro tests whether a model can resolve real GitHub issues in software repositories, making it one of the more practically grounded coding benchmarks available. The gap between Fugu Ultra and the next-best publicly accessible model is large enough to matter for teams doing serious code review or bug-fixing work.
LiveCodeBench and LiveCodeBench Pro
Fugu Ultra scores 93.2% on LiveCodeBench and 90.8% on LiveCodeBench Pro. The next-best scores are Gemini 3.1 Pro at 88.5% and Claude Opus 4.8 at 87.8% on the standard version.
LiveCodeBench tests competitive programming problems drawn from recent contests, which means the training data contamination risk is lower than on older coding benchmarks. The Pro variant uses harder problems. Fugu's lead here is consistent with the beta reports of strong code review performance.
Humanity's Last Exam
Fugu Ultra scores 50.0% on Humanity's Last Exam, compared to Claude Opus 4.8 at 49.8%, Gemini 3.1 Pro at 44.4%, and GPT-5.5 at 41.4%.
This benchmark tests expert-level knowledge across scientific and academic domains, with questions designed to be difficult even for domain specialists. A score of 50% on this benchmark is notable given that GPT-5.5 sits at 41.4%. The gap between Fugu Ultra and GPT-5.5 here is the largest of any benchmark in the table.
GPQA Diamond
Both Fugu and Fugu Ultra score 95.5% on GPQA Diamond, with Gemini 3.1 Pro at 94.3%, GPT-5.5 at 93.6%, and Claude Opus 4.8 at 92.0%.
GPQA Diamond tests PhD-level science questions in biology, chemistry, and physics, written by domain experts and verified to be difficult for non-specialists. The fact that standard Fugu matches Fugu Ultra here suggests the orchestration overhead on this benchmark is not adding much, which makes sense for a question-answering task that doesn't require multi-step planning.
TerminalBench 2.1
Fugu Ultra scores 82.1% on TerminalBench 2.1, ahead of GPT-5.5 at 78.2%, Claude Opus 4.8 at 74.6%, and Gemini 3.1 Pro at 70.3%. TerminalBench tests agentic coding tasks that require interacting with a terminal environment, including file manipulation, shell commands, and multi-step execution.
This is one of the benchmarks most relevant to real developer workflows, and Fugu Ultra's lead over GPT-5.5 here is consistent with the cybersecurity assessment use case reported in the beta.
MRCRv2
GPT-5.5 wins MRCRv2 with 94.8%, compared to Fugu Ultra at 93.6% and Claude Opus 4.8 at 87.9%. MRCRv2 tests long-context recall, specifically whether a model can retrieve specific information from very long documents.
This is the one benchmark where Fugu Ultra does not lead, which is worth noting for teams whose primary use case involves retrieving specific facts from large documents rather than multi-step reasoning or code generation.
Sakana Fugu Pricing and Availability
Fugu is available now through Sakana's console at console.sakana.ai, with an OpenAI-compatible API. Both Fugu and Fugu Ultra are included in every plan. The subscription tiers are:
- Standard: $20/month, baseline usage, suited for occasional API calls and personal experiments
- Pro: $100/month, 10x the Standard usage, suited for regular coding and research sessions
- Max: $200/month, 30x the Standard usage, suited for heavy, long-running workloads
The pay-as-you-go plan bills by token usage rather than a monthly allowance. For Fugu Ultra (model ID: fugu-ultra-20260615), pricing is $5 per million input tokens and $30 per million output tokens for standard context lengths.
For contexts above 272K tokens, rates increase to $10 input and $45 output. Cached input is $0.50 per million tokens, or $1.00 above 272K. Fugu's standard pricing matches the rate of whichever underlying model is active, and Sakana does not stack fees when multiple agents are running.
One regional limitation to flag: Fugu is not currently available in EU or EEA member states, while Sakana works through GDPR compliance. For European teams, this is a blocker for now. Sakana has not published a timeline for EU availability.
Final Thoughts
Fugu is a genuinely different bet from what OpenAI, Anthropic, and Google are doing. Rather than competing on base model scale, Sakana is arguing that learned orchestration across a swappable pool of frontier models can match or beat any individual system on the tasks that actually take time: long code reviews, multi-step research, security assessments, and paper reproduction.
The benchmark numbers, if they hold up to independent verification, support that argument on most of the metrics that matter for developers. Of course, until we get access again to Fable 5 and Mythos, we can't compare directly to these.
That being said, there are certainly some caveats.
All benchmark scores are Sakana-reported. The routing layer is opaque, which creates friction for compliance-sensitive work. The EU/EEA exclusion limits adoption for a significant portion of the global developer community. And the latency trade-off on simple tasks is genuine: if you're making quick, single-turn API calls, Fugu Ultra's orchestration overhead is pure cost with no benefit.
Where I think Fugu has a real case is for teams running long-horizon agentic workflows who are already managing the complexity of multi-model setups themselves. If you're maintaining your own orchestration harness on top of Claude and GPT, Fugu Ultra is worth a serious evaluation.
The sovereignty framing is probably oversold, but the practical value of a single endpoint that routes around provider disruptions is not.
If you want to get up to speed on the agentic AI landscape that Fugu is operating in, I'd recommend starting with the AI Agent Fundamentals skill track on DataCamp, which covers the foundations of large language models, agents, and multi-model systems.
Sakana Fugu FAQs
What is the maximum context window for Fugu Ultra?
Fugu Ultra supports a massive maximum context window of 1,000,000 tokens. Note that while standard API pricing applies to the first 272,000 tokens, the rates increase for contexts that exceed that threshold.
Does Fugu support web browsing or external tool usage?
Yes. Because Fugu uses a standard OpenAI-compatible API, you can seamlessly point existing AI coding assistants or API clients at the Sakana endpoint. Additionally, Sakana provides a direct, one-line install command to integrate the model natively into Codex, allowing developers to leverage multi-agent orchestration right inside their code editor.
How does Fugu relate to Sakana AI's "Marlin" agent?
While both are multi-agent AI systems built by Sakana, they serve entirely different workflows. Marlin (launched just a week prior to Fugu) is an autonomous "ultra deep research" agent designed to run asynchronously for up to eight hours to generate 100-page strategy reports. Fugu, by contrast, is an orchestration API meant for immediate interaction, synthesizing responses across frontier models for lower-latency coding and reasoning tasks.
Can the hosted Fugu API interact directly with my local filesystem or code editor?
No. Because Fugu is a hosted orchestration layer accessed via a remote endpoint, it cannot natively touch your local file system, run shell commands directly on your machine, or commit code to your local repositories. While it scores highly on terminal-based benchmarks, you will still need local execution agents or IDE integrations (like Cursor or Claude Code) on your side of the API to execute those physical file changes.
A senior editor in the AI and edtech space. Committed to exploring data and AI trends.




