Sakana Fugu: Features, Benchmarks, and How It Works

Sakana AI's Fugu orchestrates a pool of frontier LLMs behind one API. We cover the features, benchmark numbers, pricing, and real-world use cases.

Jun 24, 2026 · 12 min read

The AI landscape in mid-2026 is crowded with monolithic frontier models competing on raw benchmark scores. Sakana AI is taking a different bet: instead of training a bigger base model, it has shipped Fugu, a system that orchestrates a pool of existing frontier models behind a single OpenAI-compatible API.

The headline claim is that Sakana Fugu Ultra matches Anthropic's Fable 5 and Mythos Preview on engineering, scientific, and reasoning benchmarks, without requiring access to either of those export-controlled models.

Sakana Fugu launched on June 22, 2026, in two variants: Fugu for everyday latency-sensitive work, and Fugu Ultra for complex, multi-step tasks. On SWE-Bench Pro, Fugu Ultra scores 73.7%, ahead of Claude Opus 4.8 (69.2%), GPT-5.5 (58.6%), and Gemini 3.1 Pro (54.2%).

All benchmark numbers are Sakana-reported and have not yet been independently reproduced by third-party labs, which is worth keeping in mind as you read.

In this article, I'll cover what Fugu actually is, walk through its key features, look at the full benchmark table, and look at some use cases. You can also see our comparison of GPT-5.5 vs Gemini 3.1 Pro for context on the models Fugu is competing against.

What is Sakana Fugu?

Sakana Fugu is a language model trained to act as an orchestrator: it receives a request, decides whether to handle it directly or delegate to specialist models in its agent pool, manages verification and synthesis, and returns a single response. From the outside, you call one endpoint. On the inside, a coordinated set of models does the work.

Sakana AI has been building toward this architecture since its founding. The lab, started by Llion Jones (a co-author of the original "Attention Is All You Need" paper) and David Ha, has long argued that coordinated model ecosystems outperform isolated monoliths on hard, long-running tasks.

Fugu is the productized version of that thesis, grounded in two ICLR 2026 papers: TRINITY (an evolved LLM coordinator) and Conductor (learning to orchestrate agents in natural language).

At the time of writing, Anthropic's Fable 5 and Mythos Preview are not publicly accessible due to export controls, which means they cannot be included in Fugu's agent pool.

Sakana's argument is that routing across a swappable pool of models provides a practical hedge against exactly this kind of access disruption. Whether that constitutes genuine "AI sovereignty" is debatable, but the operational logic is sound: if one provider restricts access, Fugu routes around it.

Source

What's New With Sakana Fugu?

Fugu introduces several ideas that distinguish it from both conventional frontier models and traditional multi-agent frameworks. I've covered the key ones below.

Learned orchestration, not fixed pipelines

Most multi-agent systems require developers to define which model handles which task. Fugu learns this. The orchestrator model is trained to decide when to delegate, how agents should communicate, and how to combine their outputs into a single answer. This is the core contribution from the TRINITY and Conductor research I linked to above.

In practice, this means the complexity of a multi-agent system never reaches your code. You send a request to one endpoint, and Fugu handles model selection, delegation, verification, and synthesis internally. For teams that have tried to build their own orchestration layers, the appeal is obvious: you get the coordination gains without maintaining the harness.

One caveat worth flagging: the routing decisions are proprietary and not exposed to users. You cannot see which underlying model handled a given request. For compliance-sensitive work where you need to audit the reasoning chain, this opacity is a real limitation.

Swappable agent pool

The models in Fugu's pool are not fixed. When a new frontier model becomes publicly available, Sakana expects to spend roughly two weeks training and evaluating updated Fugu models before rolling them out. This means Fugu's performance should improve as the underlying ecosystem improves, without requiring users to change their integration.

For the standard Fugu model, users can also opt specific agents out of the pool from the console settings, which matters for teams with data privacy or compliance requirements. Fugu Ultra's pool is fixed because it relies on the full set of agents to deliver its benchmark performance.

Two-tier model design

Fugu and Fugu Ultra serve different workloads. Fugu balances quality with low latency, making it a reasonable default for coding, code review, and interactive services. Fugu Ultra coordinates a deeper pool of expert agents and is tuned for maximum answer quality on hard, multi-step problems.

Early users reached for Fugu Ultra on tasks like paper reproduction, cybersecurity analysis, Kaggle-style data science, and patent investigations.

One software engineer in the beta reported that Fugu Ultra surfaced more than twenty code review issues, whereas other tools flagged about three.

A cybersecurity engineer reported that a single scoped instruction drove a full security assessment end-to-end, including a clean report with evidence and retest steps.

I think it's worth noting here that the latency trade-off is real. The multi-agent routing and synthesis add overhead. For simple queries or tight latency requirements, a direct call to a single frontier model will likely be faster and cheaper.

OpenAI-compatible API

Both Fugu and Fugu Ultra are accessible through a single OpenAI-compatible API. No SDK migration is required.

You point your existing client at the Fugu endpoint with your API key and start sending requests. This is a deliberate choice to lower the switching cost for teams already using GPT-5.5 or Claude Opus 4.8 via the OpenAI client library.

Sakana Fugu Benchmarks

According to Sakana's comparison table, Fugu Ultra tops 10 of 11 benchmarks, with GPT-5.5 winning only MRCRv2. All scores, except Fugu's, are provider-reported. Fable 5 and Mythos Preview are excluded from the comparison because they are not publicly accessible and therefore cannot be in Fugu's agent pool.

Independent reproductions of these numbers have not yet appeared on third-party leaderboards as far as I've seen at the time of writing.

Benchmark	Fugu	Fugu Ultra	Opus 4.8 †	Gemini 3.1 Pro †	GPT 5.5 †
SWE Bench Pro *	59.0	73.7	69.2	54.2	58.6
TerminalBench 2.1	80.2	82.1	74.6	70.3	78.2
LiveCodeBench	92.9	93.2	87.8	88.5	85.3
LiveCodeBench Pro	87.8	90.8	84.8	82.9	88.4
Humanity's Last Exam	47.2	50.0	49.8	44.4	41.4
CharXiv Reasoning	85.1	86.6	84.2	83.3	84.1
GPQA-D	95.5	95.5	92.0	94.3	93.6
SciCode	60.1	58.7	53.5	58.9	56.1
τ3 Banking	21.7	20.6	20.6	8.4	20.6
Long Context Reasoning	74.7	73.3	67.7	72.7	74.3
MRCRv2	86.6	93.6	87.9	84.9	94.8

SWE-Bench Pro

Fugu Ultra scores 73.7% on SWE-Bench Pro, ahead of Claude Opus 4.8 at 69.2%, GPT-5.5 at 58.6%, and Gemini 3.1 Pro at 54.2%. In our coverage of GPT-5.5, we noted it scored 58.6% on this benchmark, which Fugu Ultra clears by 15 points.

SWE-Bench Pro tests whether a model can resolve real GitHub issues in software repositories, making it one of the more practically grounded coding benchmarks available. The gap between Fugu Ultra and the next-best publicly accessible model is large enough to matter for teams doing serious code review or bug-fixing work.

LiveCodeBench and LiveCodeBench Pro

Fugu Ultra scores 93.2% on LiveCodeBench and 90.8% on LiveCodeBench Pro. The next-best scores are Gemini 3.1 Pro at 88.5% and Claude Opus 4.8 at 87.8% on the standard version.

LiveCodeBench tests competitive programming problems drawn from recent contests, which means the training data contamination risk is lower than on older coding benchmarks. The Pro variant uses harder problems. Fugu's lead here is consistent with the beta reports of strong code review performance.

Humanity's Last Exam

Fugu Ultra scores 50.0% on Humanity's Last Exam, compared to Claude Opus 4.8 at 49.8%, Gemini 3.1 Pro at 44.4%, and GPT-5.5 at 41.4%.

This benchmark tests expert-level knowledge across scientific and academic domains, with questions designed to be difficult even for domain specialists. A score of 50% on this benchmark is notable given that GPT-5.5 sits at 41.4%. The gap between Fugu Ultra and GPT-5.5 here is the largest of any benchmark in the table.

GPQA Diamond

Both Fugu and Fugu Ultra score 95.5% on GPQA Diamond, with Gemini 3.1 Pro at 94.3%, GPT-5.5 at 93.6%, and Claude Opus 4.8 at 92.0%.

GPQA Diamond tests PhD-level science questions in biology, chemistry, and physics, written by domain experts and verified to be difficult for non-specialists. The fact that standard Fugu matches Fugu Ultra here suggests the orchestration overhead on this benchmark is not adding much, which makes sense for a question-answering task that doesn't require multi-step planning.

TerminalBench 2.1

Fugu Ultra scores 82.1% on TerminalBench 2.1, ahead of GPT-5.5 at 78.2%, Claude Opus 4.8 at 74.6%, and Gemini 3.1 Pro at 70.3%. TerminalBench tests agentic coding tasks that require interacting with a terminal environment, including file manipulation, shell commands, and multi-step execution.

This is one of the benchmarks most relevant to real developer workflows, and Fugu Ultra's lead over GPT-5.5 here is consistent with the cybersecurity assessment use case reported in the beta.

MRCRv2

GPT-5.5 wins MRCRv2 with 94.8%, compared to Fugu Ultra at 93.6% and Claude Opus 4.8 at 87.9%. MRCRv2 tests long-context recall, specifically whether a model can retrieve specific information from very long documents.

This is the one benchmark where Fugu Ultra does not lead, which is worth noting for teams whose primary use case involves retrieving specific facts from large documents rather than multi-step reasoning or code generation.

Source

Sakana Fugu Pricing and Availability

Fugu is available now through Sakana's console at console.sakana.ai, with an OpenAI-compatible API. Both Fugu and Fugu Ultra are included in every plan. The subscription tiers are:

Standard: $20/month, baseline usage, suited for occasional API calls and personal experiments
Pro: $100/month, 10x the Standard usage, suited for regular coding and research sessions
Max: $200/month, 30x the Standard usage, suited for heavy, long-running workloads

The pay-as-you-go plan bills by token usage rather than a monthly allowance. For Fugu Ultra (model ID: fugu-ultra-20260615), pricing is $5 per million input tokens and $30 per million output tokens for standard context lengths.

For contexts above 272K tokens, rates increase to $10 input and $45 output. Cached input is $0.50 per million tokens, or $1.00 above 272K. Fugu's standard pricing matches the rate of whichever underlying model is active, and Sakana does not stack fees when multiple agents are running.

One regional limitation to flag: Fugu is not currently available in EU or EEA member states, while Sakana works through GDPR compliance. For European teams, this is a blocker for now. Sakana has not published a timeline for EU availability.

Final Thoughts

Fugu is a genuinely different bet from what OpenAI, Anthropic, and Google are doing. Rather than competing on base model scale, Sakana is arguing that learned orchestration across a swappable pool of frontier models can match or beat any individual system on the tasks that actually take time: long code reviews, multi-step research, security assessments, and paper reproduction.

The benchmark numbers, if they hold up to independent verification, support that argument on most of the metrics that matter for developers. Of course, until we get access again to Fable 5 and Mythos, we can't compare directly to these.

That being said, there are certainly some caveats.

All benchmark scores are Sakana-reported. The routing layer is opaque, which creates friction for compliance-sensitive work. The EU/EEA exclusion limits adoption for a significant portion of the global developer community. And the latency trade-off on simple tasks is genuine: if you're making quick, single-turn API calls, Fugu Ultra's orchestration overhead is pure cost with no benefit.

Where I think Fugu has a real case is for teams running long-horizon agentic workflows who are already managing the complexity of multi-model setups themselves. If you're maintaining your own orchestration harness on top of Claude and GPT, Fugu Ultra is worth a serious evaluation.

The sovereignty framing is probably oversold, but the practical value of a single endpoint that routes around provider disruptions is not.

If you want to get up to speed on the agentic AI landscape that Fugu is operating in, I'd recommend starting with the AI Agent Fundamentals skill track on DataCamp, which covers the foundations of large language models, agents, and multi-model systems.

What is the maximum context window for Fugu Ultra?

Does Fugu support web browsing or external tool usage?

How does Fugu relate to Sakana AI's "Marlin" agent?

Can the hosted Fugu API interact directly with my local filesystem or code editor?

Author

Matt Crabtree

Topics

Artificial Intelligence

Large Language Models

Top DataCamp Courses

Track

AI Agent Fundamentals

6 hr

Discover how AI agents can change how you work and deliver value for your organization!

See Details

Start Course

Course

Designing Agentic Systems with LangChain

3 hr

12.2K

Get to grips with the foundational components of LangChain agents and build custom chat agents.

See Details

Start Course

Course

Google: Agent Fundamentals

1 hr

147

Learn AI agent fundamentals — how they differ from LLMs, when to use them, and explore agent architecture, orchestration, and tools.

See Details

Start Course

blog

Claude 3.7 Sonnet: Features, Access, Benchmarks & More

Learn about Claude 3.7 Sonnet's hybrid approach of combining reasoning mode and generalist mode, key benchmarks, and how to access it via web or API.

Alex Olteanu

8 min

Strawberry coding on a computer, representing OpenAI’s o3 innovations

blog

OpenAI’s O3: Features, O1 Comparison, Benchmarks & More

Learn about OpenAI’s o3 and o3 mini, including their key features, ARC AGI breakthroughs, and safety innovations like deliberative alignment.

Alex Olteanu

8 min

blog

Meta's Llama 4: Features, Access, How It Works, and More

Learn about the Llama 4 suite of large language models, including Llama 4 Scout, Llama 4 Maverick, and the in-training Llama 4 Behemoth.

Alex Olteanu

8 min

blog

Claude Haiku 4.5: Features, Testing Results, and Use Cases

Discover Anthropic’s new Haiku 4.5 model release. Explore new features, testing results, and use cases.

Srujana Maddula

10 min

blog

Claude Mythos 5: Features, Benchmarks, and What It Can Do

Anthropic's most capable model yet, Claude Mythos 5 brings Mythos-class AI to cybersecurity, drug design, and scientific research with the safeguards lifted for trusted partners.

Tom Farnschläder

11 min

Tutorial

LLM Benchmarks Explained: A Guide to Comparing the Best AI Models

Cut through the hype. Learn to interpret LLM benchmarks, navigate open leaderboards, and run your own evaluations to find the best AI models for your needs.

Bex Tuychiev

See More See More

What is Sakana Fugu?

What's New With Sakana Fugu?

Learned orchestration, not fixed pipelines

Swappable agent pool

Two-tier model design

OpenAI-compatible API

Sakana Fugu Benchmarks

SWE-Bench Pro

LiveCodeBench and LiveCodeBench Pro

Humanity's Last Exam

GPQA Diamond

TerminalBench 2.1

MRCRv2

Sakana Fugu Pricing and Availability

Final Thoughts

Sakana Fugu FAQs

How does Fugu relate to Sakana AI's "Marlin" agent?

Can the hosted Fugu API interact directly with my local filesystem or code editor?

Claude 3.7 Sonnet: Features, Access, Benchmarks & More

OpenAI’s O3: Features, O1 Comparison, Benchmarks & More

Meta's Llama 4: Features, Access, How It Works, and More

Claude Haiku 4.5: Features, Testing Results, and Use Cases

Claude Mythos 5: Features, Benchmarks, and What It Can Do

LLM Benchmarks Explained: A Guide to Comparing the Best AI Models

.css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}AI Agent Fundamentals

Designing Agentic Systems with LangChain

Google: Agent Fundamentals

Claude 3.7 Sonnet: Features, Access, Benchmarks & More

OpenAI’s O3: Features, O1 Comparison, Benchmarks & More

Meta's Llama 4: Features, Access, How It Works, and More

Claude Haiku 4.5: Features, Testing Results, and Use Cases

Claude Mythos 5: Features, Benchmarks, and What It Can Do

LLM Benchmarks Explained: A Guide to Comparing the Best AI Models

AI Agent Fundamentals