The Four Gaps Between Demo Agents and Production Systems
May 2026
Your Presenter(s)

Yuval Belfer
Senior Developer Advocate at AI21 Labs
Yuval helps clients understand how to use AI21 Labs's enterprise AI tools. He's also a Senior Lecturer in LLM development at Reichman Tech School and an organizer of the AI Tinkerers community. Previously, Yuval was a System-on-Chip Design Engineer at AWS. He hosts YAAP (Yet Another AI Podcast).
Summary
Demo agents fall short of production AI agents in four predictable ways, and the gaps between them are now understood well enough to close systematically.
Yuval Belfer, senior developer advocate at AI21 Labs, walked DataCamp's Agents in the Enterprise audience through the gulf between AI prototypes that demo well and production AI agents that run reliably at scale. He argued that the much-cited MIT statistic, that 95% of AI prototypes never reach production, is no longer the right framing. Agents are reaching production. The harder question is what they cost once they get there.
Belfer organized the talk around four gaps. The validation gap is the difference between success on a single attempt and success across multiple sampled attempts. The contextualization gap is the cost of sending every query to the same frontier model when a smaller one would suffice. The latency gap is the difference between average runtime and the shortest successful path. The decomposition gap is the difference between agents that follow a single linear path and agents that branch, evaluate, and prune like a tree search.
Closing those gaps requires a portfolio approach: multiple models, multiple execution strategies, and a runtime layer that decides which combination to use under specific cost and latency constraints. Belfer introduced Maestro, AI21's orchestration system for LLM model routing and AI agent cost optimization, and showed Pareto curves from the BrowseComp Plus benchmark where ensembling and routing pushed accuracy past frontier-model baselines.
Key Takeaways
- The "95% of AI prototypes fail" framing is outdated; production AI agents now exist at scale, and the open question is what they cost to operate.
- Sampling a smaller model 16 times can outperform a single run of a larger model on hard reasoning and coding tasks.
- Roughly half of common benchmark tasks can be solved by smaller, cheaper models, so routing easy queries away from frontier models cuts cost without sacrificing quality.
- Optimizing for latency and optimizing for cost demand opposite tactics: cost-first runs sequentially with cheap-first escalation, latency-first runs candidates in parallel and terminates on first success.
- ReAct-style agents that decide one step at a time finish slower than agents that produce a high-level plan upfront, because better planning reduces backtracking later.
- Most production agents move in a straight line through a problem; tree-search decomposition that branches into independent sub-problems finishes faster and cheaper.
- With five models, three prompts, four tools, and a few execution strategies, the configuration space is roughly 8,000 points, too large to test at runtime.
- Open-source models including Qwen, MiniMax, and Kimi are strong enough to combine into ensembles that approach frontier-lab performance at lower cost.
Deep Dives
The validation gap: pass@1 versus pass@n
Belfer opened by pulling apart the standard leaderboard. Most leaderboards report a single average score across thousands of test problems, with no error bars and no view of token spend. That number hides the question production teams care about: what happens if the model gets to try the same problem more than once?
This is the validation gap, sometimes called the oracle gap. Pass@1 (success@1) measures whether the model gets the answer right on its first attempt; pass@n measures whether at least one of n sampled attempts is correct. On hard reasoning and coding tasks, the gap between the two can run 15 to 20 percentage points. As Belfer put it, "sometimes you can get to the answer in a cheaper way by just running 16 times smaller model than run one time a large model."
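To make the gap concrete: if a model solves a problem on any single attempt with probability p, the chance that at least one of n independent samples succeeds is 1 - (1 - p)^n. A minimal sketch of that arithmetic, alongside the standard unbiased pass@k estimator used in coding benchmarks (neither is from the talk):

```python
import math

def success_at_n(p_single: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds,
    given a per-attempt success probability p_single."""
    return 1.0 - (1.0 - p_single) ** n

def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: n samples drawn, c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# A model that solves a hard task 30% of the time per attempt clears 99%
# with 16 samples -- provided a validator can pick the correct one.
for n in (1, 4, 16):
    print(n, round(success_at_n(0.30, n), 3))
```

The caveat in that last comment is exactly the validator problem Belfer returns to below.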
That counterintuitive result — sampling a smaller model many times beats a single run of a larger one — is what makes the validation gap exploitable. Belfer showed a chart from a Trey Research run on coding agents with three lines: oracle (the best of n samples), adversarial (the worst), and average. The oracle line sits well above the average across all sample sizes, and the area between them is the room for improvement teams leave on the table when they only run a model once.
He shared an internal AI21 result on SWE-Bench. With four runs of GPT-5 mini, the system beat a single run of GPT-5 outright, at lower cost. The same effect appears in retrieval: indexing a corpus at multiple chunk sizes and selecting the best result per query produces an oracle line well above any single chunk-size baseline. As Belfer summarized, "if the models don't get it right on the first try, they may get it right on the second or the third try."
The catch is that closing the validation gap requires a reducer or judge component that picks the best output from n samples. That validator is hard to build, but Belfer's argument is that this is where the largest accuracy gains live, far more than what prompt tweaks or marginal model upgrades produce.
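A minimal sketch of that sample-then-reduce pattern, assuming you bring your own model client: `generate(task)` returns a candidate answer and `judge_score(task, answer)` returns a numeric score. Both names are placeholders, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor

def best_of_n(task: str, n: int, generate, judge_score):
    """Sample n candidate answers in parallel, then let a judge pick one.

    `generate(task) -> str` and `judge_score(task, answer) -> float` are
    placeholders for your own model calls; the hard part in practice is
    making judge_score reliable, which is the validator Belfer describes.
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda _: generate(task), range(n)))
    return max(candidates, key=lambda ans: judge_score(task, ans))
```

A weak judge erases most of the oracle-line gains the chart shows, which is why the validator, not the sampling, is where the engineering effort goes.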
The contextualization gap: LLM model routing across a portfolio
Average leaderboard scores hide a second pattern. When you look at a benchmark like SWE-Bench problem by problem rather than as a single number, roughly half of all samples are solved by every model tested, including the smaller, cheaper ones. As Belfer put it, "you don't really need a hammer for everything."
That visualization frames the contextualization gap. If easy queries route to a frontier model, you pay frontier prices for work a smaller model could handle. Belfer's example: a customer support agent fielding "what's your refund policy?" does not need the same model that answers "explain why my enterprise contract renewal pricing changed and whether the new structure violates section 4.2 of my agreement." The first is small-RAG territory. The second needs strong reasoning over retrieved data.
The fix is a portfolio of models with a router or orchestrator that picks per query, choosing not just for capability but for cost and speed at the same time. Belfer noted this differs from the simpler routing pattern of "small model for easy tasks, big model for hard ones," because it adds budget constraints and latency targets to the routing decision itself, which is how model routing for agent cost optimization works in serious production systems. In his words, "instead of just picking one model to use, you want to use a portfolio."
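A toy version of that constraint-aware routing decision, with illustrative numbers and a hypothetical difficulty estimator; this is the shape of the choice, not Maestro's implementation:

```python
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    cost_per_query: float    # dollars per call, illustrative numbers
    p50_latency_s: float     # typical latency in seconds
    success_rate: float      # measured on a dev set for this workload

PORTFOLIO = [
    ModelProfile("small-rag",    0.002, 0.8, 0.72),
    ModelProfile("mid-reasoner", 0.020, 2.5, 0.85),
    ModelProfile("frontier",     0.150, 6.0, 0.93),
]

def route(query: str, max_cost: float, max_latency_s: float,
          estimate_difficulty) -> ModelProfile:
    """Cheapest model expected to handle this query within the budget.
    `estimate_difficulty(query) -> float in [0, 1]` is a placeholder."""
    required = 0.60 + 0.35 * estimate_difficulty(query)
    feasible = [m for m in PORTFOLIO
                if m.cost_per_query <= max_cost
                and m.p50_latency_s <= max_latency_s
                and m.success_rate >= required]
    if feasible:
        return min(feasible, key=lambda m: m.cost_per_query)
    # Nothing clears the bar: fall back to the strongest affordable model.
    affordable = [m for m in PORTFOLIO if m.cost_per_query <= max_cost]
    return max(affordable or PORTFOLIO, key=lambda m: m.success_rate)
```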
Portfolios also improve quality, not just cost. Belfer showed a Venn diagram across three models with different retrieval methods on BrowseComp Plus. Forty-six percent of examples were solved by all three, but each model, including the weakest one (MiniMax), uniquely solved a slice no other model could. Ensembling those models pushed accuracy past the previous state-of-the-art score of around 90%.
He addressed open source directly. Qwen, MiniMax, and Kimi are strong enough now that some combination of them can match what frontier labs ship, with a smaller gap than a year ago and far lower cost. The portfolio approach works because the models are diverse, not because any single one is best.
The latency gap: planning trades upfront cost for end-to-end speed
Latency and cost are often treated as the same axis. Belfer's third gap is that they are not, and optimizing for them sometimes pulls in opposite directions. "Optimizing for latency and… optimizing for cost, it's not the same thing," he said.
To minimize cost, you run candidates sequentially, evaluate after each, and stop when one works. To minimize latency, you run candidates in parallel and terminate everything else as soon as one succeeds, paying for compute you will never use in exchange for a faster answer.
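The two strategies in sketch form, assuming `attempt(model, task)` is an async call that returns a candidate answer and `is_good` is your validator; both are placeholders:

```python
import asyncio

async def cost_first(task, models, attempt, is_good):
    """Cheap-first escalation: try models in ascending cost order and stop
    at the first validated answer. Minimizes spend, pays in wall-clock time."""
    for model in models:                     # assume sorted cheapest first
        answer = await attempt(model, task)
        if is_good(answer):
            return answer
    return None

async def latency_first(task, models, attempt, is_good):
    """Parallel race: launch every candidate at once, return the first
    validated answer, cancel the rest. Fastest, but pays for unused compute."""
    runs = [asyncio.create_task(attempt(m, task)) for m in models]
    try:
        for finished in asyncio.as_completed(runs):
            answer = await finished
            if is_good(answer):
                return answer
        return None
    finally:
        for r in runs:
            r.cancel()
```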
The latency gap itself is the difference between how long a system takes on average and how long it could take if it followed the shortest successful path. On hard reasoning tasks, the shortest successful trajectory runs three times faster than the average attempt, which is the headroom available to a system that can find that path.
Belfer's most concrete example pitted two agent architectures against each other on SWE-Bench: a ReAct loop that decides one step at a time, and a high-level plan agent that produces a full plan upfront before executing. The high-level plan agent spent more time in the early "context building" phase, which sounds like a latency loss. But as Belfer put it, "good planning in the beginning saves us time later, whereas the ReAct loop spends a lot of time on fixing."
Across the entire run, the planned approach finished faster. ReAct's per-step decisions were quick individually, but the loop kept hitting dead ends that required backtracking and fixing. The lesson is that latency optimization is not a greedy local problem. Sometimes you spend more time at one step to spend less time across the whole run.
The decomposition gap: agents that branch like a tree search
The first three gaps live at the system level, around the agent. The fourth lives inside it. "Right now, most agents work like a straight line," Belfer said.
Most production AI agents follow a single planned path. Even when they have a plan, they do not stop at each step to consider whether the path is going badly or whether a different branch would be more promising. Belfer described the alternative as Monte Carlo Tree Search-style execution: at each decision point, the agent generates several candidate approaches, evaluates them, prunes the weak ones, and continues down the most promising branch.
His worked example was a coding agent fixing a complex bug that turned out to involve two independent sub-problems, one in the database layer and one in the API. A linear agent tackles them in sequence, taking around eight minutes. A decomposition-aware agent recognizes the independence, dispatches two parallel branches, picks the cheapest viable model for each, validates them independently, and merges. Three minutes for the same output, at lower cost because each sub-problem used a cheaper model.
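A toy version of that flow, with every callable left as a placeholder (`decompose`, `solve`, `validate`, `merge`); the point is the shape of the loop, not a real implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def fix_bug(ticket, decompose, solve, validate, merge):
    """Decomposition-aware loop: split the problem into independent
    sub-problems, solve each on its own (cheaper) branch in parallel,
    validate each branch, then merge. All callables are placeholders."""
    sub_problems = decompose(ticket)          # e.g. a DB fix and an API fix
    with ThreadPoolExecutor(max_workers=len(sub_problems)) as pool:
        patches = list(pool.map(solve, sub_problems))
    if not all(validate(p) for p in patches):
        raise RuntimeError("a branch failed validation; escalate or retry")
    return merge(patches)
```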
The decomposition gap is harder to close than the other three because it requires the agent to reason about problem structure inside its own loop, not just at the system boundary. Belfer called it "frontier territory." Most production systems do not close this gap at all, and the ones that try are working at the edge of what is currently practical.
What ties decomposition back to the other gaps is that it applies the same techniques recursively. Validation, contextualization, and latency strategies all become more powerful when the agent can apply them at every node of an internal search tree, not just at the boundary of a single call.
Closing the gaps with Maestro and the Pareto frontier
The optimization problem balloons quickly. Five models, three prompts, four tools, vertical scaling across five thinking budgets, horizontal scaling across three parallel branches with three execution strategies. Belfer's count came to roughly 8,000 configuration points per task. "It's runtime. You can't try everything all at once," he said.
You cannot sweep that space exhaustively at runtime the way you would tune hyperparameters during training. The model lineup changes weekly, prices change, task distributions drift. Belfer framed the Pareto frontier, the set of configurations that represent the best achievable accuracy for a given cost or latency budget, as the right object to optimize against. The goal is not to find a single best configuration but to know the curve and pick the operating point that matches current constraints.
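A sketch of what optimizing against that curve looks like: profile configurations offline, keep only the non-dominated ones, and at runtime take the most accurate point the current budget allows. The `Config` shape is illustrative, not Maestro's internals:

```python
from dataclasses import dataclass

@dataclass
class Config:
    name: str
    cost: float       # average cost per task on the profiling set
    accuracy: float   # success rate on the profiling set

def pareto_frontier(configs):
    """Keep configurations not dominated by any other (cheaper and at least
    as accurate, or same cost and strictly more accurate)."""
    def dominated(c):
        return any((o.cost < c.cost and o.accuracy >= c.accuracy) or
                   (o.cost <= c.cost and o.accuracy > c.accuracy)
                   for o in configs)
    return [c for c in configs if not dominated(c)]

def pick_operating_point(frontier, cost_budget):
    """Most accurate frontier configuration that fits the cost budget."""
    affordable = [c for c in frontier if c.cost <= cost_budget]
    return max(affordable, key=lambda c: c.accuracy) if affordable else None
```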
Maestro is AI21's answer. It splits into two phases. At build time, it learns the cost, latency, and success characteristics of different configurations on a given task by running on a small training set. Belfer cited 100 examples for the BrowseComp Plus run. At runtime, given the customer's cost and latency constraints for a specific request, it decides what to run, in what order, when to branch, when to escalate, and when to stop early. As he put it, "the same agent handling the same ticket query can run at very different cost latency points depending on how you answer these four questions."
The principles Belfer laid out at the end of the talk follow from the gaps directly. Optimization should be automatic because developers cannot manually tune every new model release. It should be efficient: you can add overhead in exchange for time saved, but you cannot add time. It should provide visibility into the Pareto curve so teams can pick their operating point. And it should be future-proof, adapting whenever a new model arrives without another round of trial-and-error.