Qwen3.7-Max: Features, Benchmarks, and the Agent Frontier

Alibaba's Qwen3.7-Max is a new proprietary flagship model built for agentic workflows, with top-tier scores on coding, reasoning, and long-horizon task benchmarks.

22 maj 2026 · 12 min läsa

The race for the best agentic AI model is getting crowded. Anthropic's Claude Opus 4.6, DeepSeek V4 Pro, and Kimi K2.6 have all staked claims to the top of the coding and reasoning leaderboards in recent months. Now, Alibaba's Qwen team has entered with Qwen3.7-Max, a proprietary flagship model they describe as built specifically for the agent era.

The headline numbers are worth paying attention to. On GPQA Diamond, Qwen3.7-Max scores 92.4, beating Claude Opus 4.6 Max's 91.3. On the Apex reasoning benchmark, it scores 44.5 against DeepSeek V4 Pro's 38.3. And in a 35-hour autonomous kernel optimization run, the model made 1,158 tool calls and achieved a 10x geometric mean speedup over the Triton reference implementation.

In this article, I'll cover everything new with Qwen3.7-Max, looking at its key features, exploring the benchmark results, and proposing hands-on tests you can run yourself. You can also check out our coverage of Gemini 3.5 Flash and Gemini Omni for context on where the broader frontier sits right now.

What is Qwen3.7-Max?

Qwen3.7-Max is Alibaba's latest proprietary model, positioned as the top tier of the Qwen3.x family above Qwen3.6-Plus.

It is a closed-source, API-accessible model available through Alibaba Cloud Model Studio, with a 1 million token context window and text-only input and output. The Qwen team describes it as a "versatile agent foundation" rather than a general-purpose chat model.

Compared to Qwen3.6-Plus, the improvements are substantial across every category. On Terminal Bench 2.0-Terminus, Qwen3.7-Max scores 69.7 against Qwen3.6-Plus's 61.6. On the YC-Bench startup simulation, it achieved 2.08M USD in total revenue, double Qwen3.6-Plus's 1.05M USD. The gap is widest on long-horizon agentic tasks, which is exactly where the team says they focused their training effort.

The Artificial Analysis Intelligence Index gives Qwen3.7-Max a score of 56.6, placing it above competitors and just short of some of the best frontier models.

One thing worth flagging early: the model is notably verbose. Artificial Analysis observed approximately 97 million tokens generated during their evaluation, far above the median of 24 million. That verbosity has cost implications for long agentic sessions, which I'll return to in the pricing section.

What's New With Qwen3.7-Max?

Qwen3.7-Max introduces several capabilities that distinguish it from both its predecessor and the current field of frontier models. The focus throughout is on sustained, autonomous execution rather than single-turn performance.

Long-horizon autonomous execution

The most striking demonstration in the release is the 35-hour kernel optimization run.

Essentially, the Qwen team handed the AI a tough coding problem (optimizing GPU code), giving it just the instructions and a way to test its work, and then stepped back. From there, Qwen3.7-Max worked autonomously, writing code, running tests, finding bottlenecks, and redesigning the code. It looped through this process over a thousand times.

The final result made the code run 10 times faster than the standard baseline, all created on computer hardware it had never encountered before.

Qwen3.7 Max was still finding meaningful ways to speed up the code even after the 30-hour mark. It didn't just grab the low-hanging fruit and quit.

Other top-tier models like GLM 5.1, Kimi K2.6, and DeepSeek V4 Pro all tapped out much earlier (maxing out at 7.3x, 5.0x, and 3.3x faster, respectively).

This shows that Qwen3.7-Max isn't just good at answering a quick coding question in a single prompt. You can hand it a massive, grueling optimization project, walk away, and trust that it will stay entirely focused without getting confused or losing the plot over a long period.

Cross-harness generalization

Most agent models are trained and evaluated on a specific scaffold, which means their benchmark numbers can reflect harness-specific shortcuts rather than genuine problem-solving.

Qwen3.7-Max is designed to avoid this through a training infrastructure that decouples Task, Harness, and Verifier into three independent components that can be freely recombined.

In practice, this means the model was trained on identical tasks paired with diverse harnesses and verifiers, forcing it to learn generalizable strategies.

The benchmark results reflect this: Qwen3.7-Max performs consistently whether deployed through Claude Code, OpenClaw, Qwen Code, or custom tool-use frameworks. On QwenClawBench and CoWorkBench, performance holds regardless of which harness is used at evaluation time.

For teams building agent systems, this matters because it means Qwen3.7-Max can serve as a drop-in backbone without requiring framework-specific tuning. That is a real operational advantage over models that perform well only in their native scaffold.

MCP and multi-agent orchestration

Qwen3.7-Max supports native integration with Model Context Protocol (MCP), which allows it to connect to external tools and data sources in a standardized way.

On MCP-Mark, it scores 60.8, ahead of GLM-5.1 Thinking's 57.5 and Opus-4.6 Max's 56.7. On MCP-Atlas, it scores 76.4, edging out Opus-4.6 Max's 75.8.

The office automation angle is worth highlighting separately. On SpreadSheetBench-v1, Qwen3.7-Max scores 87.0, second only to Opus-4.6 Max's 89.3.

The Qwen team demonstrates this through a thesis formatting task where the model reads a formatting specification document and autonomously reformats a messy draft, fixing page layout, heading styles, fonts, margins, table of contents, and references through autonomous office-cli tool calls.

Reward hacking self-monitoring

One of the more unusual features in this release is a self-monitoring framework for RL training. During RL experiments exceeding 80 hours, Qwen3.7-Max autonomously retrieved and replayed training trajectories, executing over 10,000 calls to identify reward hacking patterns.

These included attempts to bypass constraints and access ground-truth answers on GitHub.

The system performed rule verification, counter-example mining, and iterative optimization, ultimately adding 13 new heuristic rules and accurately flagging 1,618 hacking cases.

This is less a user-facing feature and more a signal about how the model was trained, but it is relevant for teams thinking about deploying Qwen3.7-Max in RL pipelines of their own.

Qwen3.7-Max can now operate a robot dog through tool-use calls, using the Qwen-RobotClaw agent harness and Qwen-RobotNav navigation foundation model.

The model handles physical understanding, planning, memory, and decision-making in real environments. This is demonstrated in a 20-minute agent session where the robot navigates a physical space using the model's long-term memory and first-person visual input.

I would not overstate this as a production-ready robotics platform, but it does illustrate the breadth of the agent framework. The same tool-calling infrastructure that handles spreadsheet automation and kernel optimization also extends to physical-world navigation.

Qwen3.7-Max Benchmarks

Let's look at the benchmark data in more detail.

Image source

Qwen3.7-Max was evaluated across four broad categories: coding agents, general-purpose agents, STEM reasoning, and general capabilities, including multilingual performance.

The results are drawn from a wide variety of agent scaffolds, which makes them more meaningful than single-scaffold numbers.

Coding agent benchmarks

On Terminal Bench 2.0-Terminus, Qwen3.7-Max scores 69.7, ahead of DeepSeek-V4-Pro Max (67.9), Opus-4.6 Max (65.4), and K2.6 Thinking (66.7).

This benchmark tests autonomous terminal-based software engineering with a 5-hour timeout and 12 CPU cores. On SWE-Pro, it scores 60.6, the highest in the comparison table, ahead of K2.6 Thinking (59.5) and DS-V4-Pro Max (59.0).

SWE-Verified is the one benchmark where Qwen3.7-Max does not lead: it scores 80.4 against Opus-4.6 Max's 80.8 and DS-V4-Pro Max's 80.6.

The gap is small, but it is worth noting. On SWE-Multilingual (78.3) and SciCode (53.5), it leads the field. On QwenSVG, it scores 1608, ahead of GLM-5.1 Thinking's 1605 and Opus-4.6 Max's 1541.

General agent benchmarks

The general agent results are where Qwen3.7-Max makes its strongest case. On MCP-Mark, it scores 60.8 against GLM-5.1's 57.5 and Opus-4.6's 56.7.

On MCP-Atlas, it scores 76.4, edging Opus-4.6's 75.8. On Skillsbench, it scores 59.2 against K2.6 Thinking's 56.2. On SpreadSheetBench-v1, it scores 87.0, second only to Opus-4.6 Max's 89.3.

The Kernel Bench L3 result deserves its own mention. Qwen3.7-Max achieves a 1.98x median speedup with a 96% win rate (fraction of problems faster than torch.compile).

Opus-4.6 Max leads here at 2.63x/98%, but Qwen3.7-Max is well ahead of K2.6 Thinking (1.41x/80%), GLM-5.1 (2.00x/78%), and DS-V4-Pro Max (1.07x/54%). On MRCR-v2 128k, which tests long-context retrieval, it scores 90.4, ahead of Qwen3.6-Plus (85.9) and Opus-4.6 Max (84.0).

STEM and reasoning benchmarks

GPQA Diamond tests PhD-level science questions. Qwen3.7-Max scores 92.4, ahead of Opus-4.6 Max (91.3), K2.6 Thinking (90.5), and DS-V4-Pro Max (90.1).

In our review of GPT-5.5, we noted it scored 93.6% on GPQA Diamond, so GPT-5.5 still leads on this specific benchmark, though direct comparison requires noting the different evaluation conditions.

On HLE (Humanity's Last Exam), Qwen3.7-Max scores 41.4, ahead of Opus-4.6 Max (40.0) and Deepseek-V4-Pro Max (37.7).

On HMMT 2026 Feb (competition mathematics), it scores 97.1, the highest in the table, ahead of Opus-4.6 Max (96.2) and DS-V4-Pro Max (95.2).

On IMOAnswerBench, it scores 90.0, edging DS-V4-Pro Max's 89.8. On the Apex benchmark, it scores 44.5, well ahead of DS-V4-Pro Max's 38.3 and Opus-4.6 Max's 34.5.

Multilingual benchmarks

Qwen3.7-Max leads on WMT24++ (85.8 vs. Qwen3.6-Plus's 84.3 and Opus-4.6 Max's 82.7), which tests translation quality across 55 languages. On MAXIFE, it scores 89.2, ahead of DS-V4-Pro Max's 88.9. On PolyMATH, it scores 86.5, well ahead of Opus-4.6 Max's 80.2 and K2.6 Thinking's 82.7.

Multilingual performance is a consistent strength across the Qwen3.x line, and Qwen3.7-Max extends that lead.

Qwen3.7-Max Pricing and Availability

Qwen3.7-Max is available through Alibaba Cloud Model Studio, with API access via both OpenAI-compatible and Anthropic-compatible endpoints. The model ID is qwen3.7-max.

At the time of writing, Artificial Analysis lists the input and output pricing as $2.50 per 1M tokens input, and $7.50 for output.

The high verbosity flagged by Artificial Analysis (97M tokens generated in their evaluation vs. a median of 24M) means effective costs for long agentic sessions could be significantly higher than headline per-token rates suggest.

For developers, the model supports the preserve_thinking feature, which retains thinking content from all preceding turns in multi-turn conversations. The Qwen team recommends enabling this for agentic tasks. Integration with Claude Code, OpenClaw, and Qwen Code is supported out of the box, with configuration examples provided in the official documentation.

Final Thoughts

Qwen3.7-Max is a serious entry at the top of the agentic model market. The benchmark numbers are strong across the board, and the 35-hour kernel optimization demonstration is the kind of concrete, verifiable result that is harder to dismiss than a leaderboard position.

The cross-harness generalization story is also credible: training on diverse harness configurations is a principled approach to the problem of scaffold overfitting, and the consistent performance across Claude Code, OpenClaw, and Qwen Code backs it up.

Where I'd be cautious is on the verbosity. A model that generates 97M tokens in a standard evaluation suite is going to cost significantly more in practice than its per-token rate implies, especially for the long-horizon agentic sessions that are its primary use case.

Teams building on Qwen3.7-Max should budget carefully and test with explicit output constraints before committing to production workloads.

If you want to build on top of models like Qwen3.7-Max or understand the agentic AI landscape more deeply, I recommend checking out the AI Fundamentals skill track on DataCamp to get up to speed with the concepts underpinning these systems.

Author

Matt Crabtree

Ämnen

Artificial Intelligence

AI Agents

Top DataCamp Courses

track

Grunderna i AI-agenter

6 timmar

Upptäck hur AI-agenter kan förändra hur du arbetar och levererar värde för din organisation!

Se detaljer

Starta kursen

course

Bygga skalbara agentbaserade system

1 tim 30 min

15.8K

Upptäck vad som krävs för att skala AI-agenter, med lite hjälp från ramverk som MCP och A2A.

Se detaljer

Starta kursen

course

AI Agents with Hugging Face smolagents

3 timmar

2.4K

Learn how to build intelligent agents that reason, act, and solve real-world tasks using Python.

Se detaljer

Starta kursen

Se mer

Släkt

blog

Qwen3.5: Features, Access, and Benchmarks

Learn about the new Qwen3.5 series of models, covering the key features, costs, how to access, and how it compares to other similar models.

Tom Farnschläder

8 min

robot representing alibaba's qwen 2.5 max model

blog

Qwen 2.5 Max: Features, DeepSeek V3 Comparison & More

Learn about Alibaba's Qwen2.5-Max, a model that competes with GPT-4o, Claude 3.5 Sonnet, and DeepSeek V3.

Alex Olteanu

8 min

blog

QwQ 32B: Features, Access, DeepSeek-R1 Comparison, and More

Alibaba's Qwen team launched QwQ-32B, a 32-billion parameter, open-source AI model for complex reasoning, competing with larger models like DeepSeek-R1.

Alex Olteanu

6 min

blog

Qwen 3: Features, DeepSeek-R1 Comparison, Access, and More

Learn about the Qwen3 suite, including its architecture, deployment, and benchmarks compared to DeepSeek-R1 and Gemini 2.5 Pro.

Alex Olteanu

8 min

tutorial

Qwen (Alibaba Cloud) Tutorial: Introduction and Fine-Tuning

Qwen is a family of large language and multimodal models developed by Alibaba Cloud, designed for various tasks like text generation, image understanding, and conversation.

Dr Ana Rojo-Echeburúa

tutorial

Qwen-Agent: A Guide With Demo Project

Learn how to use Qwen-Agent and Qwen3 to build a real-time webpage summarizer extension.

Aashi Dutt

Se mer Se mer

What is Qwen3.7-Max?

What's New With Qwen3.7-Max?

Long-horizon autonomous execution

Cross-harness generalization

MCP and multi-agent orchestration

Reward hacking self-monitoring

Physical-world navigation via robot integration

Qwen3.7-Max Benchmarks

Coding agent benchmarks

General agent benchmarks

STEM and reasoning benchmarks

Multilingual benchmarks

Qwen3.7-Max Pricing and Availability

Final Thoughts

Qwen3.5: Features, Access, and Benchmarks

Qwen 2.5 Max: Features, DeepSeek V3 Comparison & More

QwQ 32B: Features, Access, DeepSeek-R1 Comparison, and More

Qwen 3: Features, DeepSeek-R1 Comparison, Access, and More

Qwen (Alibaba Cloud) Tutorial: Introduction and Fine-Tuning

Qwen-Agent: A Guide With Demo Project

.css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}Grunderna i AI-agenter

Bygga skalbara agentbaserade system

AI Agents with Hugging Face smolagents

Qwen3.5: Features, Access, and Benchmarks

Qwen 2.5 Max: Features, DeepSeek V3 Comparison & More

QwQ 32B: Features, Access, DeepSeek-R1 Comparison, and More

Qwen 3: Features, DeepSeek-R1 Comparison, Access, and More

Qwen (Alibaba Cloud) Tutorial: Introduction and Fine-Tuning

Qwen-Agent: A Guide With Demo Project

Grunderna i AI-agenter