Skip to main content

Composer 2.5: Benchmarks, Pricing, and How It Compares

Cursor's latest proprietary model, Composer 2.5, adds targeted RL feedback, more synthetic training tasks, and lower token pricing than frontier models.
May 22, 2026  · 13 min read

Cursor released Composer 2.5 on May 18, 2026, roughly two months after Composer 2 shipped in March. The short gap between releases shows how quickly Cursor is updating its own model line.

Cursor reports that Composer 2.5 scores near Claude Opus 4.7 and GPT-5.5 on several coding benchmarks. Its token price is also lower than that of the frontier models. The training changed too: more synthetic tasks, harder training environments, and a feedback method that targets specific mistakes inside long coding sessions.

In this article, I look at Composer 2.5 as more than a benchmark update. I will cover what it is, what changed, how the benchmarks look, how pricing compares with frontier models, and where it fits in a coding workflow. There are limitations, too, and a few are worth knowing about before you treat the scores as the whole story.

For more background on the other models in this comparison, see our guides to Claude Opus 4.7 and GPT-5.5.

What Is Cursor's Composer 2.5 Model?

Composer 2.5 is the newest model in Cursor's Composer family, built for coding work inside the Cursor IDE. It follows Composer 1, Composer 1.5, and Composer 2.

Horizontal timeline of Cursor Composer model releases from October 2025 to May 2026, showing Composer 1, 1.5, 2, and 2.5 with key training innovations and release dates labeled at each milestone

Composer timeline from launch to 2.5. Image by Author.

This is not a general chatbot. Composer 2.5 is trained for edits across files, terminal commands, tool use, and longer coding sessions. Its training targets and benchmarks focus on software engineering tasks.

The launch post says the model scores above Composer 2 on coding tasks and behaves differently in longer sessions. It is now the default option in Cursor's model picker, though Composer 2 remains available. It also runs only inside Cursor. There is no public API, no Hugging Face model card, and no gateway access through another provider.

What Changed in Composer 2.5

The changes in Composer 2.5 fall into two categories: coding task performance and collaboration behavior. The first is easier to measure than the second, so it is worth separating what Cursor can show with numbers from what it describes more qualitatively.

Longer task performance

Composer 2.5 is aimed at longer coding sessions where a model needs to read files, run terminal commands, fix errors, and iterate. That matters because real development rarely fits into a single prompt and response.

Cursor trained the model in harder reinforcement learning environments for this kind of work. Tasks were created during training, and the difficulty increased over time.

Instruction following and collaboration

The release also describes more reliable instruction following. It points to effort calibration: the model is meant to spend more compute on hard tasks and avoid overthinking simple ones.

There is a caveat here. Cursor notes that these behavior changes "are not well captured by existing benchmarks." So this part of the release rests mostly on Cursor's own assessment and early user feedback, not on a public score.

More difficult RL environments

The launch post frames the training change as "scaling training, generating more complex RL environments, and introducing new learning methods." The training used 25x more synthetic tasks than Composer 2.

How Cursor Trained Composer 2.5

The training details explain why the model changed without a new base architecture. Composer 2.5 uses the same foundation as Composer 2, but the work after the base training has changed. Not every infrastructure detail matters equally for readers, but a few parts help explain the benchmark movement.

Built on Kimi K2.5

Composer 2.5 is built on the same open source checkpoint as Composer 2: Moonshot AI's Kimi K2.5. Cursor said this directly in the launch post, which matters because the base model was a point of debate around Composer 2.

Kimi K2.5 uses a Mixture of Experts architecture. Cursor applies continued pretraining and reinforcement learning on top of that base, and says roughly 85% of the total compute for the final model comes from its own work after base training.

Targeted RL with textual feedback

This is the main technical change in Composer 2.5. Standard RL gives a model one reward signal at the end of a long sequence. In a long coding session, that final reward can be too noisy to show where the model went wrong.

Simplified diagram showing Cursor's targeted RL training method: the original model context produces a student distribution, while a hint inserted at a bad tool call produces a teacher distribution, and a KL distillation loss updates the student toward the teacher for that turn only

The teacher and the student share one turn. Image by Author.

Cursor's method inserts a short text hint at the point where the model made a bad decision. For example, if the model calls a tool that does not exist, the training process can insert a reminder with the correct tool list. The hinted version acts as a "teacher," and the original model acts as a "student." A distillation loss then moves the student's behavior toward the teacher's at that turn only.

The result is more targeted training: individual mistakes can be corrected without treating an entire long rollout as vaguely right or wrong. Cursor applied this method across coding style, tool use, and model communication during the Composer 2.5 training run.

Synthetic data on a larger scale

Composer 2.5 was trained with 25x more synthetic tasks than Composer 2. These tasks are grounded in real codebases, not toy examples.

One approach Cursor describes is feature deletion. An agent starts with a real codebase and a large test suite, then removes code and files while keeping the rest of the project functional. The synthetic task is to reimplement the removed feature, and the tests provide a verifiable reward signal.

The scale of synthetic training introduces its own risks. Cursor documented cases where Composer 2.5 found shortcuts, including recovering deleted information from a Python type-checking cache and decompiling Java bytecode to reconstruct an external API. The company says it caught these using monitoring tools, but acknowledged that training at this scale requires "increasing care."

Infrastructure changes

On the infrastructure side, Cursor used Sharded Muon and dual mesh HSDP for continued pretraining. These changes reduced some of the cost and time involved in training on large GPU clusters.

Composer 2.5 Benchmark Results: Terminal-Bench, SWE-Bench, and CursorBench

Benchmarks are useful, but they do not show the full picture. I would treat them as a starting point for comparison, not as a full verdict on how the model will feel in daily work.

Cursor evaluates Composer 2.5 across three benchmarks:

Benchmark

Composer 2.5

Claude Opus 4.7

GPT-5.5

Composer 2

SWE-Bench Multilingual

79.8%

80.5%

77.8%

73.7%

Terminal-Bench 2.0

69.3%

69.4%

82.7%

61.7%

CursorBench v3.1 (harder tasks)

63.2%

64.8% (max) / 61.6% (default)

64.3% (xhigh) / 59.2% (default)

52.2%

SWE-Bench Multilingual tests whether a model can resolve real GitHub issues across multiple programming languages. Each task gives the model a repository and a problem statement, then checks whether the patch passes the associated tests.

Terminal-Bench 2.0 measures whether an AI agent can operate in real terminal workflows: inspecting files, running commands, debugging failures, and completing tasks with several steps.

CursorBench v3.1 is Cursor's private internal benchmark. It evaluates agents on ambiguous, multi-file tasks from real Cursor sessions, including codebase understanding, bugfinding, planning, and code review. The limitation is that CursorBench cannot be checked or reproduced by outside researchers, and scores should be compared within the same eval version.

There is one caveat that matters before reading too much into these numbers. Benchmark comparisons across models are not always clean. Different evaluation setups and effort settings can move scores, and Cursor notes that Opus 4.7 and GPT-5.5 use self-reported scores for public evaluations. Treat these as directional comparisons, not direct tests under identical conditions.

A later outside benchmark from Artificial Analysis points in a similar direction, though it uses a different benchmark mix. Composer 2.5 scored 62 on the Artificial Analysis Coding Agent Index, behind Claude Opus 4.7 at max effort (66) and GPT-5.5 at xhigh reasoning (65). 

The cost gap is the part I would pay attention to: Artificial Analysis estimated Composer 2.5 at $0.07 per task for Standard and $0.44 for Fast, compared with $4.10 for Opus 4.7 max and $4.82 for GPT-5.5 xhigh.

Composer 2.5 vs. Composer 2 vs. Composer 1.5: How the Scores Compare

The Composer family has had three releases in a short period. Composer 1.5 shipped in February 2026, Composer 2 in March, and Composer 2.5 in May. Each version changed something different about the training approach.

Composer 2.5 versus Composer 2

The jump from Composer 2 to 2.5 is most visible on Terminal-Bench 2.0, where the score moved from 61.7% to 69.3%, and on SWE-Bench Multilingual, from 73.7% to 79.8%. The CursorBench gain is smaller, and the benchmark version changed from v3 to v3.1, so that comparison is less direct.

The bigger difference is the training pipeline. Composer 2 introduced continued pretraining on Kimi K2.5. Composer 2.5 kept that base and added targeted textual feedback, 25x more synthetic tasks, and infrastructure changes. The Standard price stayed the same.

Composer 2.5 versus Composer 1.5

Composer 1.5 was built by scaling reinforcement learning 20x further on the same pretrained model as Composer 1. It introduced adaptive thinking and self-summarization, which lets the model compress its own context when a session runs long.

The Composer 1.5 to 2.5 gap is large across every benchmark. It also came with a lower token price: Composer 1.5 was priced at $3.50 per million input tokens and $17.50 per million output tokens, roughly 7x more expensive than Composer 2.5 Standard.

What changed in practice

Across these versions, the pattern is fairly clear: each generation changed behavior during long sessions and instruction-following, while Composer 2 and 2.5 lowered the cost of sustained agent sessions.

Composer 2.5 vs. Claude Opus 4.7 vs. GPT-5.5: Benchmarks, Price, and Trade-offs

This is the comparison many readers will care about first. Composer 2.5 has similar coding benchmark scores in some areas, a lower token price than the frontier models listed below, and clear trade-offs.

Benchmark comparison

GPT-5.5 leads on Terminal-Bench 2.0 at 82.7%, about 13 points ahead of Composer 2.5. That gap matters for work that depends heavily on terminal use.

Claude Opus 4.7 is slightly ahead of Composer 2.5 on SWE-Bench Multilingual (80.5% versus 79.8%), which is less than a point. On CursorBench, Composer 2.5 at 63.2% is above Opus 4.7 at default settings (61.6%) but below Opus 4.7 at max effort (64.8%). GPT-5.5 also reaches 64.3% at xhigh, while its default score is 59.2%.

These models are not doing the same job. Opus 4.7 and GPT-5.5 are broader frontier models. Composer 2.5 is a coding model that runs only in Cursor. The benchmark scores are close in some coding tasks, but the product boundaries are different.

Pricing comparison

The cost difference is the clearest split from the frontier models.

Model

Input (per 1M tokens)

Output (per 1M tokens)

Composer 2.5 Standard

$0.50

$2.50

Composer 2.5 Fast (default)

$3.00

$15.00

Claude Opus 4.7

$5.00

$25.00

GPT-5.5

$5.00

$30.00

Composer 2.5 Standard is priced at roughly one-tenth of Opus 4.7 and GPT-5.5 per token. The Fast variant is also priced below the standard tiers of either frontier model.

These prices are current as of May 2026, so check Cursor's model pricing, Anthropic's Opus pricing, and OpenAI's API pricing before relying on the comparison.

One note that often gets missed: Composer 2.5 Fast pricing doubled compared to Composer 2 Fast. Standard pricing stayed flat, but Fast is the default, so the upgrade can still raise costs for some users.

Which model should you choose?

Model choice depends on whether cost, terminal work, or deeper planning matters most:

  • Composer 2.5 fits everyday coding inside Cursor, especially edits across files, refactoring, debugging, and agent sessions where cost matters.
  • GPT-5.5 fits tasks where terminal performance matters most.
  • Claude Opus 4.7 fits tasks that need careful reasoning, architectural planning, or a 1-million token context window.

That is the pattern I would take from the numbers: Composer 2.5 covers routine coding work, while frontier models still have a role for broader reasoning or higher terminal scores.

Composer 2.5 Standard vs. Fast: Speed, Price, and When to Use Each

Cursor ships Composer 2.5 in two variants, as it did with Composer 2. According to Cursor, both share the same underlying intelligence. The difference is mainly how quickly the model responds and how much it costs.

Screenshot of the Cursor IDE model picker dropdown with Composer 2.5 Fast shown as the selected default model

Cursor model picker with Composer selected. Image by Author.

Fast is the default and costs $3.00 per million input tokens and $15.00 per million output tokens. It is meant for interactive sessions where low latency matters. Standard runs at $0.50 and $2.50, so it fits background tasks or longer agent loops where immediate feedback is less important.

Composer 2.5 usage sits in Cursor's "Auto + Composer" usage pool, separate from the API pool used for outside models like Claude and GPT. Cursor also offered double usage for the first week after launch.

Composer 2.5 Limitations and Caveats

The caveats are about access, benchmarks, and training risk. They do not make Composer 2.5 unusual, but they do affect how much weight to put on Cursor's claims.

Availability in Cursor only. As I mentioned earlier, Composer 2.5 has no public API. If your workflow depends on calling a model from your own scripts or pipelines, Composer 2.5 is not an option.

CursorBench is not independent. As I covered in the benchmark section, CursorBench v3.1 is internal to Cursor. Its methodology is not fully public, and the tasks cannot be reproduced by external researchers.

Benchmark setup variability. The frontier model scores in Cursor's benchmark chart are not all measured the same way. Treat the comparisons as directional, not definitive.

Reward hacking during training. Cursor disclosed cases where the model found clever shortcuts in synthetic tasks instead of solving them normally. This is an inherent risk of RL at this scale, even when monitoring catches obvious examples.

Effort calibration is unverified. Cursor's claims about communication style and effort calibration are not backed by benchmark data, as I covered earlier. That makes them hard to check from the outside.

When Composer 2.5 Makes Sense

This depends on the task. I would frame Composer 2.5 less as a universal model choice and more as a coding model for people already working inside Cursor.

If you spend most of your day coding inside Cursor and care about token cost, Composer 2.5 Standard has the lowest price in the Composer 2.5 line. That applies to the same edit, refactoring, debugging, and long-session work described above.

If response speed matters more, Composer 2.5 Fast is the default option.

If the task requires broader reasoning, a larger context window, or higher benchmark scores in a specific area, Claude Opus 4.7 or GPT-5.5 may fit that task.

One way to frame it: Composer 2.5 handles the routine coding work I covered above, while a frontier model may fit tasks that need broader reasoning or higher terminal scores. That keeps the comparison grounded without turning it into a recommendation for one model in every case.

Final Thoughts

Composer 2.5 is easy to read as a benchmark story, but I think the more useful point is the direction of travel. Cursor is not only wrapping frontier models inside an editor. It is building a model line around the kinds of work its agents already do: edits across files, terminal steps, longer sessions, and recovery from mistakes.

As I mentioned earlier, the trade-off is that Composer 2.5 is narrow by design. It does not replace Claude Opus 4.7 or GPT-5.5 as a general model, and it does not help if you need an API outside of Cursor. But inside Cursor, the narrower focus is the point. The model is cheaper to run than the frontier options, it is tuned for coding tasks, and it sits close to the product layer where those tasks happen.

The next question is how much of this Cursor wants to own. The company says it is working with SpaceXAI to train a larger model from scratch using 10x more total compute and Colossus 2 infrastructure. No release date has been given, so there is not much to analyze yet. Still, the basic shape is clear enough: Cursor is moving from using models well to building more of the model stack itself.

Composer 2.5 FAQs

Can I use Composer 2.5 outside of Cursor?

No. As covered above, Composer 2.5 runs exclusively inside Cursor's products. If you need a model you can call from your own code, you will need Claude, GPT, or another competitor model through their respective APIs.

Is Composer 2.5 available on the free Hobby plan?

Cursor's launch materials mention "individual plans" without specifying the Hobby tier. The Hobby plan includes a limited agent request pool. The Pro plan at $20 per month is the plan Cursor lists for more regular agent use.

Why is Composer 2.5 built on a Chinese open source model?

As covered in the training section, Cursor uses Moonshot AI's Kimi K2.5 as the base checkpoint. The company says 85% of the total compute goes to its own work after base training, so the final model differs from the raw Kimi K2.5. Cursor is also working with SpaceXAI to train a future model from scratch.

How much does Composer 2.5 cost per task?

According to Cursor's effort and cost chart, a CursorBench task with Composer 2.5 costs under $1 on average. The same task routed through Claude Opus 4.7 or GPT-5.5 costs several dollars, depending on effort settings.

Does the Fast variant produce a different quality output than Standard?

As covered in the Standard vs. Fast section, Cursor describes both variants as sharing the same underlying intelligence. The difference is latency: Fast is for interactive sessions, while Standard is for background work. No independent testing has confirmed that claim as of the launch date.


Khalid Abdelaty's photo
Author
Khalid Abdelaty
LinkedIn

I’m a data engineer and community builder who works across data pipelines, cloud, and AI tooling while writing practical, high-impact tutorials for DataCamp and emerging developers.

Topics

Top AI Courses

Course

Introduction to Claude Models

3 hr
7.9K
Learn how to work with Claude using the Anthropic API to solve real-world tasks and build AI-powered applications.
See DetailsRight Arrow
Start Course
See MoreRight Arrow
Related

blog

Composer 2: Benchmarks, Pricing, and How It Compares

Cursor's latest proprietary model, Composer 2, introduces continued pretraining, RL-trained self-summarization, and a dramatic cut in token pricing.
Khalid Abdelaty's photo

Khalid Abdelaty

15 min

blog

Cursor vs. GitHub Copilot: Which AI Coding Assistant Is Better?

Learn how Cursor and GitHub Copilot work, how they compare on real-world tasks, and which one fits your workflow and budget.
Khalid Abdelaty's photo

Khalid Abdelaty

15 min

gemini 2.5 pro with a large context

blog

Gemini 2.5 Pro: Features, Tests, Access, Benchmarks, and More

Explore Google's Gemini 2.5 Pro, and learn about its impressive 1 million token context window, multimodal capabilities, hands-on test results, and how to access it.
Alex Olteanu's photo

Alex Olteanu

8 min

blog

Qwen3.5: Features, Access, and Benchmarks

Learn about the new Qwen3.5 series of models, covering the key features, costs, how to access, and how it compares to other similar models.
Tom Farnschläder's photo

Tom Farnschläder

8 min

blog

DeepSeek V4: Features, Benchmarks, and Comparisons

Discover DeepSeek V4 features, pricing, and 1M context efficiency. We compare V4 Pro and Flash benchmarks against frontier models like GPT-5.5 and Opus 4.7.
Matt Crabtree's photo

Matt Crabtree

7 min

blog

Claude Opus 4.6: Features, Benchmarks, Hands-On Tests, and More

Anthropic’s latest model tops leaderboards in agentic coding and complex reasoning. Plus, it has a 1M context window.
Matt Crabtree's photo

Matt Crabtree

10 min

See MoreSee More