Hot on the heels of the impressive Gemini 3 release from Google, Anthropic recently announced Claude Opus 4.5, hailing it as the ‘best model in the world for coding, agents, and computer use.’
Despite Gemini 3's extremely impressive benchmarks, it still came in behind Claude Sonnet 4.5 on SWE-bench, a test of software engineering skills. With Claude Opus 4.5, Anthropic beats its own score on SWE-bench and other tests.
Claude Opus 4.5 marks Anthropic's third major model launch in just two months, after Sonnet 4.5 and Haiku 4.5. And now that Anthropic has a valuation north of $350 billion, we know it has the resources to keep running at this pace.
In this article, I’ll explore everything that’s new with Claude Opus 4.5, looking at the benchmarks, new features, and getting hands-on to test its capabilities.
What is Claude Opus 4.5?
Claude Opus 4.5 is the latest large language model from Anthropic. Following on from Opus 4, it is Anthropic's most advanced model, with a focus on coding, reasoning, and long-running tasks. The model scores 80.9% on SWE-bench Verified and 59.3% on Terminal-bench.
Claude Opus 4.5 is now available on Anthropic’s apps, the API, and major cloud platforms.
What’s new with Claude Opus 4.5?
From the announcement, these things stood out to me:
- Agentic coding: Opus 4.5 is state-of-the-art on SWE-bench Verified, outperforming Google's Gemini 3 Pro and OpenAI's GPT-5.1. Anthropic also tested the model on a difficult take-home exam that they give to prospective performance engineers. It scored higher than any human candidate ever had.
- Computer use: Anthropic claims Opus 4.5 is "the best model in the world for computer use," meaning it can interact with software interfaces as a human would: clicking buttons, filling forms, and navigating websites.
- Everyday tasks: The model is "meaningfully better" at working with spreadsheets and slides and conducting deep research, according to Anthropic.
Alongside the model launch, Anthropic announced several product updates, which I’ll discuss in more detail below. They include Claude for Chrome, the browser extension that lets Claude take action across tabs, and Claude for Excel.
Testing Claude Opus 4.5
A write-up wouldn’t be complete without putting the new model through a test or two. Let’s see how the new improvements handle a range of tasks:
Testing Opus 4.5 with an economics question
First, I wanted to see how Opus 4.5 handled a classic economics optimization problem. I'm picturing a log-linear demand model that predicts quantity sold from price. It's the kind of thing you'd see in an econometrics course, but it's also genuinely useful. If you know your demand curve and your costs, you can solve for the price that maximizes revenue.

Opus 4.5 gave me the right answer without thinking long about it, all in a single pass.
I was impressed by this because just about any company that sells anything will have some idea of quantity sold and price, but not every company will have the resources to sketch out answers to rudimentary optimization problems.
But here, with a well-designed prompt, you can find your answer easily. Of course, it would still be up to the analyst to decide if enough of the constraints were really modeled properly.
Here’s what I also liked: Opus 4.5 showed its work. It didn't just give me a number. It walked through the derivative, the factoring, and the algebra. If it had made an error, I could have spotted where.
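For reference, here is a minimal sketch of the kind of calculation involved, with illustrative coefficients of my own rather than the ones from my prompt. Under a log-linear demand curve log(Q) = a + b*p, revenue p*exp(a + b*p) has its maximum where the derivative is zero, which works out to p* = -1/b:
# Minimal sketch with made-up coefficients (a and b are assumptions, not from my prompt)
# Log-linear demand: log(Q) = a + b * p, so revenue is R(p) = p * exp(a + b * p)
# Setting R'(p) = exp(a + b * p) * (1 + b * p) = 0 gives the optimum p* = -1/b
a <- 5
b <- -0.04

revenue <- function(p) p * exp(a + b * p)

p_star <- -1 / b                                                 # closed form: 25
optimize(revenue, interval = c(0, 200), maximum = TRUE)$maximum  # numerical check: ~25
The closed-form answer and a numerical optimizer agree here, which is exactly the kind of cross-check I'd want an analyst to run on the model's output.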
Testing Opus 4.5 with a stats question
Next, I wanted to test Opus 4.5 on a stats question. I created a dataset where yield was a function of temperature with noise added.
library(tidyverse)

set.seed(1024)

# Generate data
n <- 100
df <- data.frame(
  temperature = seq(0, 100, length.out = n)
) %>%
  mutate(yield = 0.5 * temperature + 10 + rnorm(n, 0, 5))
I then fed Claude this prompt.

Claude returned the following code, which I ran to create the histogram shown below.
# Fit model
fit <- lm(yield ~ temperature, data = df)
fitted_vals <- fitted(fit)
resids <- residuals(fit)
original_slope <- coef(fit)[2]

# Residual bootstrap
set.seed(123)
n_boot <- 1000
boot_slopes <- numeric(n_boot)

for (i in 1:n_boot) {
  new_y <- fitted_vals + sample(resids, replace = TRUE)
  boot_fit <- lm(new_y ~ df$temperature)
  boot_slopes[i] <- coef(boot_fit)[2]
}

# Confidence intervals
ci_lower <- quantile(boot_slopes, 0.025)
ci_upper <- quantile(boot_slopes, 0.975)

# Plot
ggplot(data.frame(slope = boot_slopes), aes(x = slope)) +
  geom_histogram(bins = 40, fill = "gray70", color = "white") +
  geom_vline(xintercept = original_slope, color = "red", linewidth = 1) +
  geom_vline(xintercept = ci_lower, color = "steelblue", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = ci_upper, color = "steelblue", linetype = "dashed", linewidth = 1) +
  labs(
    title = "Bootstrap Estimate: Effect of Temperature on Yield",
    subtitle = paste0("Estimate: ", round(original_slope, 3),
                      " | 95% CI: [", round(ci_lower, 3), ", ", round(ci_upper, 3), "]"),
    x = "Slope (yield per °C)",
    y = "Count"
  ) +
  theme_minimal()

I have to say that I like this result. Opus 4.5 found a confidence interval for the slope, which is what I was really asking for.
It used a bootstrap method, which is a good way to get a confidence interval for a regression slope without leaning on the usual normality assumptions. More specifically, it used a bootstrap on the residuals, rather than the alternative that resamples (X, Y) pairs and treats X as random.
All this is a finer but important point for people who do this kind of work: a residual bootstrap is the better choice when X is fixed by design, as I specified in my prompt, and you therefore want inference conditional on those exact X values, as in a designed experiment. What I'm trying to say is that Opus 4.5 listened to the subtleties in the prompt.
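For contrast, here is a minimal sketch of the pairs bootstrap that Opus 4.5 chose not to use, reusing the df generated above. It resamples whole (X, Y) rows, so temperature is treated as random rather than fixed:
# Pairs bootstrap for comparison: resample entire rows, treating X as random
set.seed(123)
n_boot <- 1000
pair_slopes <- numeric(n_boot)

for (i in 1:n_boot) {
  idx <- sample(nrow(df), replace = TRUE)
  pair_fit <- lm(yield ~ temperature, data = df[idx, ])
  pair_slopes[i] <- coef(pair_fit)[2]
}

quantile(pair_slopes, c(0.025, 0.975))  # 95% CI under random-X sampling
The distinction is about what you treat as repeatable across hypothetical replications; it won't always change the interval much, but it changes what the interval means.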
Testing Opus 4.5 with a math question
Next, I wanted to see if I could trip Opus 4.5 up in some way.

In this case, the model was way ahead of me. Opus 4.5 understood that my question was circular: I was defining X by Y and vice versa, so there was no meaningful estimate of a regression slope to pull out of what we were working with.

Testing Opus 4.5 for SQL optimization
Finally, I thought it might be fun to test Opus 4.5 on a SQL question. I handed it a workable but highly inefficient SQL query.
-- Find customers who placed orders above their average order value in 2024
SELECT
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.order_date,
    o.order_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.order_date >= '2024-01-01'
  AND o.order_date < '2025-01-01'
  AND o.order_total > (
    SELECT AVG(o2.order_total)
    FROM orders o2
    WHERE o2.customer_id = c.customer_id
  )
  AND c.customer_id IN (
    SELECT DISTINCT o3.customer_id
    FROM orders o3
    WHERE o3.order_date >= '2024-01-01'
  )
  AND EXISTS (
    SELECT 1
    FROM customer_preferences cp
    WHERE cp.customer_id = c.customer_id
      AND cp.email_opt_in = 1
  )
ORDER BY (
  SELECT COUNT(*)
  FROM orders o4
  WHERE o4.customer_id = c.customer_id
) DESC;
I asked Opus 4.5 to make the query better. It gave me an answer in about a second.

-- Find customers who placed orders above their average order value in 2024
WITH customer_avg AS (
  SELECT
    customer_id,
    AVG(order_total) AS avg_order_total,
    COUNT(*) AS order_count
  FROM orders
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.customer_name,
  o.order_id,
  o.order_date,
  o.order_total
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN customer_avg ca ON c.customer_id = ca.customer_id
JOIN customer_preferences cp ON c.customer_id = cp.customer_id
WHERE o.order_date >= '2024-01-01'
  AND o.order_date < '2025-01-01'
  AND o.order_total > ca.avg_order_total
  AND cp.email_opt_in = 1
ORDER BY ca.order_count DESC;
Among other things, Opus 4.5 replaced the correlated subquery for AVG with a CTE. (The original query recalculated AVG(order_total) for every single row being evaluated; the CTE computes each customer's average once upfront.) Opus 4.5 also removed the redundant IN (SELECT DISTINCT ...) clause, which was doing extra work for nothing. And it converted the EXISTS subquery to a JOIN, which most planners handle at least as well, with the caveat that this assumes customer_preferences has one row per customer, since a one-to-many join would duplicate order rows.
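One additional suggestion of my own, not part of Opus 4.5's answer: a rewrite like this still leans on sensible indexing, and a composite index along these lines (the name and exact column order are illustrative) would support both the per-customer aggregation and the date filter:
-- My own illustrative addition, not part of Opus 4.5's output
CREATE INDEX idx_orders_customer_date_total
    ON orders (customer_id, order_date, order_total);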
Claude Opus 4.5 Agents and Agentic Capabilities
There were other highlights from this latest release from Anthropic. Let’s take a look in more detail:
Developer platform features
Anthropic has added several new building blocks for developers. The most notable is the effort parameter, which lets you control how much thinking the model does before responding. Set it low for quick, lightweight tasks; set it higher when you need the model to chew on something.
According to Anthropic, at medium effort, Opus 4.5 matches Sonnet 4.5's best SWE-bench score while using 76% fewer output tokens. At high effort, it beats Sonnet by over four percentage points while still using roughly half as many tokens.
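As a rough sketch of how you might set this from R via the Messages API, here's an httr2 call. The endpoint and headers are the standard Anthropic API ones, but the name and placement of the effort field below are my assumption based on the announcement, so check the API docs before relying on it:
library(httr2)

resp <- request("https://api.anthropic.com/v1/messages") |>
  req_headers(
    "x-api-key" = Sys.getenv("ANTHROPIC_API_KEY"),
    "anthropic-version" = "2023-06-01"
  ) |>
  req_body_json(list(
    model = "claude-opus-4-5-20251101",  # dated model ID from the release
    max_tokens = 1024,
    effort = "medium",                   # assumed field name: low / medium / high
    messages = list(list(role = "user", content = "Refactor this function."))
  )) |>
  req_perform()

resp_body_json(resp)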
There's also improved context management and memory. For long-running agents, Claude can now automatically summarize earlier context so it doesn't hit a wall mid-task. This pairs with context compaction, which keeps agents running longer with less intervention.
Finally, Opus 4.5 is apparently quite good at multi-agent orchestration, which means it could manage a team of subagents.
Claude Opus 4.5 Deep Research
Apparently, Opus 4.5’s performance on a deep research evaluation increased by roughly 15%.
I put the deep research capability through my own test. I asked it to give me a report on Old English words that survive today but are not commonly used, and how these kinds of words have changed over time. The report was ready in seven minutes:
I was actually fairly impressed with the quality of the report in terms of how interesting it was, the writing quality, the organization, and the depth of research.
It wasn't a totally dry document that was overly dense with archaic or obsolete words. I admit some of the sections could have been better differentiated, and the middle of the report was a little more boring than the rest. But, and here's the real point, it was extremely well researched. Just look at these citations:
Plus, I learned a very nice new word: apricity, meaning the warmth of the sun in winter.
Claude Code
Anthropic has shipped two upgrades to Claude Code.
Plan Mode now builds more precise plans before executing. In practice, this means Claude asks clarifying questions upfront, then generates an editable plan.md file you can review and tweak before it starts working. The idea is to save you from a false start.
Desktop app integration is the other big one. Claude Code is now available in the desktop app, which means you can run multiple local and remote sessions in parallel. To use Anthropic's own example: one agent fixes bugs, another researches GitHub issues, and a third updates docs.
Consumer App Agents
A few agentic features have landed in the Claude consumer apps:
Claude for Chrome lets Claude handle tasks across your browser tabs. It's now available to all Max users. Think of it as a browsing agent that can navigate, click, fill forms, and pull information across multiple sites.
Claude for Excel brings spreadsheet automation to Claude. This one was announced earlier, but Anthropic has now expanded beta access to all Max, Team, and Enterprise users, so it’s becoming more real.
Long conversation handling fixes what was an annoying limitation. Previously, long chats would hit a context limit and just stop, in which case you'd have to start a new conversation. Now Claude automatically summarizes earlier parts of the conversation in the background, which frees up space so you don’t hit a wall.
Claude Opus 4.5 Benchmark Results
Opus 4.5, like the previous Claude models, was tested on the common benchmarks in agentic coding, tool use, computer use, and problem solving. Here are the big results for Opus 4.5.

Opus 4.5 takes the top rank in many of the most important tests: anything that involves actually doing things, like writing code that passes tests (SWE-bench), using tools in multi-step workflows (τ2-bench, MCP Atlas), and operating a computer (OSWorld). It leads, often by a lot.
The gap on scaled tool use is especially large: 62.3% versus 43.8% for the next best model, which is also a Claude model. That says a lot about how heavily Anthropic is investing in agentic performance: even in categories where its own older model held the lead, it hasn't slowed down.
It looks like Gemini 3 Pro has the edge on some of the knowledge-intensive benchmarks, like graduate-level reasoning (GPQA Diamond) and multilingual Q&A (MMMLU). These benchmarks probably reward breadth of training data and careful reasoning over memorized facts, and Google, of course, has a lot of resources.
Advances in Safety and Alignment
The improvements from Anthropic aren't just around coding, research, and computer use. There is also a strong emphasis on safety: Anthropic claims this is the safest and 'most robustly aligned model' it has released yet.
This claim is backed up by a reduction in what Anthropic calls the 'concerning behavior score,' which is lower for Opus 4.5 than for the other 4.5 models, as well as for GPT-5.1 and Gemini 3 Pro. Anthropic has also made Opus 4.5 more robust against prompt injection attacks, which can fool a model into harmful behavior.
Claude Opus 4.5 Pricing and Availability
Claude Opus 4.5 is available today across Anthropic’s full product lineup, including the Claude app, API, and all three major cloud platforms. Developers can access it directly via the model ID claude-opus-4-5-20251101.
Pricing has also taken a notable step down. At $5 per million input tokens and $25 per million output tokens, Opus 4.5 makes Anthropic’s top-tier capabilities far more accessible.
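To make the numbers concrete, here's a quick back-of-the-envelope cost check at those rates, for a hypothetical request size:
# Cost at Opus 4.5 rates: $5 per million input tokens, $25 per million output tokens
input_tokens  <- 10000  # hypothetical request
output_tokens <- 2000
input_tokens / 1e6 * 5 + output_tokens / 1e6 * 25  # $0.10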
For enterprise users, the pricing change could be especially meaningful. The combination of reduced cost, expanded API access, and broader availability across cloud providers positions Opus 4.5 as a competitive option.
Conclusion
Claude Opus 4.5 is Anthropic's clearest statement yet about where it sees itself in the AI race. While Google pushes multimodal understanding and on-device models, Anthropic is betting heavily on actions: agentic coding, tool use, and computer interaction.
The benchmark results really tell the story: Opus 4.5 achieves the highest scores ever recorded on software engineering benchmarks and handles multi-system debugging without much guidance.
My testing confirms what the benchmarks suggest: Opus 4.5 is very capable at multi-step work. Whether it was running bootstrap simulations or synthesizing research across papers, the model approached problems the way a thinker would: adaptively and with clear reasoning. If you're looking to augment your workflows, that's what matters.
If you're keen to learn more about Claude models, I recommend checking out the course, Introduction to Claude Models.
Claude Opus 4.5 FAQs
How does Claude Opus 4.5 differ from Sonnet 4.5 and Haiku 4.5?
Opus 4.5 is the most powerful model, built for complex reasoning, coding, and long-running tasks. Sonnet 4.5 balances performance and cost, while Haiku 4.5 is optimized for speed and efficiency.
Where can I access Claude Opus 4.5?
It’s available through the Claude web and mobile apps, the Claude API, and all major cloud platforms, making it easy to integrate into different workflows.
Did Anthropic increase Claude’s context or memory capabilities?
Yes. Opus 4.5 introduces stronger context management, longer conversation handling, and improved summarization to support extended reasoning and multi-step tasks.
What safety standards does Opus 4.5 follow?
It meets Anthropic’s highest safety tier, with major upgrades in alignment, reduced “concerning behavior” scores, and greater resistance to prompt injection attacks.
What types of use cases benefit most from Opus 4.5?
Opus 4.5 is ideal for complex, multi-step reasoning, like debugging, research, and multi-agent orchestration. For simpler or faster tasks, Sonnet or Haiku may be more cost-effective.

I'm a data science writer and editor with contributions to research articles in scientific journals. I'm especially interested in linear algebra, statistics, R, and the like. I also play a fair amount of chess!

A senior editor in the AI and edtech space. Committed to exploring data and AI trends.