Anthropic just released Claude 4, which comes in two versions: Claude Sonnet 4 and Claude Opus 4.
Claude Sonnet 4 is a generalist model that’s great for most AI use cases and especially strong at coding. I think it’s one of the best models you can use for free.
Claude Opus 4 is designed for reasoning-heavy tasks like agentic search and long-running code workflows. Anthropic calls Opus 4 “the best coding model in the world,” but I find that claim a bit empty.
Yes, it’s currently the top performer on the SWE-bench Verified benchmark. But with a context window of just 200K tokens, I can’t imagine it handling very large codebases cleanly. And let’s be honest: a stronger model comes out every month or so. Claiming the crown for a few weeks doesn’t mean much.
That said, Claude 4 is still a very strong release. I’ll walk you through the most important details—features, use cases, benchmarks—and I’ll also run a few tests of my own.
Claude Sonnet 4
Claude Sonnet 4 is the smaller model in the Claude 4 family. It’s designed for general-purpose use and performs well across most common AI tasks—coding, writing, question answering, data analysis. It’s also available to free users, which makes it unusually accessible for a model of this quality.
The model supports a 200K context window, which allows it to handle large prompts and maintain continuity over long interactions. That’s useful for analyzing long documents, reviewing codebases, or generating multi-part responses with a consistent structure. However, 200K tokens may still fall short for very large codebases; for comparison, Gemini 2.5 Flash has a 1M-token context window.
Compared to Claude 3.7 Sonnet, this version is faster, better at following instructions, and more reliable in code-heavy workflows. It supports up to 64K output tokens, which helps with longer outputs like structured plans, multi-part answers, or large code completions.
Early reports show fewer navigation errors and better performance across app development tasks. It’s not as strong as Opus 4 when it comes to complex reasoning or long-term task planning, but for most workflows, it’s more than sufficient.
Claude Opus 4
Claude Opus 4 is the flagship model in the Claude 4 series. It’s built for tasks that require deeper reasoning, long-term memory, and more structured outputs—things like agentic search, large-scale code refactoring, multi-step problem solving, and extended research workflows.
Like Sonnet 4, it supports a 200K context window, which can be a disadvantage if you want to use it on a large codebase. For comparison, Gemini 2.5 Pro (Google’s flagship model) has a 1M-token context window.
It’s also capable of running in “extended thinking” mode, where it switches from fast responses to slower, more deliberate reasoning. This mode allows it to perform tool use, track memory across steps, and generate summaries of its own thought process when needed.
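If you plan to use this mode through the API rather than the chat app, extended thinking is exposed as an optional thinking parameter on the Messages endpoint. Here’s a minimal sketch using Anthropic’s JavaScript SDK; the exact model ID string is my assumption based on Anthropic’s naming convention, so double-check it against the docs:

```javascript
// Minimal sketch: enabling extended thinking via the Anthropic JavaScript SDK.
// The model ID is an assumption based on Anthropic's naming scheme; verify it in the docs.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

const response = await client.messages.create({
  model: "claude-opus-4-20250514",                    // assumed model ID
  max_tokens: 16000,                                  // must exceed the thinking budget
  thinking: { type: "enabled", budget_tokens: 8000 }, // token budget for deliberate reasoning
  messages: [{ role: "user", content: "Plan a step-by-step refactor of a legacy module." }],
});

// The response interleaves "thinking" blocks (the reasoning summary) with "text" blocks.
console.log(response.content.filter((block) => block.type === "text"));
```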
Anthropic has positioned it as a high-end model for developers, researchers, and teams building AI agents. It leads on SWE-bench Verified and Terminal-bench, and early users report strong performance in coding agents, search workflows, and multi-hour tasks like refactoring open-source projects or simulating long-term planning.
Unlike Sonnet, Opus 4 is only available on paid plans. It’s more expensive to run, and likely overkill for simple chatbot use. But for applications that need consistent reasoning across a lot of moving parts, it’s the more capable choice.
Testing Claude 4
When testing a new model, I typically use the same tasks—that way I can see how it compares to other models I’ve tested before. This isn’t meant to be an extensive evaluation. The goal is just to get a feel for how these models behave inside the chat interface.
Let’s try Claude 4 in two areas: math and coding.
Math
I like to start with a simple calculation that often confuses language models. This isn’t about checking basic arithmetic—I could just use a calculator for that. The point is to see how the model approaches a slightly tricky problem, and whether it can fall back on tool use or show its reasoning clearly when needed.
Let’s see how well Claude Sonnet 4 handled it:
As you can see, it got the answer wrong on the first attempt. But when I asked it to use a tool—a calculator—it responded by writing a one-line script in JavaScript and solved the problem correctly.
Claude Opus 4 answered correctly on the first attempt.
Next, I wanted to see how well Claude Sonnet 4 handles a more complex problem: use all digits from 0 to 9 exactly once to make three numbers x, y, z such that x + y = z.
After about five minutes of random brute-force attempts, I got a message saying the output limit had been reached and I needed to click “Continue” to resume. I did, and Claude tried again, but then hit the limit once more. What I appreciated, though, is that it didn’t make up an answer: it simply refused to answer when it couldn’t find one. That’s a big win, in my opinion, because hallucinating a solution would have been much worse.
Then I tried Claude Opus 4 on the same task. The answer came back almost instantly, and it was correct: 246 + 789 = 1035. Opus 4 is impressive!
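As a quick sanity check on that answer, here’s a tiny JavaScript snippet I’d use to verify it satisfies the puzzle’s constraints (this is my own check, not something either model produced):

```javascript
// Verify a candidate solution: x + y must equal z, and the digits of x, y, z together
// must use each of 0–9 exactly once.
function isValidSolution(x, y, z) {
  const digits = `${x}${y}${z}`.split("").sort().join("");
  return x + y === z && digits === "0123456789";
}

console.log(isValidSolution(246, 789, 1035)); // true – Opus 4's answer checks out
console.log(isValidSolution(123, 456, 579));  // false – sums correctly, but digits repeat and 0/8 are missing
```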
Coding
For the coding task, I decided to go straight to Claude Opus 4. This kind of creative generation feels more suited to its capabilities. I’m not testing it on large codebases here—just a relatively trivial coding task.
I asked it to make a quick p5.js game using this prompt that I used for Gemini 2.5 Pro and o4-mini:
Prompt: Make me a captivating endless runner game. Key instructions on the screen. p5.js scene, no HTML. I like pixelated dinosaurs and interesting backgrounds.
Normally, I’d copy-paste the code into an online p5.js editor to test it. But one of the nice features of the Claude interface is Artifacts, which lets me view and run the code output directly inside the chat.
Let’s see the result:
No previous model I’ve tested got the starting screen right on the first try—most of them just jumped straight into the gameplay. Claude Opus 4 actually displayed a proper start screen with instructions, which was a nice surprise.
However, there was a visual bug: the pixelated dinosaur left a confusing trail as it moved across the screen. The pixels weren’t being cleared properly between frames, which ruined the gameplay. I pointed this out and asked Opus 4 to correct it.
Perfection! I’ve never gotten such a clean and playable version of this game from any other model.
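For readers who want to poke at those two details themselves, here’s a stripped-down p5.js skeleton showing the usual pattern for both: gating gameplay behind a start-screen state, and calling background() at the top of draw() so sprites don’t leave trails. This is my own illustration, not the code Opus 4 generated.

```javascript
// Stripped-down p5.js skeleton: a start screen plus per-frame canvas clearing.
// Illustrative only; not the code Claude Opus 4 produced.
let gameState = "start"; // "start" or "playing"
let dinoX = 50;

function setup() {
  createCanvas(600, 200);
  noSmooth(); // keeps scaled pixel art crisp
}

function draw() {
  background(220); // clearing every frame is what prevents the "trail" artifact

  if (gameState === "start") {
    textAlign(CENTER, CENTER);
    textSize(18);
    text("Endless Runner\nPress SPACE to start", width / 2, height / 2);
    return;
  }

  // Gameplay placeholder: a square "dino" running along the ground.
  rect(dinoX, height - 40, 20, 20);
  dinoX = (dinoX + 3) % width;
}

function keyPressed() {
  if (key === " " && gameState === "start") gameState = "playing";
}
```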
Claude 4 Benchmarks
Claude 4 models were tested on a range of standard benchmarks across coding, reasoning, and agentic tasks. While these scores don’t tell the full story of model quality, they’re still useful as a point of comparison. Below are the key results for Claude Sonnet 4 and Claude Opus 4.
Source: Anthropic
Claude Sonnet 4
Claude Sonnet 4 performs surprisingly well for a model that’s available to free users. On SWE-bench Verified, which tests real-world coding tasks, it scores 72.7%, slightly edging out Opus 4 (72.5%) and significantly ahead of Claude 3.7 Sonnet (62.3%). It also outperforms OpenAI’s GPT-4.1 (54.6%) and Gemini 2.5 Pro (63.2%).
On other benchmarks:
- TerminalBench (CLI-based coding): 35.5% – ahead of GPT-4.1 (30.3%) and Gemini (25.3%)
- GPQA Diamond (graduate-level reasoning): 75.4% – strong, though slightly below OpenAI o3 and Gemini
- TAU-bench (agentic tool use): 80.5% Retail / 60.0% Airline – comparable to Opus 4 and ahead of GPT-4.1 and o3
- MMMLU (multilingual Q&A): 86.5% – just behind Opus and o3, but still solid
- MMMU (visual reasoning): 74.4% – the lowest score across the model lineup
- AIME (math competition): 70.5% – better than Sonnet 3.7, but still behind the leading reasoning models
Sonnet 4 is arguably one of the best-performing free-tier models currently available and competitive with models that require payment or commercial access.
Claude Opus 4
Opus 4 is Anthropic’s flagship model, and it performs at or near the top across most benchmarks. On SWE-bench Verified, it scores 72.5%, and in high-compute settings, that jumps to 79.4%—the highest among all compared models.
It also leads or ranks near the top on:
- TerminalBench (agentic CLI coding): 43.2% (50.0% in high-compute mode) – the strongest score in the chart
- GPQA Diamond (graduate-level reasoning): 79.6% (83.3% in high-compute mode) – solid, slightly behind OpenAI o3 and Gemini 2.5 Pro
- TAU-bench (agentic tool use): 81.4% Retail / 59.6% Airline – on par with Sonnet 4 and 3.7
- MMMLU (multilingual Q&A): 88.8% – tied with OpenAI o3
- MMMU (visual reasoning): 76.5% – behind o3 and Gemini 2.5 Pro
- AIME (math competition): 75.5% (90.0% high-compute) – significantly above Claude Sonnet 4
How to Access Claude 4
Claude 4 is available through multiple channels, depending on how you want to use it—whether that’s casual chat, development via API, or integration into enterprise workflows. Here’s how access works:
Chat access
You can use Claude 4 directly through the Claude.ai web interface or mobile apps (iOS and Android).
- Claude Sonnet 4 is available to all users, including those on the free tier. This makes it one of the most capable models you can try without paying.
- Claude Opus 4 is available only to paying users on the Pro, Max, Team, or Enterprise plans.
API access
For developers, both models can be accessed via the Anthropic API, and they’re also available on Amazon Bedrock and Google Cloud Vertex AI.
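For a rough sense of what a call looks like, here’s a minimal example using Anthropic’s official JavaScript SDK. The model ID strings follow Anthropic’s usual date-suffixed naming and are my assumption; check the model documentation for the exact identifiers:

```javascript
// Minimal Anthropic API call via the official JavaScript SDK (@anthropic-ai/sdk).
// Model IDs are assumptions based on Anthropic's naming convention; verify against the docs.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic(); // uses the ANTHROPIC_API_KEY environment variable

const message = await client.messages.create({
  model: "claude-sonnet-4-20250514", // swap in "claude-opus-4-20250514" for Opus 4
  max_tokens: 1024,
  messages: [{ role: "user", content: "Summarize the trade-offs between Sonnet 4 and Opus 4." }],
});

console.log(message.content[0].text);
```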
API pricing (as of May 2025):
- Claude Opus 4: $15 per million input tokens, $75 per million output tokens
- Claude Sonnet 4: $3 per million input tokens, $15 per million output tokens
Batch processing and prompt caching can reduce costs by up to 90% in some cases.
Conclusion
Claude Sonnet 4 offers real value as a fast, capable model that’s free to use and performs well across coding, reasoning, and general assistant tasks. For most day-to-day needs, it’s more than enough.
Opus 4, on the other hand, is built for deeper reasoning and complex workflows. The coding results—especially in creative generation and problem-solving—were some of the best I’ve seen from any model so far.