OpenAI recently released GPT-4.1, a new family of models built specifically for coding tasks. The naming jump backward from GPT-4.5 to GPT-4.1 is as confusing to me as it is to everyone else, but thankfully the benchmarks move in the opposite direction.
The rollout began on April 14 with API-only access. Then, on May 14, OpenAI started bringing GPT-4.1 into the ChatGPT app. Free-tier users won’t be able to pick GPT-4.1 manually, but they now benefit from GPT-4.1 Mini as the new default fallback, replacing GPT-4o Mini.
GPT-4.1 comes in three sizes: GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano. All three support up to 1 million tokens of context and bring notable improvements in coding, instruction following, and long-context comprehension. They’re also cheaper and faster than previous releases.
In this article, I’ll walk you through what each model can do, how it compares to GPT-4o and GPT-4.5, and where it stands in benchmarks and real-world use.
What Is GPT-4.1?
The GPT-4.1 model suite consists of three models: GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano. They are aimed at developers who need better performance, longer context, and more predictable instruction following. Each model supports up to 1 million tokens of context, a big leap from the 128K limit in previous versions like GPT-4o.

Source: OpenAI
Despite the shared architecture, each version is tuned for different use cases. Let’s explore each in more detail.
GPT-4.1
This is the flagship model. If you want the best overall performance across coding, instruction following, and long-context tasks, this is the one to use. It’s built to handle complex coding workflows or process large documents in a single prompt.
In benchmarks, it outperforms GPT-4o in real-world software engineering (SWE-bench), instruction following (MultiChallenge), and long-context reasoning (MRCR, Graphwalks). It’s also noticeably better at respecting structure and formatting—think XML responses, ordered instructions, and negative constraints like “don’t answer unless…”.
You can also fine-tune GPT-4.1 as of launch day, which opens it up to more production use cases where control over tone, format, or domain knowledge matters.
GPT-4.1 Mini
GPT-4.1 Mini is the mid-tier option, offering nearly the same capabilities as the full model but with lower latency and cost. It matches or beats GPT-4o in many benchmarks, including instruction following and image-based reasoning.
It’s likely to become the default choice for many use cases: fast enough for interactive tools, smart enough to follow detailed instructions, and significantly cheaper than the full model.
Like the full version, it supports 1 million tokens of context and is already available for fine-tuning.
GPT-4.1 Nano
Nano is the smallest, fastest, and cheapest of the bunch. It’s built for tasks like autocomplete, classification, and extracting information from large documents. Despite being lightweight, it still supports the full 1 million-token context window.
It’s also OpenAI’s fastest and cheapest model to date, at roughly 10 cents per million input tokens. You don’t get the full reasoning and planning ability of the larger models, but for the tasks Nano is built for, that’s not the point.
GPT-4.1 vs GPT-4o and GPT-4.5
Before we get into the benchmarks (which we’ll cover in detail in the next section), it’s worth understanding how GPT-4.1 differs in practice from GPT-4o and GPT-4.5.

Smarter than GPT-4o, more practical than GPT-4.5
GPT-4.1 improves on GPT-4o’s capabilities while keeping latency in roughly the same range. In practice, that means developers get better performance without sacrificing responsiveness.

Source: OpenAI
Let’s break down the graph above:
- GPT-4.1 and GPT-4o land in roughly the same spot for latency, but GPT-4.1 pulls ahead in intelligence. So you’re getting a smarter (and cheaper) model at a similar speed.
- GPT-4.1 Mini is also more capable than GPT-4o Mini, but that boost comes with a slight tradeoff: a small increase in latency.
- GPT-4.1 Nano is the clear winner on speed, sitting farthest left on the graph. It’s not as strong on MMLU, but that’s expected—it’s built for lightweight tasks where speed matters more than raw reasoning.
GPT-4.5, on the other hand, was always positioned as a research preview. While it had strong reasoning and writing quality, it came with more overhead. GPT-4.1 delivers similar or better results on key benchmarks but is cheaper and more responsive—enough that OpenAI plans to retire 4.5 entirely by mid-July to free up more GPUs.
1 million tokens of context
All three GPT-4.1 models—standard, Mini, and Nano—support up to 1 million tokens of context, nearly 8x the 128K limit of GPT-4o.
This long-context capacity enables practical use cases like processing entire logs, indexing code repositories, handling multi-document legal workflows, or analyzing long transcripts—all without needing to chunk or summarize beforehand.
Better instruction following
GPT-4.1 also marks a shift in how reliably the models follow instructions. It handles complex prompts involving ordered steps, formatting constraints, and negative conditions (like refusing to answer if formatting is wrong).
In practice, that means two things: less time spent crafting prompts, and less time cleaning up the output afterward.
GPT-4.1 Benchmarks
GPT-4.1 shows progress across four core areas: coding, instruction following, long-context comprehension, and multimodal tasks.
Coding performance
On SWE-bench Verified—a benchmark that drops the model into a real codebase and asks it to resolve issues end-to-end—GPT-4.1 scores 54.6%. That’s up from 33.2% for GPT-4o and 38% for GPT-4.5. Notably, GPT-4.1 also scores higher here than o1 and o3-mini.

Source: OpenAI
It also more than doubles GPT-4o’s performance on Aider’s polyglot diff benchmark, reaching 52.9% accuracy on code diffs across multiple languages and formats. GPT-4.5 scored 44.9% on the same task. GPT-4.1 is also more precise: in internal evals, extraneous code edits dropped from 9% (GPT-4o) to just 2%.

Source: OpenAI
Outside of benchmark scores, OpenAI’s frontend coding demo gives a good visual sense of the improvement. The team asked GPT-4.1 and GPT-4o to build the same flashcard app, and human raters preferred GPT-4.1’s output 80% of the time.

Source: OpenAI
Windsurf, one of the alpha testers, reported a 60% improvement on their own internal coding benchmark. Another company, Qodo, tested GPT-4.1 on real GitHub pull requests and found it produced better suggestions 55% of the time, with fewer irrelevant or overly verbose edits.
Instruction following
GPT-4.1 is more literal—and more reliable—when it comes to following instructions, especially for tasks involving multiple steps, formatting rules, or conditions. On OpenAI’s internal instruction following eval (hard subset), GPT-4.1 scored 49.1%, compared to just 29.2% for GPT-4o. GPT-4.5 is slightly ahead here at 54%, but the gap between 4.1 and 4o is significant.

Source: OpenAI
On MultiChallenge, which tests whether a model can follow multi-turn instructions and remember constraints introduced earlier in the conversation, GPT-4.1 scores 38.3%—up from 27.8% for GPT-4o. And on IFEval, which tests compliance with clearly specified output requirements, GPT-4.1 hits 87.4%, a solid improvement over GPT-4o’s 81%.
In practice, this means GPT-4.1 is better at sticking to ordered steps, rejecting malformed inputs, and responding in the format you asked for—especially in structured outputs like XML, YAML, or markdown. That also makes it easier to build reliable agent workflows without lots of prompt retries.
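To make this concrete, here’s a minimal sketch of the kind of constraint-heavy prompt this refers to, using the OpenAI Python SDK (the invoice task and prompt wording are illustrative assumptions, not something from OpenAI’s evals):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ordered steps, a strict output format, and a negative constraint: the kind of
# instruction profile the evals above measure.
system_prompt = (
    "Follow these steps in order: 1) extract the invoice number, 2) extract the total. "
    "Respond only in YAML with the keys invoice_number and total. "
    "If either field is missing from the input, respond with exactly: ERROR"
)

response = client.chat.completions.create(
    model="gpt-4.1",  # check the models endpoint for the exact id available to you
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Invoice #4821, total due: $1,250.00"},
    ],
)
print(response.choices[0].message.content)
```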
Long-context reasoning
All three GPT-4.1 models—standard, Mini, and Nano—support up to 1 million tokens of context, nearly 8x that of GPT-4o, which topped out at 128K. Just as important: there’s no extra cost for using that context window. It’s priced like any other prompt.
But can the models actually use all that context? In OpenAI’s needle-in-a-haystack eval, GPT-4.1 reliably found inserted content placed at any point—start, middle, or end—within the full 1M-token input.

Source: OpenAI
Graphwalks, a benchmark that tests multi-hop reasoning in long contexts, puts GPT-4.1 at 61.7%—a solid jump from GPT-4o’s 41.7%, though still below GPT-4.5 at 72.3%.
These improvements show up in real-world tests, too. Thomson Reuters saw a 17% boost in multi-document legal analysis using GPT-4.1, while Carlyle reported a 50% improvement in extracting granular data from dense financial reports.
Multimodal and vision tasks
On multimodal tasks, GPT-4.1 also makes progress. It scored 72.0% on the Video-MME benchmark, which involves answering questions about 30–60 minute videos with no subtitles—up from 65.3% with GPT-4o.
On image-heavy benchmarks like MMMU, it reached 74.8% vs. 68.7% for GPT-4o. On MathVista, which includes charts, graphs, and math visuals, GPT-4.1 hit 72.2%.
One surprise: GPT-4.1 Mini performs almost as well as the full version on some of these benchmarks. On MathVista, for instance, it slightly outscored GPT-4.1 at 73.1%. That makes it a compelling choice for use cases that combine speed with vision-heavy prompts.

Source: OpenAI
How to Access GPT-4.1
You can now access GPT-4.1 and GPT-4.1 Mini directly in the ChatGPT app, not just through the API. Plus, Pro, and Team subscribers can manually select GPT-4.1 from the model menu, while free users automatically fall back to GPT-4.1 Mini—replacing GPT-4o Mini as the default behind the scenes. Enterprise and Education plans are expected to receive access in the coming weeks.
For developers, the OpenAI API and Playground continue to support all three variants—GPT-4.1, GPT-4.1 Mini, and GPT-4.1 Nano. This remains the best way to test prompts, explore long-context behavior, and compare models before integrating them into production.
If you’re working with long documents—think logs, PDFs, legal records, or academic articles—you can send up to 1 million tokens in a single call, no special parameter needed. There’s also no pricing bump for long context: token costs are flat, regardless of input size.
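For example, a minimal sketch with the OpenAI Python SDK might look like this (the log file and prompt are placeholders; swap in whichever GPT-4.1 variant fits your latency and cost budget):

```python
from openai import OpenAI

client = OpenAI()

# Read a long document; GPT-4.1 models accept up to 1M tokens of input,
# so there is no need to chunk or summarize it first.
with open("server_logs.txt", "r", encoding="utf-8") as f:
    long_document = f.read()

response = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[
        {
            "role": "user",
            "content": f"Summarize the recurring errors in these logs:\n\n{long_document}",
        },
    ],
)
print(response.choices[0].message.content)
```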
Fine-tuning GPT-4.1
You can fine-tune all three GPT-4.1 variants. That opens the door for custom instructions, domain-specific vocabulary, or tone-specific outputs. Note that fine-tuned models come with higher per-token rates, plus a one-time training cost:
| Model | Input | Cached Input | Output | Training |
|---|---|---|---|---|
| GPT-4.1 | $3.00 / 1M tokens | $0.75 / 1M tokens | $12.00 / 1M tokens | $25.00 / 1M tokens |
| GPT-4.1 Mini | $0.80 / 1M tokens | $0.20 / 1M tokens | $3.20 / 1M tokens | $5.00 / 1M tokens |
| GPT-4.1 Nano | $0.20 / 1M tokens | $0.05 / 1M tokens | $0.80 / 1M tokens | $1.50 / 1M tokens |
If you’ve previously fine-tuned GPT-3.5 or GPT-4 models, the process remains mostly the same—just pick the newer base. If you want to learn more, I recommend this tutorial on fine-tuning GPT-4o mini.
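For reference, here’s a minimal sketch of that workflow with the OpenAI Python SDK, assuming you already have a chat-formatted JSONL file (the file name and base-model snapshot below are placeholders; check the fine-tuning docs for the exact ids):

```python
from openai import OpenAI

client = OpenAI()

# Upload a JSONL file of chat-formatted training examples
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start a fine-tuning job against a GPT-4.1 base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4.1-2025-04-14",  # assumed snapshot id
)
print(job.id, job.status)
```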
GPT-4.1 pricing
One of the more welcome updates with GPT-4.1 is that it’s not just smarter—it’s also cheaper. OpenAI says the goal was to make these models more usable across more real-world workflows, and that shows in how pricing is structured.
Here’s how the three models are priced for inference:
| Model | Input | Cached Input | Output | Blended Avg. Cost* |
|---|---|---|---|---|
| GPT-4.1 | $2.00 / 1M tokens | $0.50 / 1M tokens | $8.00 / 1M tokens | $1.84 |
| GPT-4.1 Mini | $0.40 / 1M tokens | $0.10 / 1M tokens | $1.60 / 1M tokens | $0.42 |
| GPT-4.1 Nano | $0.10 / 1M tokens | $0.025 / 1M tokens | $0.40 / 1M tokens | $0.12 |
*The “blended” number is based on OpenAI’s assumption of typical input/output ratios.
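If you want to estimate what a specific workload costs, you can apply the per-million-token rates from the table directly. Here’s a small illustrative helper (the token counts in the example are made up, and prompt caching is ignored):

```python
# Per-million-token prices from the table above (uncached input)
PRICES = {
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at standard (uncached) rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 200K-token document summarized into 2K tokens on GPT-4.1 Mini
print(f"${request_cost('gpt-4.1-mini', 200_000, 2_000):.4f}")  # about $0.08
```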
Conclusion
GPT-4.1 comes with more reliable code generation, better instruction following, true long-context processing, and faster iteration.
The naming might be confusing, but the models themselves are clearly more capable than what came before. They’re also more affordable—and more usable—especially in production environments where latency, cost, and predictability matter.
If you’re working with GPT-4o today, it’s worth testing GPT-4.1.
FAQs
Is GPT-4.1 available for users in ChatGPT?
Yes. GPT-4.1 is available for Plus, Pro, and Team subscribers via the model menu. Free users automatically use GPT-4.1 Mini, and Enterprise/Edu plans are expected to gain access soon.
Does GPT-4.1 support image generation or voice output?
No. GPT-4.1 is multimodal in terms of input—it can process images—but it only outputs text.
Can I use GPT-4.1 in apps like Cursor or Windsurf?
You can use GPT-4.1 in Windsurf, which currently offers it with a limited-time free trial. Cursor has also added support for GPT-4.1.
I’m an editor and writer covering AI blogs, tutorials, and news, ensuring everything fits a strong content strategy and SEO best practices. I’ve written data science courses on Python, statistics, probability, and data visualization. I’ve also published an award-winning novel and spend my free time on screenwriting and film directing.





