In sports, you would call this a counterattack. Just 30 minutes after Anthropic published their new Claude Opus 4.6 model, OpenAI released a major update as well.
Their new GPT-5.3-Codex model replaces both GPT-5.2 and GPT-5.2-Codex. Its main focus is on combining the strengths of these two legacy models to provide a more general agentic experience. In combination with the Codex app for macOS, introduced only a few days earlier, it also enables interactive, real-time collaboration without the risk of losing context.
In this article, we will cover all the new features, take a look at the benchmarks, and see how GPT-5.3-Codex works in a couple of hands-on examples. We will also try to examine how well the model actually performs and how it compares to Anthropic’s Claude Opus 4.6.
If you are interested in learning more about OpenAI’s latest features, I recommend reading our guides on ChatGPT Images and ChatGPT Health.
What is GPT-5.3-Codex?
GPT-5.3-Codex is OpenAI’s newest large language model (LLM), following up on GPT-5.2 and GPT-5.2-Codex, which were both released in December 2025.
The new release takes a different approach from these two legacy models. Where the GPT-5.2 generation drew a clear line between coding agent and reasoning LLM, GPT-5.3-Codex merges the two and is introduced as a general-purpose agent that excels at both.
GPT-5.3-Codex is designed not just to write functions, but also to understand the work around the code. Think of updating Jira tickets, writing documentation, or managing deployment pipelines.
Performance-wise, the new model almost doubles its score in the OSWorld-Verified benchmark and sets new high scores for both SWE-Bench Pro and Terminal-Bench. Additionally, OpenAI focused on efficiency and claims that the new model will be 25% faster due to improvements in infrastructure and the inference stack.
One notable thing is that OpenAI apparently used GPT-5.3-Codex to actively debug and manage its own creation. While other frontier models like Gemini 3 generated their own training data, Codex went a step further by acting as a site reliability engineer: monitoring its own training runs, diagnosing infrastructure errors, and writing scripts to dynamically scale GPU clusters during launch.
Key Features of GPT-5.3-Codex
The release of GPT-5.3-Codex focused on enabling general agentic workflows. Let’s take a look at some key features.
The general work agent
In contrast to its Codex predecessor, GPT-5.3-Codex is designed to be a general work agent. The aim is to transcend the IDE, with the model effectively handling “knowledge work” alongside “coding work.”
The new model is built to support all work across the software lifecycle:
- Engineering and operations: Handling the technical "heavy lifting" like debugging, testing, deploying, and ongoing monitoring of systems.
- Product and planning: Supporting the strategic side of development by writing product requirements documentation and assisting with user research.
- Analysis and communication: Managing the "soft skills" of software delivery, including editing copy and tracking project metrics.
This versatility enables GPT-5.3-Codex to execute end-to-end workflows. The model could, for instance, write an SQL query, fetch the data, and then generate a PDF report or slide deck based on it via tool calls.
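To make that end-to-end idea concrete, here is a minimal sketch of how such a tool-call loop could be wired up with the OpenAI Python SDK. The model id, the `run_sql` tool, and the local `reports.db` database are illustrative assumptions rather than confirmed parts of the release (the model is not in the API yet), so treat this as the general pattern, not a recipe.

```python
# Minimal tool-call loop sketch. The model id "gpt-5.3-codex" and the
# run_sql tool are illustrative assumptions, not a confirmed API surface.
import json
import sqlite3

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a read-only SQL query against the local reports database.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]


def run_sql(query: str) -> str:
    """Execute the query against a local SQLite file and return rows as JSON."""
    with sqlite3.connect("reports.db") as conn:  # hypothetical database
        return json.dumps(conn.execute(query).fetchall())


messages = [{"role": "user", "content": "Summarize last month's sales in a short report."}]

# First turn: let the model decide whether it needs the SQL tool.
reply = client.chat.completions.create(
    model="gpt-5.3-codex",  # placeholder id; GPT-5.3-Codex is not in the API yet
    messages=messages,
    tools=TOOLS,
).choices[0].message

if reply.tool_calls:
    messages.append(reply)
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_sql(**args),
        })
    # Second turn: the model writes the report from the query results.
    final = client.chat.completions.create(
        model="gpt-5.3-codex", messages=messages, tools=TOOLS
    )
    print(final.choices[0].message.content)
```

In a production agent, the same loop would simply grow more tools (a PDF renderer, a slide generator, a ticket updater) and keep iterating until the model stops requesting calls.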
Interactive real-time collaborator
The interactive collaborator feature is the standout perk of the Codex app and has the potential to make the biggest difference in everyday work. It keeps you in the loop throughout the process and lets you intervene in real time.
Essentially, GPT‑5.3-Codex constantly lets you know what it is doing and offers you the chance to steer it in the right direction long before you receive the final output. Instead of waiting, you can ask questions, give feedback, or add context to your initial prompt. The model then responds to your feedback and adapts mid-stream.
Currently, the Codex app is only available for macOS. You can turn on steering in the app settings under General > Follow-up behavior.
Cybersecurity focus
OpenAI also shifted its focus to cybersecurity, particularly to vulnerability detection. GPT-5.3-Codex is the first model classified as "high capability" under OpenAI’s Preparedness Framework, meaning it is specifically trained to identify and fix software vulnerabilities.
To balance this power with safety, OpenAI has deployed a defensive stack designed to prevent misuse, such as automating cyberattacks. It includes safety training, real-time monitoring, and Trusted Access for Cyber, a pilot program that gates advanced capabilities to verified researchers.
Furthermore, OpenAI is investing heavily in the ecosystem, launching the Aardvark security agent (currently in beta) and committing $10M in API credits to support open-source maintainers with free code scanning tools.
GPT-5.3-Codex Benchmarks
While we are still waiting for verified results in many of the state-of-the-art benchmarks, the announcement featured scores in several areas:
- Agentic workflows: OSWorld-Verified
- General coding: SWE-Bench Pro
- Agentic coding: Terminal-Bench 2.0
- Reasoning: GDPval
Agentic workflows
OSWorld-Verified is the gold-standard benchmark for testing an AI's ability to operate a computer like a human. It goes beyond simple text processing by placing the AI in a real virtual machine and asking it to complete open-ended tasks using a mouse, keyboard, and GUI apps (e.g., "Open LibreOffice, create a spreadsheet with this data, and save it as a PDF").
GPT-5.3-Codex achieves 64.7% in the OSWorld-Verified benchmark. That’s a staggering increase of 26.5 percentage points compared to its predecessor, GPT-5.2-Codex. This strong result reflects OpenAI’s focus on creating a more general, agentic experience for GPT-5.3-Codex, optimized for good performance across tasks and domains.
Coding
Software development was the initial focus of the Codex models. On SWE-bench Pro (Public), GPT-5.3-Codex reaches 56.8%, only a minor increase over the 56.4% of GPT-5.2-Codex. The incremental improvement here likely reflects the trade-off of optimizing for agentic skills.
On the agentic coding side, we see a significant jump: GPT-5.3-Codex scores 75.1% on Terminal-Bench 2.0, a substantial increase over the 64% of GPT-5.2-Codex. Even more interesting, it beat Claude Opus 4.6, which had claimed the top spot on the same benchmark just half an hour earlier, by over 5 percentage points.
Reasoning
For the model’s reasoning skills, there isn’t much exciting to report. GPT-5.3-Codex reaches exactly the same result as GPT-5.2 on GDPval (70.9%). A fair interpretation is that the (already solid) reasoning skills of GPT-5.2 were carried over into the Codex model, without a push for substantial improvement in this area.
How Can I Access GPT-5.3-Codex?
OpenAI announced that GPT-5.3-Codex is now available on all paid ChatGPT tiers: in the Codex app, from the CLI, via the IDE extension, and on the web.
The model is not yet available in the OpenAI API, but API access is expected to follow “soon”. There are no details on per-token pricing yet.
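For reference, once API access does arrive, a call should look like any other model in the OpenAI Python SDK. The identifier "gpt-5.3-codex" below is purely a guess for illustration; neither the model id nor the pricing has been confirmed.

```python
# Hypothetical: GPT-5.3-Codex is not available in the API yet, so the model
# id below is an assumed placeholder, not a confirmed identifier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.3-codex",  # placeholder id
    input="Review this diff and flag any obvious security issues: ...",
)
print(response.output_text)
```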
GPT-5.3-Codex vs. Claude Opus 4.6
The biggest competition for GPT-5.3-Codex in the arena of software development-focused agents is arguably Claude Opus 4.6. Let’s see how the two compare.
General approach and agentic style
The approaches of OpenAI and Anthropic are not entirely different, but there are some nuances to note.
GPT-5.3-Codex is positioned as a rather autonomous builder, optimized for speed (25% faster) and "self-correcting" loops to finish engineering tasks without human help.
On the other hand, Claude Opus 4.6 is designed for deep thinking, with its massive context window (1M tokens) and "adaptive thought" helping it handle complex, messy legacy projects.
The agentic style of both models is focused on interaction, though in slightly different ways. GPT-5.3-Codex’s “steerability” lets users interrupt it mid-task to change direction (e.g., "Wait, use the v2 API instead") without breaking the workflow.
Claude Opus 4.6 acts more like a senior partner that you converse with, offering "High/Medium/Low" effort settings to manage costs and depth.
While GPT-5.3-Codex was specifically optimized for NVIDIA GB200 NVL72 hardware to reduce latency in agentic loops, Claude Opus 4.6 focuses on software-side optimizations like conversation compaction to manage long histories efficiently.
Benchmarks and performance
Benchmark-wise, the two models are hard to compare directly. The only benchmark for which we have scores for both is Terminal-Bench 2.0, where GPT-5.3-Codex (75.1%) outperforms Claude Opus 4.6 (69.9%).
This suggests that while Claude may be the deeper thinker, GPT-5.3-Codex is the more capable "hands-on" operator for executing dev tasks in a real environment, such as navigating file systems, managing dependencies, or running builds.
Beyond that, the two companies made different choices about which benchmarks to include in their release notes. This divergence likely reflects a strategic choice by both labs to highlight their specific strengths while avoiding direct comparisons where they might not claim the #1 title.
Here’s an overview of what we know:
| Feature / Category | GPT-5.3-Codex (OpenAI) | Claude Opus 4.6 (Anthropic) |
| --- | --- | --- |
| General Approach | Autonomous builder: optimized for speed (25% faster) and "self-correcting" loops to finish engineering tasks independently | Deep thinker: uses "adaptive thought" to handle complex, messy legacy projects |
| Agentic Style | Steerable: allows users to interrupt mid-task to change direction without breaking the workflow | Senior partner: a conversational style with "High/Medium/Low" effort settings to manage costs and depth |
| Optimization Focus | Hardware-side: optimized for NVIDIA GB200 NVL72 hardware to reduce latency in agentic loops | Software-side: focuses on conversation compaction to efficiently manage long histories |
| Key Specs | Speed-focused architecture | 1M token context window |
| Terminal-Bench 2.0 | 75.1%: superior "hands-on" operator for executing dev tasks (file systems, builds, dependencies) | 69.9%: scores lower on execution tasks, leaning more towards reasoning than operation |
GPT-5.3-Codex Use Cases
The key features we introduced earlier make GPT-5.3-Codex a strong fit for a few use cases:
- Self-healing infrastructure: an agent that monitors logs, identifies a crash, fixes the code, and redeploys without human input (see the sketch after this list)
- Legacy migration: translating from legacy languages like COBOL to modern stacks, where the general agent can rewrite the documentation simultaneously
- Cybersecurity: as the first model rated "high capability" for security tasks, it is well suited for automated penetration testing and patching
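As a rough illustration of the self-healing idea, the loop below polls a service log, looks for a crash signature, and hands the excerpt to a (stubbed) agent step before redeploying. The log path, crash pattern, and `systemctl` restart command are hypothetical placeholders; a real setup would delegate the diagnosis and patching to the Codex agent itself.

```python
# Conceptual sketch of a self-healing loop. Paths, patterns, and the
# redeploy command are hypothetical; the agent call is stubbed out.
import re
import subprocess
import time
from pathlib import Path

LOG_FILE = Path("/var/log/myapp/app.log")       # hypothetical service log
CRASH_PATTERN = re.compile(r"Traceback|FATAL")  # naive crash signature


def diagnose_and_patch(log_excerpt: str) -> bool:
    """Stub for the agent step: send the excerpt to a coding agent,
    apply the suggested patch, and report whether a fix was produced."""
    print("Would hand to the agent:\n", log_excerpt[-500:])
    return False


def redeploy() -> None:
    """Hypothetical redeploy hook, e.g. restarting a systemd unit."""
    subprocess.run(["systemctl", "restart", "myapp"], check=False)


while True:
    tail = LOG_FILE.read_text()[-4000:] if LOG_FILE.exists() else ""
    if CRASH_PATTERN.search(tail):
        if diagnose_and_patch(tail):
            redeploy()
    time.sleep(60)  # poll once a minute
```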
Final Thoughts
As good a coder as GPT-5.2-Codex, as good a thinker as GPT-5.2, and still more than that: with GPT-5.3-Codex, OpenAI has taken a step away from isolated models and towards a capable general-purpose agent. While many evaluations are still to come, the first benchmark results look promising.
The interactive collaboration feature is very neat, but for now, it is limited to the macOS Codex app. Users also still need to wait for API access.
Being optimized for speed and autonomous creation, GPT-5.3-Codex takes a different approach than Claude Opus 4.6 and outperforms it on Terminal-Bench 2.0, but the detailed differences in performance are still hard to assess, as both models have just dropped. Time will tell what the fuller picture of the comparison will look like.
If you’re interested in learning more about the concepts and capabilities of agentic tools, I recommend enrolling in our AI Agent Fundamentals skill track.
GPT-5.3-Codex FAQs
What is GPT-5.3-Codex?
It is OpenAI’s latest general-purpose agentic model, released in February 2026. It replaces GPT-5.2 and GPT-5.2-Codex by merging "coding agent" and "reasoning LLM" capabilities into a single model designed to handle end-to-end work, from writing code and debugging to updating Jira tickets and creating documentation.
How does the "interactive collaborator" feature work?
This feature allows you to steer the model in real-time while it works. Instead of waiting for the final result, you can watch its progress in the Codex macOS app and intervene mid-task to ask questions, provide feedback, or correct its course without breaking the workflow.
How does GPT-5.3-Codex compare to Claude Opus 4.6?
While Claude Opus 4.6 is positioned as a "deep thinker" for complex legacy projects, GPT-5.3-Codex is optimized as a faster, autonomous "builder." In benchmarks, GPT-5.3-Codex outperforms Claude on practical execution tasks (like Terminal-Bench 2.0) but may differ in reasoning style.
Is GPT-5.3-Codex safe for cybersecurity tasks?
Yes, but with guardrails. It is the first model classified as "high capability" for vulnerability detection under OpenAI’s Preparedness Framework. To ensure safety, OpenAI uses a defensive stack to prevent misuse (like automated attacks) and limits certain advanced capabilities to verified researchers.
How can I access GPT-5.3-Codex?
The model is currently available to all paid ChatGPT subscribers. You can access it via the web interface, the new Codex app for macOS, the Command Line Interface (CLI), or the IDE extension. API access is planned but not yet available.
