In sports, you would call this a counterattack. Just 30 minutes after Anthropic published their new Claude Opus 4.6 model, OpenAI released a major update as well.
Their new GPT-5.3-Codex model replaces both GPT-5.2 and GPT-5.2-Codex. Its main focus is on combining the strengths of these two legacy models to provide a more general agentic experience. In combination with the Codex app for macOS, introduced only a few days earlier, it also enables interactive, real-time collaboration without the risk of losing context.
In this article, we will cover all the new features, take a look at the benchmarks, and see how GPT-5.3-Codex works in a couple of hands-on examples. We will also try to examine how well the model actually performs and how it compares to Anthropic’s Claude Opus 4.6.
If you are interested in learning more about OpenAI’s latest features, I recommend reading our guides on ChatGPT Images and ChatGPT Health.
What is GPT-5.3-Codex?
GPT-5.3-Codex is OpenAI’s newest large language model (LLM), following up on GPT-5.2 and GPT-5.2-Codex, which were both released in December 2025.
The new release takes a different approach from these two legacy models. Where the GPT-5.2 generation drew a clear line between coding agent and reasoning LLM, GPT-5.3-Codex merges the two and is introduced as a general-purpose agent that excels at both.
GPT-5.3-Codex is designed not just to write functions, but also to understand the work around the code. Think of updating Jira tickets, writing documentation, or managing deployment pipelines.
Performance-wise, the new model almost doubles its score in the OSWorld-Verified benchmark and sets new high scores for both SWE-Bench Pro and Terminal-Bench. Additionally, OpenAI focused on efficiency and claims that the new model will be 25% faster due to improvements in infrastructure and the inference stack.
One notable thing is that OpenAI apparently used GPT-5.3-Codex to actively debug and manage its own creation. While other frontier models like Gemini 3 generated their own training data, Codex went a step further by acting as a site reliability engineer: monitoring its own training runs, diagnosing infrastructure errors, and writing scripts to dynamically scale GPU clusters during launch.
Key Features of GPT-5.3-Codex
The release of GPT-5.3-Codex focused on enabling general agentic workflows. Let’s take a look at some key features.
The general work agent
In contrast to its Codex predecessor, GPT-5.3-Codex is designed to be a general work agent. The aim is to transcend the IDE, with the model effectively handling “knowledge work” alongside “coding work.”
The new model is built to support all work across the software lifecycle:
- Engineering and operations: Handling the technical "heavy lifting" like debugging, testing, deploying, and ongoing monitoring of systems.
- Product and planning: Supporting the strategic side of development by writing product requirements documentation and assisting with user research.
- Analysis and communication: Managing the "soft skills" of software delivery, including editing copy and tracking project metrics.
This versatility enables GPT-5.3-Codex to execute end-to-end workflows. The model could, for instance, write an SQL query, fetch the data, and then generate a PDF report or slide deck based on it via tool calls.
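To make that end-to-end idea concrete, here is a minimal sketch of how such a tool-call loop could be wired up with the OpenAI Python SDK. The model id, the `run_sql` tool, and the local `reports.db` database are illustrative assumptions rather than confirmed parts of the release (the model is not in the API yet), so treat this as the general pattern, not a recipe.

```python
# Minimal tool-call loop sketch. The model id "gpt-5.3-codex" and the
# run_sql tool are illustrative assumptions, not a confirmed API surface.
import json
import sqlite3

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_sql",
        "description": "Run a read-only SQL query against the local reports database.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]


def run_sql(query: str) -> str:
    """Execute the query against a local SQLite file and return rows as JSON."""
    with sqlite3.connect("reports.db") as conn:  # hypothetical database
        return json.dumps(conn.execute(query).fetchall())


messages = [{"role": "user", "content": "Summarize last month's sales in a short report."}]

# First turn: let the model decide whether it needs the SQL tool.
reply = client.chat.completions.create(
    model="gpt-5.3-codex",  # placeholder id; GPT-5.3-Codex is not in the API yet
    messages=messages,
    tools=TOOLS,
).choices[0].message

if reply.tool_calls:
    messages.append(reply)
    for call in reply.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_sql(**args),
        })
    # Second turn: the model writes the report from the query results.
    final = client.chat.completions.create(
        model="gpt-5.3-codex", messages=messages, tools=TOOLS
    )
    print(final.choices[0].message.content)
```

In a production agent, the same loop would simply grow more tools (a PDF renderer, a slide generator, a ticket updater) and keep iterating until the model stops requesting calls.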
Interactive real-time collaborator
The interactive collaborator feature is the standout perk of the Codex app and has the potential to make the biggest difference in everyday work. It keeps you in the loop throughout the process and lets you intervene in real time.
Essentially, GPT‑5.3-Codex constantly lets you know what it is doing and offers you the chance to steer it in the right direction long before you receive the final output. Instead of waiting, you can ask questions, give feedback, or add context to your initial prompt. The model then responds to your feedback and adapts mid-stream.
Currently, the Codex app is only available for macOS. You can turn on steering in the app settings under General > Follow-up behavior.
Cybersecurity focus
OpenAI also shifted its focus to cybersecurity, particularly to vulnerability detection. GPT-5.3-Codex is the first model classified as "high capability" under OpenAI’s Preparedness Framework, meaning it is specifically trained to identify and fix software vulnerabilities.
To balance this power with safety, OpenAI has deployed a defensive stack designed to prevent misuse, such as automating cyberattacks. It includes safety training, real-time monitoring, and Trusted Access for Cyber, a pilot program that gates advanced capabilities to verified researchers.
Furthermore, OpenAI is investing heavily in the ecosystem, launching the Aardvark security agent (currently in beta) and committing $10M in API credits to support open-source maintainers with free code scanning tools.
GPT-5.3-Codex Benchmarks
While we are still waiting for verified results in many of the state-of-the-art benchmarks, the announcement featured scores in several areas:
- Agentic workflows: OSWorld-Verified
- General coding: SWE-Bench Pro
- Agentic coding: Terminal-Bench 2.0
- Reasoning: GDPval
Agentic workflows
OSWorld-Verified is the gold-standard benchmark for testing an AI's ability to operate a computer like a human. It goes beyond simple text processing by placing the AI in a real virtual machine and asking it to complete open-ended tasks using a mouse, keyboard, and GUI apps (e.g., "Open LibreOffice, create a spreadsheet with this data, and save it as a PDF").
GPT-5.3-Codex achieves 64.7% in the OSWorld-Verified benchmark. That’s a staggering increase of 26.5 percentage points compared to its predecessor, GPT-5.2-Codex. This strong result reflects OpenAI’s focus on creating a more general, agentic experience for GPT-5.3-Codex, optimized for good performance across tasks and domains.
Coding
Software development was the initial focus of the Codex models. On SWE-bench Pro (Public), GPT-5.3-Codex reaches 56.8%, only a minor increase over the 56.4% of GPT-5.2-Codex. The incremental improvement here likely reflects the trade-off of optimizing for agentic skills.
On the agentic coding side, we see a significant jump: GPT-5.3-Codex scores 75.1% on Terminal-Bench 2.0, a substantial increase over the 64% of GPT-5.2-Codex. Even more interesting, it beat Claude Opus 4.6, which had claimed the top spot on the same benchmark just half an hour earlier, by over 5 percentage points.
Reasoning
For the model’s reasoning skills, there isn’t much exciting to report. GPT-5.3-Codex reaches exactly the same result as GPT-5.2 on GDPval (70.9%). A fair interpretation is that the (already solid) reasoning skills of GPT-5.2 were carried over into the Codex model, without a push for substantial improvement in this area.
How Can I Access GPT-5.3-Codex?
OpenAI announced that GPT-5.3-Codex is now available on all paid ChatGPT tiers: in the Codex app, from the CLI, via the IDE extension, and on the web.
The model is not yet available in the OpenAI API, but API access is expected to follow “soon”. There are no details on per-token pricing yet.
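For reference, once API access does arrive, a call should look like any other model in the OpenAI Python SDK. The identifier "gpt-5.3-codex" below is purely a guess for illustration; neither the model id nor the pricing has been confirmed.

```python
# Hypothetical: GPT-5.3-Codex is not available in the API yet, so the model
# id below is an assumed placeholder, not a confirmed identifier.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-5.3-codex",  # placeholder id
    input="Review this diff and flag any obvious security issues: ...",
)
print(response.output_text)
```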
GPT-5.3-Codex vs. Claude Opus 4.6
The biggest competition for GPT-5.3-Codex in the arena of software development-focused agents is arguably Claude Opus 4.6. Let’s see how the two compare.
General approach and agentic style
The approaches of OpenAI and Anthropic are not entirely different, but there are some nuances to note.
GPT-5.3-Codex is positioned as a rather autonomous builder, optimized for speed (25% faster) and "self-correcting" loops to finish engineering tasks without human help.
On the other hand, Claude Opus 4.6 is designed for deep thinking, with its massive context window (1M tokens) and "adaptive thought" helping it handle complex, messy legacy projects.
The agentic style of both models is focused on interaction, though in slightly different ways. GPT-5.3-Codex’s “steerability” lets users interrupt it mid-task to change direction (e.g., "Wait, use the v2 API instead") without breaking the workflow.
Claude Opus 4.6 acts more like a senior partner that you converse with, offering "High/Medium/Low" effort settings to manage costs and depth.
While GPT-5.3-Codex was specifically optimized for NVIDIA GB200 NVL72 hardware to reduce latency in agentic loops, Claude Opus 4.6 focuses on software-side optimizations like conversation compaction to manage long histories efficiently.
Benchmarks and performance
Benchmark-wise, the two models are hard to compare directly. The only benchmark for which we have scores for both is Terminal-Bench 2.0, where GPT-5.3-Codex (75.1%) outperforms Claude Opus 4.6 (69.9%).
This suggests that while Claude may be the deeper thinker, GPT-5.3-Codex is the more capable "hands-on" operator for executing dev tasks in a real environment, such as navigating file systems, managing dependencies, or running builds.
Beyond that, the two companies made different choices about which benchmarks to include in their release notes. This divergence likely reflects a strategic choice by both labs to highlight their specific strengths while avoiding direct comparisons where they might not claim the #1 title.
Here’s an overview of what we know:
| Feature / Category | GPT-5.3-Codex (OpenAI) | Claude Opus 4.6 (Anthropic) |
| --- | --- | --- |
| General Approach | Autonomous builder: optimized for speed (25% faster) and "self-correcting" loops to finish engineering tasks independently | Deep thinker: uses "adaptive thought" to handle complex, messy legacy projects |
| Agentic Style | Steerable: allows users to interrupt mid-task to change direction without breaking the workflow | Senior partner: a conversational style with "High/Medium/Low" effort settings to manage costs and depth |
| Optimization Focus | Hardware-side: optimized for NVIDIA GB200 NVL72 hardware to reduce latency in agentic loops | Software-side: focuses on conversation compaction to efficiently manage long histories |
| Key Specs | Speed-focused architecture | 1M token context window |
| Terminal-Bench 2.0 | 75.1%: superior "hands-on" operator for executing dev tasks (file systems, builds, dependencies) | 69.9%: scores lower on execution tasks, leaning more towards reasoning than operation |
GPT-5.3-Codex Use Cases
The key features we introduced earlier make GPT-5.3-Codex a strong fit for a few use cases:
- Self-healing infrastructure: an agent that monitors logs, identifies a crash, fixes the code, and redeploys without human input (see the sketch after this list)
- Legacy migration: translating from legacy languages like COBOL to modern stacks, where the general agent can rewrite the documentation simultaneously
- Cybersecurity: as the first model rated "high capability" for security tasks, it is well suited for automated penetration testing and patching
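As a rough illustration of the self-healing idea, the loop below polls a service log, looks for a crash signature, and hands the excerpt to a (stubbed) agent step before redeploying. The log path, crash pattern, and `systemctl` restart command are hypothetical placeholders; a real setup would delegate the diagnosis and patching to the Codex agent itself.

```python
# Conceptual sketch of a self-healing loop. Paths, patterns, and the
# redeploy command are hypothetical; the agent call is stubbed out.
import re
import subprocess
import time
from pathlib import Path

LOG_FILE = Path("/var/log/myapp/app.log")       # hypothetical service log
CRASH_PATTERN = re.compile(r"Traceback|FATAL")  # naive crash signature


def diagnose_and_patch(log_excerpt: str) -> bool:
    """Stub for the agent step: send the excerpt to a coding agent,
    apply the suggested patch, and report whether a fix was produced."""
    print("Would hand to the agent:\n", log_excerpt[-500:])
    return False


def redeploy() -> None:
    """Hypothetical redeploy hook, e.g. restarting a systemd unit."""
    subprocess.run(["systemctl", "restart", "myapp"], check=False)


while True:
    tail = LOG_FILE.read_text()[-4000:] if LOG_FILE.exists() else ""
    if CRASH_PATTERN.search(tail):
        if diagnose_and_patch(tail):
            redeploy()
    time.sleep(60)  # poll once a minute
```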
Final Thoughts
As good a coder as GPT-5.2-Codex, as good a thinker as GPT-5.2, and still more than that: with GPT-5.3-Codex, OpenAI has taken a step away from isolated models and towards a capable general-purpose agent. While many evaluations are still to come, the first benchmark results look promising.
The interactive collaboration feature is very neat, but for now, it is limited to the macOS Codex app. Users also still need to wait for API access.
Being optimized for speed and autonomous creation, GPT-5.3-Codex takes a different approach than Claude Opus 4.6 and outperforms it on Terminal-Bench 2.0, but the detailed differences in performance are still hard to assess, as both models have just dropped. Time will tell what the fuller picture of the comparison will look like.
If you’re interested in learning more about the concepts and capabilities of agentic tools, I recommend enrolling in our AI Agent Fundamentals skill track.
GPT-5.3-Codex FAQs
What is GPT-5.3-Codex?
It is OpenAI’s latest general-purpose agentic model, released in February 2026. It replaces GPT-5.2 and GPT-5.2-Codex by merging "coding agent" and "reasoning LLM" capabilities into a single model designed to handle end-to-end work, from writing code and debugging to updating Jira tickets and creating documentation.
How does the "interactive collaborator" feature work?
This feature allows you to steer the model in real-time while it works. Instead of waiting for the final result, you can watch its progress in the Codex macOS app and intervene mid-task to ask questions, provide feedback, or correct its course without breaking the workflow.
How does GPT-5.3-Codex compare to Claude Opus 4.6?
While Claude Opus 4.6 is positioned as a "deep thinker" for complex legacy projects, GPT-5.3-Codex is optimized as a faster, autonomous "builder." In benchmarks, GPT-5.3-Codex outperforms Claude on practical execution tasks (like Terminal-Bench 2.0) but may differ in reasoning style.
Is GPT-5.3-Codex safe for cybersecurity tasks?
Yes, but with guardrails. It is the first model classified as "high capability" for vulnerability detection under OpenAI’s Preparedness Framework. To ensure safety, OpenAI uses a defensive stack to prevent misuse (like automated attacks) and limits certain advanced capabilities to verified researchers.
How can I access GPT-5.3-Codex?
The model is currently available to all paid ChatGPT subscribers. You can access it via the web interface, the new Codex app for macOS, the Command Line Interface (CLI), or the IDE extension. API access is planned but not yet available.
