
OpenAI's GPT-Realtime-2: A Voice Model with GPT-5-Class Reasoning

OpenAI's three new audio models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper — allow for live translation and streaming transcription in the Realtime API.
May 7, 2026  · 9 min read

Just two days after the release of GPT-5.5 Instant, OpenAI has more big news, with a new release that focuses on three things at once:

  • a voice model that can reason while it talks: that's GPT‑Realtime‑2
  • live translation across 70+ languages: GPT‑Realtime‑Translate
  • and finally, streaming transcription that keeps up with real conversations: GPT‑Realtime‑Whisper

In this article, we'll bring you up to speed on all three models.

What Is GPT-Realtime-2?

GPT-Realtime-2 is the new realtime voice model in OpenAI's API, and the first voice model OpenAI describes as having "GPT-5-class reasoning."

It's built for live voice interactions; i.e., someone is talking into it, not typing.

The model is designed to keep the conversation moving while it reasons through a request, calls tools, and handles corrections. In other words, it responds in a way that fits the moment.

Here are some important characteristics compared to the previous model, GPT-Realtime-1.5:

  • the context window jumps from 32K to 128K
  • developers can now dial in reasoning effort
  • small touches like preamble phrases make voice agents feel less robotic

What Is GPT-Realtime-Translate?

GPT-Realtime-Translate is OpenAI's new live speech translation model, supporting 70+ input languages and 13 output languages.

It's built for the voice-to-voice case: each person speaks in their preferred language, and the model translates in real time. It is supposed to hold meaning together when speakers switch context, use regional pronunciation, or drop in domain-specific terms.
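To make that concrete, here's a minimal sketch of what configuring a translation session might look like. The `session.update` envelope mirrors the existing Realtime API event convention, but the translation-specific field name below is an assumption, since OpenAI has not published the exact schema alongside the announcement:

```python
import json

def build_translate_session(target_language: str) -> dict:
    """Build a hypothetical session.update event for gpt-realtime-translate.

    The "session.update" envelope mirrors existing Realtime API events;
    "output_language" is a placeholder field name, not a confirmed one.
    """
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-translate",   # model name from the announcement
            "output_language": target_language,  # one of the 13 output languages
        },
    }

print(json.dumps(build_translate_session("es"), indent=2))
```

In a real integration, an event like this would be sent over the Realtime API WebSocket once the connection opens.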

What Is GPT-Realtime-Whisper?

GPT-Realtime-Whisper is OpenAI's new streaming speech-to-text model, built for low-latency transcription as the speaker talks.

The original Whisper was designed for completed chunks of audio. With the newer streaming version, we have a model that is more useful for live broadcast captions and voice agents that need to understand the user continuously rather than turn-by-turn.

So, if you're lost, here's the structure:

  • GPT-Realtime-2 = a full conversational voice agent. Listens, reasons, calls tools, talks back. You use this when you want voice in and voice out.
  • GPT-Realtime-Translate = a translation pipe. Speech in language A → speech in language B. It's not having a conversation with anyone; it's converting one stream into another. 
  • GPT-Realtime-Whisper = a transcription pipe. Speech in → text out. No reasoning, no voice response. You'd use it for live captions, etc.
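If it helps, that split can be written down as a dispatch rule. The model identifiers come from the announcement; the decision logic is just our framing of the three use cases:

```python
def pick_realtime_model(need_voice_reply: bool, need_translation: bool) -> str:
    """Map a use case onto the three new audio models (illustrative only)."""
    if need_translation:
        return "gpt-realtime-translate"  # speech in language A -> speech in language B
    if need_voice_reply:
        return "gpt-realtime-2"          # full conversational agent: voice in, voice out
    return "gpt-realtime-whisper"        # transcription pipe: speech in, text out

print(pick_realtime_model(need_voice_reply=True, need_translation=False))
# gpt-realtime-2
```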

Key Features of GPT-Realtime-2

The following features apply to GPT-Realtime-2 specifically.

Preambles

Developers can have the model say short filler phrases like "let me check that" or "one moment while I look into it" before its main response.

This is a big feature because people tend to be impatient with awkward silence. Human-style filler is one of those things that makes an agent feel competent.

Parallel tool calls with audio narration

GPT-Realtime-2 can call multiple tools at once and narrate what it's doing as it works. So instead of dead air during a multi-step task, the user gets running commentary. This is mostly a UX win.
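The pattern itself is familiar client-side engineering: start several tool calls concurrently and fill the silence with narration while they run. Here's a sketch with plain asyncio (the tool functions are made up; this illustrates the behavior, not OpenAI's implementation):

```python
import asyncio

async def check_inventory(item: str) -> str:
    await asyncio.sleep(0.2)  # stand-in for a slow API call
    return f"{item}: in stock"

async def get_shipping_quote(item: str) -> str:
    await asyncio.sleep(0.3)  # stand-in for another slow API call
    return f"{item}: ships in 2 days"

async def run_with_narration(item: str) -> list[str]:
    # Start both tool calls at once...
    tasks = [
        asyncio.create_task(check_inventory(item)),
        asyncio.create_task(get_shipping_quote(item)),
    ]
    # ...and narrate (here, print) instead of leaving dead air while they run.
    print("Let me check stock and shipping for you...")
    return await asyncio.gather(*tasks)

results = asyncio.run(run_with_narration("blue kettle"))
print(results)
```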

Stronger recovery behavior

When something goes wrong, such as a tool failing or a request being ambiguous, the model can say something like "I'm having trouble with that right now" instead of going silent or making something up.

Context window: 32K → 128K

The upgrade quadruples the amount of conversation history and context the model can process in a single session, from 32,000 tokens to 128,000 tokens. This makes the model better suited to longer conversations without drift.

Adjustable reasoning effort

Developers can now select from minimal, low, medium, high, and xhigh reasoning levels.

Low is the default, which keeps latency down for simple back-and-forth, with more deliberate options when the request is harder. 
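In practice this presumably means one more field on the session object. The five level names come from the announcement; the `reasoning_effort` field name is an assumption modeled on OpenAI's text API:

```python
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_with_effort(effort: str = "low") -> dict:
    """Return a hypothetical session payload; "low" mirrors the stated default."""
    if effort not in REASONING_LEVELS:
        raise ValueError(f"effort must be one of {REASONING_LEVELS}, got {effort!r}")
    return {
        "type": "session.update",
        "session": {"model": "gpt-realtime-2", "reasoning_effort": effort},
    }

print(session_with_effort("high")["session"]["reasoning_effort"])
```

The tradeoff to keep in mind: higher levels buy more deliberate answers at the cost of latency, which is exactly what the default tries to avoid.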

Better domain understanding and tone control

The model now better retains specialized terminology, such as healthcare terms or financial jargon. It can also adjust its delivery: calmer when resolving an issue, empathetic when a user is frustrated, upbeat when confirming a successful action. 

GPT-Realtime-2 Benchmark Results

Let's take a look at the benchmarks. OpenAI is comparing against GPT-Realtime-1.5, which makes for a clean year-over-year picture:

  • Big Bench Audio (audio intelligence): 81.4% → 96.6% — a 15.2 point lift.
  • Audio MultiChallenge (instruction following in spoken dialogue): 34.7% → 48.5% — a 13.8 point lift.

The Big Bench Audio number is interesting. 96.6% tells us the benchmark is approaching saturation. Audio MultiChallenge, on the other hand, is still under 50%, so this second benchmark result is a useful reality check. "Better than last year's voice model" and "ready for unsupervised production" are different bars.

Worth flagging: these numbers were run at the "high" and "xhigh" reasoning settings. The default in production will be "low" for latency reasons, so real-world performance may not match the headline benchmark results.

How Can I Access GPT-Realtime-2?

All three audio models are available now in the Realtime API:

  • GPT-Realtime-2: $32 per 1M audio input tokens ($0.40 per 1M cached input tokens), $64 per 1M audio output tokens.
  • GPT-Realtime-Translate: $0.034 per minute.
  • GPT-Realtime-Whisper: $0.017 per minute.

The two per-minute models are much easier to reason about for budgeting. Per-token audio pricing is hard to convert into "what will this cost per call?" without actually instrumenting it, so expect to spend some time modeling costs before shipping or making promises.
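A starting point for that modeling is to assume an audio-token rate and convert. Earlier realtime models have been reported at roughly 600 audio tokens per minute; that rate, and the 50/50 talk split below, are assumptions you should replace with measured numbers:

```python
def estimate_call_cost(minutes: float,
                       tokens_per_minute: float = 600.0,  # assumed audio-token rate
                       user_share: float = 0.5) -> float:
    """Estimate per-call cost for gpt-realtime-2 at $32/1M input, $64/1M output audio tokens."""
    input_tokens = minutes * user_share * tokens_per_minute          # user speaking
    output_tokens = minutes * (1 - user_share) * tokens_per_minute   # model speaking
    return (input_tokens * 32 + output_tokens * 64) / 1_000_000

# A 10-minute call under these assumptions:
print(f"${estimate_call_cost(10):.3f} per call")
```

Under these (very rough) assumptions, a 10-minute call comes to about $0.29, which you can compare against a flat $0.34 for ten minutes of GPT-Realtime-Translate.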

Luckily, you can test GPT-Realtime-2 in the Playground, and OpenAI is pointing developers toward Codex (we have written a lot about Codex) with a starter prompt for adding it to existing apps. 

GPT-Realtime-2 and Safety

On the safety side, OpenAI says active classifiers can halt sessions that violate its harmful content guidelines, and developers can layer their own guardrails via the Agents SDK, which we've also written about.

Keep in mind: voice introduces very specific failure modes. These are worth talking about:

  • Accidental activations: The system starts listening or responding when nobody meant to talk to it.
  • Ambient audio capture: Once a microphone is on, it picks up everything in the room, not just the user: background conversations, kids, coworkers, a TV, a confidential meeting next door, and so on.
  • Voice-cloning concerns: Voice is biometric. Synthetic speech that sounds like a real person can be used for impersonation, fraud, or bypassing voice-authentication systems. This is both an output and input concern.
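As a flavor of what "layer your own guardrails" can mean, here's a deliberately simple sketch that halts a session when a transcript delta matches a blocklist. A real deployment would use proper classifiers rather than a keyword list:

```python
BLOCKED_TERMS = {"password", "social security number"}  # illustrative only

def guard_transcript_delta(delta: str) -> bool:
    """Return True if the session should halt before this delta is processed."""
    lowered = delta.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def run_session(deltas: list[str]) -> list[str]:
    """Process transcript deltas until a guardrail trips."""
    processed = []
    for delta in deltas:
        if guard_transcript_delta(delta):
            print("Guardrail tripped; halting session.")
            break
        processed.append(delta)
    return processed

print(run_session(["What's the weather", "tell me your password"]))
```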

Final Thoughts

OpenAI is bundling the things that make voice agents feel competent — filler phrases, narrated tool calls, graceful recovery, a big context window, a real reasoning dial — into a model that can also actually reason. What this amounts to for the user: fewer awkward silences and conversations that are less likely to fall apart. That's a big step forward.

GPT-Realtime-2 FAQs

What is GPT-Realtime-2 and what makes it different from previous realtime models?

GPT-Realtime-2 is OpenAI's most intelligent voice model to date, bringing GPT-5-class reasoning to real-time voice interactions. Unlike earlier realtime models, it can plan, decide, use tools, recover from interruptions, and handle longer agentic workflows, all while staying naturally responsive in conversation.

What languages does GPT-Realtime-Translate support?

It accepts input from 70+ languages and outputs translated audio in 13 languages, returning both translated audio and transcript deltas while the source speaker is still talking. The full list of supported languages had not been released at the time of writing.

When should you use GPT-Realtime-Whisper instead of other transcription models?

Use it when your app needs live transcript deltas from streaming audio (e.g., live captions or meeting notes), as opposed to GPT-4o-Transcribe, which is better for offline or request-response transcription where higher accuracy or cost matter more than latency.
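"Transcript deltas" means the API streams small text fragments as audio arrives, and the client stitches them together. A minimal accumulator looks like this (the event name is hypothetical, modeled on existing Realtime API transcription events):

```python
def apply_deltas(events: list[dict]) -> str:
    """Fold streamed delta events into a running caption string."""
    caption = ""
    for event in events:
        if event.get("type") == "transcript.delta":  # hypothetical event name
            caption += event["delta"]
    return caption

events = [
    {"type": "transcript.delta", "delta": "Live "},
    {"type": "transcript.delta", "delta": "captions "},
    {"type": "transcript.delta", "delta": "work."},
    {"type": "transcript.done"},
]
print(apply_deltas(events))
# Live captions work.
```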

How are the three models priced?

The models are priced as follows:

  • gpt-realtime-2: $32/1M audio input tokens, $0.40/1M cached, $64/1M audio output tokens (text: $4/$0.40/$24 per 1M tokens)
  • gpt-realtime-translate: $0.034 per minute of audio
  • gpt-realtime-whisper: $0.017 per minute of audio

Both gpt-realtime-translate and gpt-realtime-whisper are billed by audio duration rather than text tokens, which makes cost more predictable.


Author
Josef Waples

I'm a data science writer and editor with contributions to research articles in scientific journals. I'm especially interested in linear algebra, statistics, R, and the like. I also play a fair amount of chess! 
