Just two days after the release of GPT-5.5 Instant, OpenAI has more big news, with a new release that focuses on three things at once:
- a voice model that can reason while it talks: that's GPT‑Realtime‑2
- live translation across 70+ languages: GPT‑Realtime‑Translate
- and finally, streaming transcription that keeps up with real conversations: GPT‑Realtime‑Whisper
In this article, we'll bring you up to speed on these three models.
What Is GPT-Realtime-2?
GPT-Realtime-2 is the new realtime voice model in OpenAI's API, and the first voice model OpenAI describes as having "GPT-5-class reasoning."
It's built for live voice interactions, i.e., someone is talking to it, not typing into it.
It's designed to keep the conversation moving while it reasons through a request, calls tools, and handles corrections. In other words, it is meant to respond in a way that fits the moment.
Here are some important characteristics compared to the previous model, GPT-Realtime-1.5:
- the context window jumps from 32K to 128K
- developers can now dial in reasoning effort
- small touches like preamble phrases make voice agents feel less robotic
What Is GPT-Realtime-Translate?
GPT-Realtime-Translate is OpenAI's new live speech translation model, supporting 70+ input languages and 13 output languages.
It's built for the voice-to-voice case: each person speaks in their preferred language, and the model translates in real time. It is supposed to hold meaning together when speakers switch context, use regional pronunciation, or drop in domain-specific terms.
What Is GPT-Realtime-Whisper?
GPT-Realtime-Whisper is OpenAI's new streaming speech-to-text model, built for low-latency transcription as the speaker talks.
The original Whisper was designed for completed chunks of audio. The newer streaming version is more useful for live broadcast captions and for voice agents that need to understand the user continuously rather than turn-by-turn.
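To make "streaming rather than turn-by-turn" concrete, here's a rough sketch of what consuming a live transcript could look like over the Realtime API's WebSocket interface. The event names follow existing Realtime API conventions, but the model name and exact session setup for GPT-Realtime-Whisper are assumptions on my part, so treat this as a shape, not a recipe:

```python
# Rough sketch: stream audio chunks in, print transcript deltas as they arrive.
# The model name in the URL is an assumption; event names follow existing
# Realtime API conventions. Verify both against the current API reference.
import asyncio, base64, json, os
import websockets

async def live_captions(audio_chunks):
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-whisper"  # assumed model name
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older versions of the websockets library use extra_headers= instead.
    async with websockets.connect(url, additional_headers=headers) as ws:

        async def send_audio():
            # Append raw PCM16 audio as it arrives (e.g., from a microphone).
            async for chunk in audio_chunks:
                await ws.send(json.dumps({
                    "type": "input_audio_buffer.append",
                    "audio": base64.b64encode(chunk).decode(),
                }))

        sender = asyncio.create_task(send_audio())

        # Print transcript text the moment the model emits it.
        async for message in ws:
            event = json.loads(message)
            if event["type"].endswith("input_audio_transcription.delta"):
                print(event["delta"], end="", flush=True)
```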
So, if you're lost, here's the structure:
- GPT-Realtime-2 = a full conversational voice agent. Listens, reasons, calls tools, talks back. You use this when you want voice in and voice out.
- GPT-Realtime-Translate = a translation pipe. Speech in language A → speech in language B. It's not having a conversation with anyone; it's converting one stream into another.
- GPT-Realtime-Whisper = a transcription pipe. Speech in → text out. No reasoning, no voice response. You'd use it for live captions, etc.
Key Features of GPT-Realtime-2
The following features apply to GPT-Realtime-2 specifically.
Preambles
Developers can have the model say short filler phrases like "let me check that" or "one moment while I look into it" before its main response.
This is a big feature because people tend to be impatient with awkward silence, and human-style filler is one of those things that makes an agent feel competent.
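OpenAI hasn't published a dedicated preamble parameter in what I've seen, so the simplest way to picture the feature is as part of the session configuration. A minimal sketch, assuming you steer it through the Realtime API's existing instructions field in a session.update event:

```python
# Sketch: nudging the model toward short spoken preambles before slow work.
# Sent as a session.update event over an open Realtime API WebSocket connection.
# Whether GPT-Realtime-2 exposes a dedicated preamble setting is not confirmed;
# this relies on the existing instructions field.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "Before any tool call or answer that takes more than a moment, "
            "say a brief preamble such as 'Let me check that' in a natural tone, "
            "then continue with the full response."
        ),
    },
}
```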
Parallel tool calls with audio narration
GPT-Realtime-2 can call multiple tools at once and narrate what it's doing while it does. So instead of dead air during a multi-step task, the user gets a running commentary. This is mostly a UX win.
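Tool definitions themselves should look familiar; what's new is that the model can fire several at once and keep talking while it waits. Here's a sketch of a session configured with two hypothetical tools, using the flattened function-tool format the Realtime API already uses:

```python
# Sketch: two hypothetical tools the model may call in parallel while narrating
# progress out loud. The tool schema follows the existing Realtime API format.
session_update = {
    "type": "session.update",
    "session": {
        "tool_choice": "auto",
        "tools": [
            {
                "type": "function",
                "name": "lookup_order",
                "description": "Fetch an order's current status by order ID.",
                "parameters": {
                    "type": "object",
                    "properties": {"order_id": {"type": "string"}},
                    "required": ["order_id"],
                },
            },
            {
                "type": "function",
                "name": "check_refund_policy",
                "description": "Return the refund policy for a product category.",
                "parameters": {
                    "type": "object",
                    "properties": {"category": {"type": "string"}},
                    "required": ["category"],
                },
            },
        ],
    },
}
# Your app still executes each function call the model emits and sends the results
# back; the running commentary ("I'm pulling up your order now...") comes from the model.
```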
Stronger recovery behavior
When something goes wrong, such as a tool failing or a request being ambiguous, the model can say something like "I'm having trouble with that right now" instead of going silent or making something up.
Context window: 32K → 128K
The context window jumps from 32K to 128K tokens.
Adjustable reasoning effort
Developers can now select from minimal, low, medium, high, and xhigh reasoning levels.
Low is the default, which keeps latency down for simple back-and-forth, with more deliberate options when the request is harder.
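OpenAI hasn't shown the exact parameter shape here, so treat the field below as a guess at how a reasoning-effort knob might surface in session configuration; only the level names and the "low" default come from OpenAI's description:

```python
# Sketch: picking a reasoning level per use case. The "reasoning" field name and
# placement are assumptions; the level names (minimal/low/medium/high/xhigh) and
# the "low" default are from OpenAI's description.
EFFORT_BY_USE_CASE = {
    "simple_faq_bot": "minimal",          # latency matters most
    "customer_support": "low",            # the stated default
    "troubleshooting_agent": "high",      # accept slower replies on harder requests
}

session_update = {
    "type": "session.update",
    "session": {
        "reasoning": {"effort": EFFORT_BY_USE_CASE["troubleshooting_agent"]},
    },
}
```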
Better domain understanding and tone control
The model is now better at retaining specialized terminology, such as healthcare terms. It can also adjust its delivery: calmer when resolving an issue, empathetic when a user is frustrated, upbeat when confirming a successful action.
GPT-Realtime-2 Benchmark Results
Let's take a look at the benchmarks. OpenAI is comparing against GPT-Realtime-1.5, which makes for a clean year-over-year picture:
- Big Bench Audio (audio intelligence): 81.4% → 96.6% — a 15.2 point lift.
- Audio MultiChallenge (instruction following in spoken dialogue): 34.7% → 48.5% — a 13.8 point lift.
The Big Bench Audio number is interesting. 96.6% tells us the benchmark is approaching saturation. Audio MultiChallenge, on the other hand, is still under 50%, so this second benchmark result is a useful reality check. "Better than last year's voice model" and "ready for unsupervised production" are different bars.
Worth flagging: these numbers were run at the "high" and "xhigh" reasoning settings. The default in production will be "low," for latency reasons, so users' real-world experience may not match what the headline benchmark results suggest.
How Can I Access GPT-Realtime-2?
All three audio models are available now in the Realtime API:
- GPT-Realtime-2: $32 per 1M audio input tokens ($0.40 for cached input), $64 per 1M audio output tokens.
- GPT-Realtime-Translate: $0.034 per minute.
- GPT-Realtime-Whisper: $0.017 per minute.
The two per-minute-priced models are much easier to reason about for budgeting. Per-token audio pricing is hard to convert into "what will this cost per call?" without actually instrumenting it, so a developer should expect to spend a little time modeling expected costs before shipping or promising anything.
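To see what that modeling looks like, here's a back-of-envelope sketch. The tokens-per-minute figures are assumptions I'm using purely to illustrate the shape of the calculation (roughly in line with earlier realtime audio models, but verify against your own traffic), and it ignores cached input and any text tokens:

```python
# Back-of-envelope cost model for a GPT-Realtime-2 voice call.
# The tokens-per-minute figures are rough assumptions for illustration only;
# measure real token usage before trusting any estimate.

INPUT_PRICE_PER_M = 32.0     # $ per 1M audio input tokens
OUTPUT_PRICE_PER_M = 64.0    # $ per 1M audio output tokens

AUDIO_IN_TOKENS_PER_MIN = 600    # assumption: ~10 audio tokens per second of user speech
AUDIO_OUT_TOKENS_PER_MIN = 1200  # assumption: ~20 audio tokens per second of model speech

def estimate_call_cost(minutes_user_speaking: float, minutes_model_speaking: float) -> float:
    input_tokens = minutes_user_speaking * AUDIO_IN_TOKENS_PER_MIN
    output_tokens = minutes_model_speaking * AUDIO_OUT_TOKENS_PER_MIN
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

# Example: a 5-minute support call where the user talks ~3 minutes and the agent ~2.
print(f"${estimate_call_cost(3, 2):.2f} per call")  # ~$0.21 under these assumptions
```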
Luckily, you can test GPT-Realtime-2 in the Playground, and OpenAI is pointing developers toward Codex (we have written a lot about Codex) with a starter prompt for adding it to existing apps.
GPT-Realtime-2 and Safety
On the safety side, OpenAI says active classifiers can halt sessions that violate its harmful content guidelines, and developers can layer their own guardrails via the Agents SDK, which we've also written about.
Keep in mind: Voice introduces very specific ways things can go wrong. These are worth talking about:
- Accidental activations: The system starts listening or responding when nobody meant to talk to it.
- Ambient audio capture: Once a microphone is on, it picks up everything in the room, not just the user: background conversations, kids, coworkers, a TV, a confidential meeting next door, and so on.
- Voice-cloning concerns: Voice is biometric. Synthetic speech that sounds like a real person can be used for impersonation, fraud, or bypassing voice-authentication systems. This is both an output and input concern.
Final Thoughts
OpenAI is bundling the things that make voice agents feel competent (filler phrases, narrated tool calls, graceful recovery, a big context window, a real reasoning dial) into a model that can also actually reason. What this amounts to for the user: fewer awkward silences and conversations that are less likely to fall apart. That's a big step forward.
