
OpenAI's GPT-Realtime-2: A Voice Model with GPT-5-Class Reasoning

OpenAI's three new audio models — GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper — allow for live translation and streaming transcription in the Realtime API.
May 7, 2026  · 9 min read

Just two days after the release of GPT-5.5 Instant, OpenAI has more big news, with a new release that focuses on three things at once:

  • a voice model that can reason while it talks: that's GPT‑Realtime‑2
  • live translation across 70+ languages: GPT‑Realtime‑Translate
  • and finally, streaming transcription that keeps up with real conversations: GPT‑Realtime‑Whisper

In this article, we'll bring you up to speed on all three models.

What Is GPT-Realtime-2?

GPT-Realtime-2 is the new realtime voice model in OpenAI's API, and the first voice model OpenAI describes as having "GPT-5-class reasoning."

It's built for live voice interactions; i.e., someone is talking into it, not typing.

The model is designed to keep the conversation moving while it reasons through a request, calls tools, and handles corrections. In other words, it responds in a way that fits the moment.

Here are some important characteristics compared to the previous model, GPT-Realtime-1.5:

  • the context window jumps from 32K to 128K
  • developers can now dial in reasoning effort
  • small touches like preamble phrases make voice agents feel less robotic

What Is GPT-Realtime-Translate?

GPT-Realtime-Translate is OpenAI's new live speech translation model, supporting 70+ input languages and 13 output languages.

It's built for the voice-to-voice case: each person speaks in their preferred language, and the model translates in real time. It is supposed to hold meaning together when speakers switch context, use regional pronunciation, or drop in domain-specific terms.
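To make that concrete, here's a minimal sketch of what configuring a translation session might look like. The `session.update` envelope mirrors the existing Realtime API event convention, but the translation-specific field name below is an assumption, since OpenAI has not published the exact schema alongside the announcement:

```python
import json

def build_translate_session(target_language: str) -> dict:
    """Build a hypothetical session.update event for gpt-realtime-translate.

    The "session.update" envelope mirrors existing Realtime API events;
    "output_language" is a placeholder field name, not a confirmed one.
    """
    return {
        "type": "session.update",
        "session": {
            "model": "gpt-realtime-translate",   # model name from the announcement
            "output_language": target_language,  # one of the 13 output languages
        },
    }

print(json.dumps(build_translate_session("es"), indent=2))
```

In a real integration, an event like this would be sent over the Realtime API WebSocket once the connection opens.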

What Is GPT-Realtime-Whisper?

GPT-Realtime-Whisper is OpenAI's new streaming speech-to-text model, built for low-latency transcription as the speaker talks.

The original Whisper was designed for completed chunks of audio. With the newer streaming version, we have a model that is more useful for live broadcast captions and voice agents that need to understand the user continuously rather than turn-by-turn.

So, if you're lost, here's the structure:

  • GPT-Realtime-2 = a full conversational voice agent. Listens, reasons, calls tools, talks back. You use this when you want voice in and voice out.
  • GPT-Realtime-Translate = a translation pipe. Speech in language A → speech in language B. It's not having a conversation with anyone; it's converting one stream into another. 
  • GPT-Realtime-Whisper = a transcription pipe. Speech in → text out. No reasoning, no voice response. You'd use it for live captions, etc.
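If it helps, that split can be written down as a dispatch rule. The model identifiers come from the announcement; the decision logic is just our framing of the three use cases:

```python
def pick_realtime_model(need_voice_reply: bool, need_translation: bool) -> str:
    """Map a use case onto the three new audio models (illustrative only)."""
    if need_translation:
        return "gpt-realtime-translate"  # speech in language A -> speech in language B
    if need_voice_reply:
        return "gpt-realtime-2"          # full conversational agent: voice in, voice out
    return "gpt-realtime-whisper"        # transcription pipe: speech in, text out

print(pick_realtime_model(need_voice_reply=True, need_translation=False))
# gpt-realtime-2
```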

Key Features of GPT-Realtime-2

The following features apply to GPT-Realtime-2 specifically.

Preambles

Developers can have the model say short filler phrases like "let me check that" or "one moment while I look into it" before its main response.

This is a big feature because people tend to be impatient with awkward silence. Human-style filler is one of those things that makes an agent feel competent.

Parallel tool calls with audio narration

GPT-Realtime-2 can call multiple tools at once and narrate what it's doing as it works. So instead of dead air during a multi-step task, the user gets running commentary. This is mostly a UX win.
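The pattern itself is familiar client-side engineering: start several tool calls concurrently and fill the silence with narration while they run. Here's a sketch with plain asyncio (the tool functions are made up; this illustrates the behavior, not OpenAI's implementation):

```python
import asyncio

async def check_inventory(item: str) -> str:
    await asyncio.sleep(0.2)  # stand-in for a slow API call
    return f"{item}: in stock"

async def get_shipping_quote(item: str) -> str:
    await asyncio.sleep(0.3)  # stand-in for another slow API call
    return f"{item}: ships in 2 days"

async def run_with_narration(item: str) -> list[str]:
    # Start both tool calls at once...
    tasks = [
        asyncio.create_task(check_inventory(item)),
        asyncio.create_task(get_shipping_quote(item)),
    ]
    # ...and narrate (here, print) instead of leaving dead air while they run.
    print("Let me check stock and shipping for you...")
    return await asyncio.gather(*tasks)

results = asyncio.run(run_with_narration("blue kettle"))
print(results)
```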

Stronger recovery behavior

When something goes wrong, such as a tool failing or a request being ambiguous, the model can say something like "I'm having trouble with that right now" instead of going silent or making something up.

Context window: 32K → 128K

The upgrade quadruples the amount of conversation history and context the model can process in a single session, from 32,000 tokens to 128,000 tokens. This makes the model better suited to longer conversations without drift.

Adjustable reasoning effort

Developers can now select from minimal, low, medium, high, and xhigh reasoning levels.

Low is the default, which keeps latency down for simple back-and-forth, with more deliberate options when the request is harder. 
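In practice this presumably means one more field on the session object. The five level names come from the announcement; the `reasoning_effort` field name is an assumption modeled on OpenAI's text API:

```python
REASONING_LEVELS = ("minimal", "low", "medium", "high", "xhigh")

def session_with_effort(effort: str = "low") -> dict:
    """Return a hypothetical session payload; "low" mirrors the stated default."""
    if effort not in REASONING_LEVELS:
        raise ValueError(f"effort must be one of {REASONING_LEVELS}, got {effort!r}")
    return {
        "type": "session.update",
        "session": {"model": "gpt-realtime-2", "reasoning_effort": effort},
    }

print(session_with_effort("high")["session"]["reasoning_effort"])
```

The tradeoff to keep in mind: higher levels buy more deliberate answers at the cost of latency, which is exactly what the default tries to avoid.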

Better domain understanding and tone control

The model now better retains specialized terminology, such as healthcare terms or financial jargon. It can also adjust its delivery: calmer when resolving an issue, empathetic when a user is frustrated, upbeat when confirming a successful action. 

GPT-Realtime-2 Benchmark Results

Let's take a look at the benchmarks. OpenAI is comparing against GPT-Realtime-1.5, which makes for a clean year-over-year picture:

  • Big Bench Audio (audio intelligence): 81.4% → 96.6% — a 15.2 point lift.
  • Audio MultiChallenge (instruction following in spoken dialogue): 34.7% → 48.5% — a 13.8 point lift.

The Big Bench Audio number is interesting. 96.6% tells us the benchmark is approaching saturation. Audio MultiChallenge, on the other hand, is still under 50%, so this second benchmark result is a useful reality check. "Better than last year's voice model" and "ready for unsupervised production" are different bars.

Worth flagging: these numbers were run at the "high" and "xhigh" reasoning settings. The default in production will be "low" for latency reasons, so real-world performance may not match the headline benchmark results.

How Can I Access GPT-Realtime-2?

All three audio models are available now in the Realtime API:

  • GPT-Realtime-2: $32 per 1M audio input tokens ($0.40 per 1M cached input tokens), $64 per 1M audio output tokens.
  • GPT-Realtime-Translate: $0.034 per minute.
  • GPT-Realtime-Whisper: $0.017 per minute.

The two per-minute models are much easier to reason about for budgeting. Per-token audio pricing is hard to convert into "what will this cost per call?" without actually instrumenting it, so expect to spend some time modeling costs before shipping or making promises.
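A starting point for that modeling is to assume an audio-token rate and convert. Earlier realtime models have been reported at roughly 600 audio tokens per minute; that rate, and the 50/50 talk split below, are assumptions you should replace with measured numbers:

```python
def estimate_call_cost(minutes: float,
                       tokens_per_minute: float = 600.0,  # assumed audio-token rate
                       user_share: float = 0.5) -> float:
    """Estimate per-call cost for gpt-realtime-2 at $32/1M input, $64/1M output audio tokens."""
    input_tokens = minutes * user_share * tokens_per_minute          # user speaking
    output_tokens = minutes * (1 - user_share) * tokens_per_minute   # model speaking
    return (input_tokens * 32 + output_tokens * 64) / 1_000_000

# A 10-minute call under these assumptions:
print(f"${estimate_call_cost(10):.3f} per call")
```

Under these (very rough) assumptions, a 10-minute call comes to about $0.29, which you can compare against a flat $0.34 for ten minutes of GPT-Realtime-Translate.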

Luckily, you can test GPT-Realtime-2 in the Playground, and OpenAI is pointing developers toward Codex (we have written a lot about Codex) with a starter prompt for adding it to existing apps. 

GPT-Realtime-2 and Safety

On the safety side, OpenAI says active classifiers can halt sessions that violate its harmful content guidelines, and developers can layer their own guardrails via the Agents SDK, which we've also written about.

Keep in mind: voice introduces very specific failure modes. These are worth talking about:

  • Accidental activations: The system starts listening or responding when nobody meant to talk to it.
  • Ambient audio capture: Once a microphone is on, it picks up everything in the room, not just the user: background conversations, kids, coworkers, a TV, a confidential meeting next door, and so on.
  • Voice-cloning concerns: Voice is biometric. Synthetic speech that sounds like a real person can be used for impersonation, fraud, or bypassing voice-authentication systems. This is both an output and input concern.
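As a flavor of what "layer your own guardrails" can mean, here's a deliberately simple sketch that halts a session when a transcript delta matches a blocklist. A real deployment would use proper classifiers rather than a keyword list:

```python
BLOCKED_TERMS = {"password", "social security number"}  # illustrative only

def guard_transcript_delta(delta: str) -> bool:
    """Return True if the session should halt before this delta is processed."""
    lowered = delta.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

def run_session(deltas: list[str]) -> list[str]:
    """Process transcript deltas until a guardrail trips."""
    processed = []
    for delta in deltas:
        if guard_transcript_delta(delta):
            print("Guardrail tripped; halting session.")
            break
        processed.append(delta)
    return processed

print(run_session(["What's the weather", "tell me your password"]))
```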

Final Thoughts

OpenAI is bundling the things that make voice agents feel competent — filler phrases, narrated tool calls, graceful recovery, a big context window, a real reasoning dial — into a model that can also actually reason. What this amounts to for the user: fewer awkward silences and conversations that are less likely to fall apart. That's a big step forward.

GPT-Realtime-2 FAQs

What is GPT-Realtime-2 and what makes it different from previous realtime models?

GPT-Realtime-2 is OpenAI's most intelligent voice model to date, bringing GPT-5-class reasoning to real-time voice interactions. Unlike earlier realtime models, it can plan, decide, use tools, recover from interruptions, and handle longer agentic workflows, all while staying naturally responsive in conversation.

What languages does GPT-Realtime-Translate support?

It accepts input from 70+ languages and outputs translated audio in 13 languages, returning both translated audio and transcript deltas while the source speaker is still talking. The full list of supported languages had not been released at the time of writing.

When should you use GPT-Realtime-Whisper instead of other transcription models?

Use it when your app needs live transcript deltas from streaming audio (e.g., live captions or meeting notes), as opposed to GPT-4o-Transcribe, which is better for offline or request-response transcription where higher accuracy or cost matter more than latency.
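"Transcript deltas" means the API streams small text fragments as audio arrives, and the client stitches them together. A minimal accumulator looks like this (the event name is hypothetical, modeled on existing Realtime API transcription events):

```python
def apply_deltas(events: list[dict]) -> str:
    """Fold streamed delta events into a running caption string."""
    caption = ""
    for event in events:
        if event.get("type") == "transcript.delta":  # hypothetical event name
            caption += event["delta"]
    return caption

events = [
    {"type": "transcript.delta", "delta": "Live "},
    {"type": "transcript.delta", "delta": "captions "},
    {"type": "transcript.delta", "delta": "work."},
    {"type": "transcript.done"},
]
print(apply_deltas(events))
# Live captions work.
```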

How are the three models priced?

The models are priced as follows:

  • gpt-realtime-2: $32/1M audio input tokens, $0.40/1M cached, $64/1M audio output tokens (text: $4/$0.40/$24 per 1M tokens)
  • gpt-realtime-translate: $0.034 per minute of audio
  • gpt-realtime-whisper: $0.017 per minute of audio

Both gpt-realtime-translate and gpt-realtime-whisper are billed by audio duration rather than text tokens, which makes cost more predictable.


Author
Josef Waples

I'm a data science writer and editor with contributions to research articles in scientific journals. I'm especially interested in linear algebra, statistics, R, and the like. I also play a fair amount of chess! 
