
Grok Voice Agent Builder: A Hands-On Guide in Python

Build a Python voice agent with the same API used by Grok Voice Agent Builder: WebSocket setup, audio streaming, tool calling, cost tracking, and a FastAPI endpoint.
Jul 2, 2026  · 11 min read

xAI released Voice Agent Builder, a console for creating voice agents. You describe the call flow, attach documents and tools, and choose a voice.

When I test a voice agent console, I care less about the launch note and more about the parts I have to wire into code: how the WebSocket session is configured, how audio moves, where tool calls happen, what the call costs, and how another app would call the workflow.

The code below rebuilds that flow directly against the Voice Agent API. Specifically, we'll use a clinic appointment assistant that checks availability, replies by voice, tracks cost, handles tool failures, and exposes a FastAPI endpoint.

What Is Grok Voice Agent Builder?

Voice Agent Builder is xAI's console for creating and deploying voice agents on Grok Voice. It launched in beta on July 1, 2026. Instead of chaining separate speech-to-text, language model, and text-to-speech services, it uses one voice model path.

The console includes telephony, document retrieval, tools and connectors, guardrails, remote MCP servers, and call logs with recordings, transcripts, and traces.

Audio is billed by the minute. The console is still beta, so we use the API directly.

How the Grok Voice Agent API Works Under the Builder

Under the console is the Voice Agent API, a realtime WebSocket API that exposes the same runtime used by the Builder.

Diagram showing the Grok Voice Agent Builder console layered on top of the xAI Voice Agent API WebSocket.

Builder sits atop the Voice API. Image by Author.

The model behind this tutorial is grok-voice-think-fast-1.0. The grok-voice-latest alias points at the newest model. The code here connects through that alias, but for a deployed app I would pin the versioned name. xAI reports a 67.3% score for this model on the τ-voice Bench leaderboard; I treat that as one data point, not a guarantee.

Compatibility note: the API is compatible with the OpenAI Realtime API. If you have code that talks to OpenAI's realtime endpoint, you mostly change the base URL and the key.

Project Overview: What We'll Build

The clinic assistant takes spoken input, replies in a generated voice, asks follow-up questions, checks availability before offering a slot, and hands off to a human when needed. The core example uses one tool; the Streamlit demo adds booking, transfer, and end call actions.

The core tutorial splits into four files, each with one job:

  • voice_client.py holds the WebSocket client, audio helpers, and cost tracking

  • tools.py holds check_availability, plus extra demo tools used by Streamlit

  • assistant.py holds the system prompt, session config, and the workflow

  • app.py serves the whole thing through FastAPI

Those four files are the path through the article. The repo also includes app_streamlit.py for the visual demo and run.py as a Windows launcher, but we will come back to those after the core flow works.

Prerequisites

Before the code runs, you need Python 3.10 or newer, an xAI account, an API key from console.x.ai, prepaid credits, and basic comfort with environment variables, JSON, and WebSockets.

Setting up the project

Create a folder and a virtual environment, then install the packages:

mkdir appointment-agent
cd appointment-agent
python -m venv .venv
.venv\Scripts\activate       # macOS/Linux: source .venv/bin/activate
pip install websockets python-dotenv fastapi uvicorn pydantic httpx numpy streamlit

Pin these packages in a requirements.txt so a fresh checkout uses the same setup.

Create a .env file next to the Python files:

XAI_API_KEY=xai-your-key-here

Add .env to .gitignore. The API key should stay on the server.

Building the Voice Agent

Let's start building.

Connecting to the Grok Voice Agent API via WebSocket

The first step is opening the connection. Pass the model as a query parameter and your key as a bearer token on the handshake:

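import asyncio
import json
import os

import websockets
from dotenv import load_dotenv

load_dotenv()  # read XAI_API_KEY from the .env file created above

async def voice_agent():
    # The model is chosen with a query parameter on the realtime URL.
    url = "wss://api.x.ai/v1/realtime?model=grok-voice-latest"
    async with websockets.connect(
        url,
        # Recent websockets releases use additional_headers; older ones used extra_headers.
        additional_headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
    ) as ws:
        # Print the type of every server event as it arrives.
        async for message in ws:
            print(json.loads(message)["type"])

asyncio.run(voice_agent())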

Against a live key, the first event you see is session.created, which means the socket is open and ready to configure.

Terminal output printing the session.created event after connecting to the Grok Voice Agent API over WebSocket.

Session created event confirms the connection. Image by Author.

Configuring the voice session

A live socket is not a configured agent. You shape it by sending a session.update event with a session object.
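As a minimal sketch of that event, the helper below wraps a config dict in the session.update envelope described above. The name session_update_event is my own, and the config values are placeholders, not the assistant's real settings.

```python
import json


def session_update_event(session: dict) -> str:
    """Serialize a session config as a session.update event (hypothetical helper)."""
    return json.dumps({"type": "session.update", "session": session})


# Placeholder config; the real one is assembled in assistant.py.
message = session_update_event({"voice": "ara", "instructions": "Be brief."})
print(message)
```

Inside the connection block, you would send the result with await ws.send(message) before streaming any audio.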

Voice, audio format, and instructions

The three settings you touch most are the voice, the audio format, and the system prompt. The realtime API exposes five named voices, eve, ara, rex, sal, and leo, plus any custom clone. Audio defaults to audio/pcm at 24000 Hz, with input and output configured separately.

Here is the session config the assistant uses, assembled in assistant.py:

from datetime import date

def build_session_config(voice="ara", instructions=SYSTEM_PROMPT, sample_rate=24000):
    # The model needs to know "today" or it guesses the year for a date like "July 6th".
    instructions = f"{instructions}\nToday's date is {date.today().isoformat()}."
    return {
        "voice": voice,
        "instructions": instructions,
        "turn_detection": None,  # manual turns for file-based input
        "audio": {
            "input": {"format": {"type": "audio/pcm", "rate": sample_rate}},
            "output": {"format": {"type": "audio/pcm", "rate": sample_rate}},
        },
        "tools": [CHECK_AVAILABILITY_TOOL],
    }

The instructions field is the system prompt. This clinic prompt stays short because long voice replies are hard to follow:

You are a voice appointment assistant for a small clinic. Help callers book,
reschedule, cancel, or ask questions about appointments, services, and hours.
Answer whatever the caller asks that relates to the clinic. Keep responses short
and natural for a phone conversation. Ask one question at a time. Confirm
important details before taking action. Use the availability tool before offering
a time slot. Escalate to a human for medical, urgent, sensitive, or unclear
requests. If a caller asks about something unrelated to the clinic, say briefly
that it is outside what you can help with, then steer back to booking. If you
cannot make out what the caller said, ask them to repeat it instead of repeating
your last message.

The escalation line keeps the clinic agent out of medical advice. The last two lines keep it on scope and stop loops when the caller is unclear. The config also appends today's date because, in my live tests, the model could guess the wrong year for dates like "July 6th."

Tuning turn detection

Turn detection is how the agent decides you have stopped speaking. Set turn_detection.type to server_vad and the server ends the turn on silence. Leave it null and you control turns by committing the audio buffer, which is what I use for the file flow.

Server VAD has three settings worth knowing: threshold sets how loud audio must be to count as speech, silence_duration_ms sets how long a pause ends the turn, and prefix_padding_ms keeps a little audio before speech starts. If your agent interrupts people, raise silence_duration_ms first.
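To make those three settings concrete, here is a small sketch that builds a turn_detection block. The field names come from the paragraph above; the default numbers are illustrative assumptions of mine, not documented defaults, so tune them against your own recordings.

```python
def server_vad_config(threshold=0.5, silence_duration_ms=500, prefix_padding_ms=300):
    """Build a turn_detection block for server-side VAD.

    The default values are illustrative assumptions, not documented defaults.
    """
    return {
        "type": "server_vad",
        "threshold": threshold,                      # how loud audio must be to count as speech
        "silence_duration_ms": silence_duration_ms,  # pause length that ends the turn
        "prefix_padding_ms": prefix_padding_ms,      # audio kept from before speech starts
    }


# An agent that cuts callers off: wait longer before ending the turn.
patient = server_vad_config(silence_duration_ms=900)
print(patient)
```

The returned dict would replace the None value of turn_detection in the session config when you move from file input to a live microphone.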

Sending audio to the agent

Now we send the caller's voice. The audio must match the session format: mono 16-bit PCM at 24000 Hz, base64 encoded and sent in chunks.

The client streams the file in slices, then commits the buffer to mark the end of the turn:

async def send_audio(self, pcm_bytes, chunk_ms=100, commit=True):
    bytes_per_chunk = int(self._sample_rate * 2 * chunk_ms / 1000)
    for start in range(0, len(pcm_bytes), bytes_per_chunk):
        chunk = pcm_bytes[start:start + bytes_per_chunk]
        await self._t.send({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode(),
        })
    if commit:
        await self._t.send({"type": "input_audio_buffer.commit"})
    self.cost.audio_seconds += pcm_seconds(pcm_bytes, self._sample_rate)

If your sample rate or encoding does not match session.update, you may get static or silence instead of a clean error. Audio goes through input_audio_buffer.append, so it bills by duration rather than per message.
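send_audio relies on a pcm_seconds helper that is not shown above. The sketch below is one plausible version, assuming mono 16-bit samples; it is my guess at the shape of the repo's helper, not a copy of it. chunk_size_bytes is a hypothetical name for the same arithmetic send_audio does inline.

```python
def pcm_seconds(pcm_bytes: bytes, sample_rate: int, bytes_per_sample: int = 2) -> float:
    """Duration of mono PCM audio; 16-bit samples take 2 bytes each."""
    return len(pcm_bytes) / (sample_rate * bytes_per_sample)


def chunk_size_bytes(sample_rate: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Bytes needed to hold chunk_ms milliseconds of mono PCM."""
    return int(sample_rate * bytes_per_sample * chunk_ms / 1000)


one_second = b"\x00" * (24000 * 2)  # one second of 16-bit silence at 24000 Hz
print(pcm_seconds(one_second, 24000))  # 1.0
print(chunk_size_bytes(24000, 100))    # 4800
```

At 24000 Hz, each 100 ms slice is 4800 bytes before base64 encoding, which is a quick sanity check when a recording bills for more or less time than you expect.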

Receiving voice responses

After you request a response, audio arrives as response.output_audio.delta, the transcript arrives as response.output_audio_transcript.delta, and response.done closes the turn.

The client collects all of that in one async loop:

async def _collect_response(self):
    audio = bytearray()
    transcript, calls = [], []
    while True:
        event = await self._recv()
        etype = event["type"]
        if etype == "response.output_audio.delta":
            audio += base64.b64decode(event["delta"])
        elif etype == "response.output_audio_transcript.delta":
            transcript.append(event.get("delta", ""))
        elif etype == "response.function_call_arguments.done":
            calls.append(event)
        elif etype == "response.done":
            break
    return bytes(audio), "".join(transcript), calls

Decode the audio deltas, append them in order, and write the result to a response.wav file. To capture the caller's own words, set audio.input.transcription and read conversation.item.input_audio_transcription.completed.
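Writing the file takes only the standard library. This is a minimal sketch that assumes the collected bytes are mono 16-bit PCM at the session rate; write_wav is a name I picked for illustration.

```python
import wave


def write_wav(path: str, pcm_bytes: bytes, sample_rate: int = 24000) -> None:
    """Wrap raw mono 16-bit PCM in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 16-bit samples
        wav.setframerate(sample_rate)  # must match the session output rate
        wav.writeframes(pcm_bytes)


# Half a second of silence stands in for the decoded audio deltas.
write_wav("response.wav", b"\x00" * 24000)
```

If playback sounds sped up or slowed down, the sample rate passed here probably differs from the one in the session config.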

Building the Appointment Assistant Workflow

Now the pieces become a conversation: booking request, clarifying question, availability check, offered slots, confirmation. To carry context across turns, each new turn reconnects with the conversation id and opts into session resumption.

Adding tool calling to the voice agent

For the clinic, the agent must check availability before promising a time. Custom tools are how the model reaches your code: it emits a request, your application runs the function, and you send the result back.

The tool is a plain function plus a JSON schema that goes into the session config. Here is the schema from tools.py:

CHECK_AVAILABILITY_TOOL = {
    "type": "function",
    "name": "check_availability",
    "description": "Look up open appointment slots for a service on a given date. "
                   "Always call this before offering the caller a time.",
    "parameters": {
        "type": "object",
        "properties": {
            "service": {"type": "string", "description": "Service requested."},
            "date": {"type": "string", "description": "Requested date as YYYY-MM-DD."},
        },
        "required": ["service", "date"],
    },
}

The loop has a fixed shape. When the model wants the tool, it sends response.function_call_arguments.done with the arguments. You run the function, return a function_call_output, and then send response.create so the agent can continue. Miss that final response.create and the agent goes silent.

Flow diagram of the Grok voice tool loop moving from response.function_call_arguments.done to function_call_output to response.create to the audio reply.

The tool call round trip explained. Image by Author.
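The round trip can be sketched as a pure function that turns one tool request into the two messages that follow it. The conversation.item.create envelope and the name, call_id, and arguments fields are assumptions based on the OpenAI Realtime compatibility mentioned earlier, so check them against the events your session actually emits. fake_tool is a stand-in for the real registry.

```python
import json


def build_tool_reply(event: dict, run_tool) -> list[dict]:
    """Turn a response.function_call_arguments.done event into the two
    messages that follow it: the tool result, then response.create.

    Assumed event fields (OpenAI Realtime style): name, call_id, arguments.
    """
    arguments = json.loads(event.get("arguments") or "{}")
    result = run_tool(event["name"], arguments)
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        },
        {"type": "response.create"},  # without this the agent stays silent
    ]


def fake_tool(name, arguments):
    # Stand-in for the real registry; returns a canned slot list.
    return {"service": arguments["service"], "slots": ["09:30", "14:00"]}


event = {
    "type": "response.function_call_arguments.done",
    "name": "check_availability",
    "call_id": "call_123",
    "arguments": '{"service": "checkup", "date": "2026-07-06"}',
}
for message in build_tool_reply(event, fake_tool):
    print(message["type"])
```

Keeping the two messages in one returned list makes it harder to send the result and forget the response.create that follows it.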

Custom functions like this run in your code. The Streamlit demo registers three more from the same file: book_appointment, transfer_to_human, and end_call. Built-in tools, such as web search, X search, collections search, and remote MCP tools, execute on xAI's servers.

Handling tool failures

Tools fail, and a voice agent that assumes success can promise a slot that does not exist. My ToolRegistry.execute never raises: a failed lookup comes back as an {"error": ...} dict.

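def execute(self, name, arguments):
    handler = self._handlers.get(name)
    if handler is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return handler(**arguments)
    except ToolError as exc:
        # Expected failure raised by the tool itself.
        return {"error": str(exc)}
    except Exception as exc:
        # Bad arguments or an unexpected bug: report it instead of raising.
        return {"error": f"{name} failed: {exc}"}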

An explicit error state stops the agent from treating failed tool calls as success.

Adding cost tracking

Before you serve this to anyone, know what a call costs. Audio bills at $0.05 per minute, counting both what you send and what you receive. Text input events bill at $0.004 each. function_call_output results and response.create events are not billed.

The client tracks it as it goes, so cost is a property you read at any point:

@property
def audio_usd(self):
    rate = 0.05 + (0.01 if self.telephony else 0.0)
    return self.audio_seconds / 60 * rate

@property
def total_usd(self):
    return self.audio_usd + self.text_usd + self.tool_usd

An xAI provisioned number adds the $0.01 per minute telephony surcharge, which the helper applies when you set telephony=True. Tools hosted by xAI bill separately: web search and X search run about $5 per thousand calls, and file search about $2.50 per thousand.
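As a worked example of that arithmetic, the sketch below estimates one call from the rates quoted in this section. estimate_call_usd is a hypothetical standalone function, separate from the client's cost properties, and it leaves out fees for tools hosted by xAI.

```python
AUDIO_USD_PER_MIN = 0.05      # audio sent plus audio received
TELEPHONY_USD_PER_MIN = 0.01  # surcharge for an xAI provisioned number
TEXT_EVENT_USD = 0.004        # per text input event


def estimate_call_usd(audio_seconds, text_events=0, telephony=False):
    """Estimate the cost of one call, excluding xAI-hosted tool fees."""
    per_minute = AUDIO_USD_PER_MIN + (TELEPHONY_USD_PER_MIN if telephony else 0.0)
    return audio_seconds / 60 * per_minute + text_events * TEXT_EVENT_USD


# A 3-minute phone call with two typed messages.
print(round(estimate_call_usd(180, text_events=2, telephony=True), 4))  # 0.188
```

Three minutes at $0.06 is $0.18, and the two text events add $0.008, for $0.188 in total.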

Handling errors and edge cases

Most failures fall into a short list:

  • Missing or invalid API key returns 401 at the handshake, so check the key first

  • A blocked team returns 403, and a rate limit returns 429, which you retry with backoff

  • Malformed session config returns 400, usually a typo in a field name

  • Unsupported audio format gives static, not an error, so match the session rate

  • A missing response.create after a tool result leaves the agent hanging

  • A duplicate booking attempt can cause real problems, so do not retry blindly

Retrying a failed read like check_availability is safe, but retrying a failed write like an actual booking can double book a caller. Any action that changes data needs an idempotency check first.
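One way to make a booking safe to retry is an idempotency key: the first request stores its result under the key, and any retry with the same key gets that result back. The sketch below is a hypothetical in-memory version, not the repo's booking tool, and the key format is only an example.

```python
class BookingStore:
    """In-memory bookings keyed by an idempotency key (hypothetical design)."""

    def __init__(self):
        self._by_key = {}

    def book(self, idempotency_key: str, caller: str, slot: str) -> dict:
        # A retried request reuses its key, so return the first result
        # instead of creating a second booking.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        booking = {"id": len(self._by_key) + 1, "caller": caller, "slot": slot}
        self._by_key[idempotency_key] = booking
        return booking

    def count(self) -> int:
        return len(self._by_key)


store = BookingStore()
first = store.book("call-42:book:2026-07-06T09:30", "Dana", "2026-07-06T09:30")
retry = store.book("call-42:book:2026-07-06T09:30", "Dana", "2026-07-06T09:30")
print(first == retry, store.count())  # True 1
```

In a real deployment the key and the booking would live in the same database transaction, so that two workers cannot both pass the check at once.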

Using ephemeral tokens for client apps

Everything so far assumes the code runs on your server, where the API key belongs. If a browser or mobile app connects directly, use ephemeral tokens.

Your server calls POST https://api.x.ai/v1/realtime/client_secrets with your key, gets back a token response, and passes the token value to the client. In my run, the response included value and expires_at:

@app.post("/session")
async def create_session():
    async with httpx.AsyncClient() as client:
        response = await client.post(
            CLIENT_SECRETS_URL,
            headers={"Authorization": f"Bearer {os.environ['XAI_API_KEY']}"},
            json={"expires_after": {"seconds": 300}},
        )
    return response.json()

Browsers cannot set custom WebSocket headers, so the token rides in the sec-websocket-protocol header instead, prefixed with xai-client-secret. (the trailing dot is part of the prefix).

Turning the Workflow Into a FastAPI Endpoint

An endpoint lets a frontend or another service call the workflow. The route validates the request body with a Pydantic model, takes a typed message or an audio path, and returns the transcript, response audio, tool log, latency, and estimated cost.

@app.post("/appointments/voice")
async def appointments_voice(body: VoiceRequest):
    fail = {"check_availability"} if body.simulate_tool_failure else None
    assistant = AppointmentAssistant(voice=body.voice, telephony=body.telephony, fail_tools=fail)
    if body.text:
        result = await assistant.run_live(text=body.text, conversation_id=body.conversation_id)
    else:
        pcm = load_wav_as_pcm(body.audio_path, 24000)
        result = await assistant.run_live(pcm, conversation_id=body.conversation_id)
    return {
        "transcript": result.transcript,
        "audio_wav_base64": base64.b64encode(encode_wav_bytes(result.audio, 24000)).decode(),
        "tool_calls": result.tool_calls,
        "latency_seconds": round(result.latency_s, 3),
        "estimated_cost_usd": round(result.cost.total_usd, 6),
        "audio_seconds": round(result.cost.audio_seconds, 2),
        "conversation_id": result.conversation_id,
    }

Run it with uvicorn app:app --reload and open http://localhost:8000/docs. Read XAI_API_KEY from the server environment and never accept it from a request body.
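On the calling side, the main chore is turning audio_wav_base64 back into a playable file. Here is a rough sketch that uses only the standard library; save_reply_audio is a hypothetical helper, and the sample body is fabricated to mirror the keys the route returns, not captured from a real run.

```python
import base64
import json


def save_reply_audio(response_body: str, path: str = "reply.wav") -> dict:
    """Decode audio_wav_base64 from the endpoint's JSON and write it to disk.

    Returns the remaining fields so the caller can log transcript and cost.
    """
    payload = json.loads(response_body)
    wav_bytes = base64.b64decode(payload.pop("audio_wav_base64"))
    with open(path, "wb") as handle:
        handle.write(wav_bytes)
    return payload


# A fabricated response body with the same keys the route returns.
sample = json.dumps({
    "transcript": "We have 9:30 or 2 pm on July 6th.",
    "audio_wav_base64": base64.b64encode(b"RIFF-demo-bytes").decode(),
    "estimated_cost_usd": 0.0123,
    "conversation_id": "conv_demo",
})
summary = save_reply_audio(sample)
print(summary["transcript"], summary["estimated_cost_usd"])
```

Passing the returned conversation_id back in the next request body is what lets a second turn continue the same booking.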

Testing the voice endpoint in browser. Video by Author.

Testing the Full Voice Agent

An endpoint that returns 200 is not a tested agent. Test behavior: a clean booking over two turns, a fully booked day, a tool failure, and a medical escalation.

You can run these checks from the local script, the FastAPI route, or the Streamlit demo shown near the end:

  • A straightforward booking: does it check availability before offering a time?

  • A resumed booking turn: does it call book_appointment after the caller chooses a time and gives a name?

  • Unclear audio: does it ask for a repeat rather than inventing a request?

  • A failed tool call: does it apologize and recover instead of stalling?

  • A medical request: does it escalate as the prompt says?

If a caller says they have had chest pain since morning, the core assistant should not book anything, and the Streamlit demo should call transfer_to_human.
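Some of these checks can be written as assertions over the tool log. The sketch below assumes each entry in result.tool_calls is a dict with a name key, which I have not confirmed against the repo, and the two logs are hand-written stand-ins for real runs.

```python
def tool_names(tool_calls):
    """Extract tool names in call order; assumes each entry has a 'name' key."""
    return [call["name"] for call in tool_calls]


def checked_before_booking(tool_calls) -> bool:
    """True if no book_appointment call precedes the first check_availability."""
    names = tool_names(tool_calls)
    if "book_appointment" not in names:
        return True
    if "check_availability" not in names:
        return False
    return names.index("check_availability") < names.index("book_appointment")


def escalated_without_booking(tool_calls) -> bool:
    """True if the agent transferred the caller and booked nothing."""
    names = tool_names(tool_calls)
    return "transfer_to_human" in names and "book_appointment" not in names


# Hand-written logs standing in for result.tool_calls from a real run.
booking_log = [{"name": "check_availability"}, {"name": "book_appointment"}]
chest_pain_log = [{"name": "transfer_to_human"}]
print(checked_before_booking(booking_log), escalated_without_booking(chest_pain_log))
```

Checks like these cover the tool order, not the wording of the reply, so they complement listening to the audio rather than replacing it.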

Grok Voice Agent Builder: Readiness Notes

The single voice model path can reduce the handoffs between separate speech-to-text, language model, and text-to-speech services mentioned at the start. xAI reports sub-second time to first audio, and a separate test measured around 0.78 seconds. The tool loop depends on the order of tool result events and response.create.

The beta still has limits. The benchmark score above is xAI's own claim, the console UI may change, and tool billing needs separate tracking. I would test it against my own calls before relying on it.

Deployment considerations

Before deployment, keep the API key server side, use ephemeral tokens for client apps, log transcripts and tool calls, add a recording notice, avoid storing audio unless needed, build a human handoff, and test with noise, accents, interruptions, and callers who change their minds.

Two limits shape deployment design: the API allows 100 concurrent sessions per team and caps a single session at 120 minutes. Resumed session history is dropped after 30 minutes of inactivity. If you handle patient data, read xAI's compliance terms carefully.
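One simple way to stay under the concurrency limit inside a single process is a semaphore around each session. The sketch below simulates sessions with a short sleep, so it runs without a key; run_call and peak_concurrency are hypothetical names, and a multi-process deployment would need a shared counter instead.

```python
import asyncio

MAX_SESSIONS = 100  # the per-team concurrency limit quoted above


async def run_call(call_id, gate, state):
    """Hold a slot for the duration of one (simulated) voice session."""
    async with gate:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stand-in for the real WebSocket session
        state["active"] -= 1
    return call_id


async def run_calls(total, limit=MAX_SESSIONS):
    gate = asyncio.Semaphore(limit)
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(run_call(i, gate, state) for i in range(total)))
    return state["peak"]


def peak_concurrency(total, limit=MAX_SESSIONS):
    """Synchronous wrapper that reports the highest number of parallel calls."""
    return asyncio.run(run_calls(total, limit))


print(peak_concurrency(total=5, limit=2))  # 2
```

Callers beyond the limit wait for a free slot here; in a phone workflow you would more likely play a hold message or route them to a human.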

When should you use Grok Voice Agent Builder?

I would consider a voice agent when the interaction happens live and the agent needs to act, not just answer. Appointment booking, customer support, and internal lookup workflows are the clearest cases.

I would avoid it when a text chatbot would work, when you only need batch transcription, when the workflow has not been tested with real users, or when you cannot yet handle errors, privacy, and escalation safely.

Voice makes sense when the conversation has to happen out loud and the agent has to do something during it. If neither is true, the extra complexity usually is not needed.

The Streamlit demo in this repo lets you test the agent with text, uploaded audio, or a microphone recording. You can watch the transcript, tool calls, event log, booking state, and cost update after each turn. The source is on GitHub. The screen recording below shows that workflow against a live key.

The Streamlit demo running a multi-turn booking flow against a live Grok Voice session. Video by Author.

Conclusion

At this point, the appointment assistant is wired to the Voice Agent API in both a local script and a FastAPI route. The Streamlit demo uses the same client and adds the booking, transfer, and end call tools.

The same pattern works for other voice workflows. Swap the clinic prompt for a support prompt, replace check_availability with an order lookup tool, and keep the same WebSocket, tool loop, and cost tracking code. Before deployment, test it with your own calls, tools, and escalation rules.

If you want to practice the API side before wiring this into a voice workflow, our Introduction to APIs in Python course covers requests, headers, status codes, authentication, and JSON payloads. For the serving layer, our Introduction to FastAPI course covers routes, request models, async handlers, and endpoint testing.


Author
Khalid Abdelaty

I’m a data engineer and community builder who works across data pipelines, cloud, and AI tooling while writing practical, high-impact tutorials for DataCamp and emerging developers.

FAQs

How is the Voice Agent API different from xAI's speech-to-text API?

They solve different problems. In short: use the Voice Agent API for live conversation and the speech-to-text API for recordings.

Should I keep one WebSocket open for the whole call?

Yes, for an app with a live chat UI. Reconnecting every turn can resume from a stale server snapshot if the caller replies quickly. In the Streamlit demo, I keep one socket open for the whole call and only use resumption if the socket drops.

Why does my agent go silent after a tool call?

The tool section covered the common cause: a missing response.create after the function_call_output. The less obvious version is timing. If you send response.create while the previous turn's audio is still playing, replies overlap.

Why does my voice input get transcribed wrong?

First, play back the exact audio you sent. If it sounds wrong, fix the microphone path before touching the prompt. If it sounds fine, use a language hint and teach the prompt to repair small transcription errors from context, especially times, names, and service words.

Should a booked appointment disappear from availability?

Yes. A booking tool should change state, even in a demo. In this project, book_appointment removes the slot from the in-memory schedule, so a later availability check in the same server session will not offer it again.
