
OpenAI's Audio API: A Guide With Demo Project

Learn how to build a voice-to-voice assistant using OpenAI's latest audio models and streamline your workflow using the Agents API.
Mar 25, 2025  · 12 min read

OpenAI has introduced its next generation of audio models focused on enhancing speech-to-text and text-to-speech capabilities. These latest models promise improved accuracy, especially in complex situations like accents or noisy settings, and offer more customizable voice interactions.

In this tutorial, I will explain step-by-step how to use these new OpenAI audio models to build a voice-to-voice AI assistant. Our goal is to develop an AI assistant that can understand spoken input and reply with a naturally synthesized voice tailored to specific needs.


The OpenAI Audio API

This new release from OpenAI includes three models:

  1. gpt-4o-mini-tts: A text-to-speech model capable of generating audio from text with various tones and voices. A neat feature of this model is that we can guide how the voice sounds by giving specific text instructions. This brings a high level of customization, allowing for the creation of unique and tailored voice experiences. You can try it out on OpenAI.fm.
  2. gpt-4o-transcribe and gpt-4o-mini-transcribe: Two speech-to-text models designed for converting spoken language into written text. Their main function is to provide highly accurate and reliable transcriptions of audio. These models demonstrate a lower word error rate (WER), meaning they make fewer mistakes recognizing spoken words compared to previous solutions.

Next generation models in the OpenAI audio API: gpt-4o mini TTS, gpt-4o mini transcribe, and gpt-4o transcribe

These new models come with the following pricing:

OpenAI audio API pricing

Voice Assistant Project

In this tutorial, I'll guide you through building an AI voice assistant right in your terminal. This voice assistant will essentially mimic a popular text-based AI model but will interact entirely through spoken language. Imagine being able to speak directly to your computer, ask any question you have, and receive a vocal response almost instantly.

Our project will use a simple yet effective architecture. We'll begin by using your microphone to capture your spoken prompt. Once recorded, we'll convert this audio input into text with the help of advanced speech-to-text models.

This text is then fed into a large language model to generate a suitable response. Finally, we'll convert the text response back into audio, allowing the assistant to “speak” the answer back to you. Each step of this process is designed to ensure that our voice assistant is both accurate and engaging.

Audio-to-audio chain architecture using openai audio models

Although OpenAI offers a dedicated Realtime API that can enhance interactions by streamlining the entire process, we will opt for a different approach. The Realtime API, while impressive and perfect for developers looking for swift integrations, is often costlier and offers less flexibility.

By choosing to build our project using separate components for each step, we gain greater control over the customization of our AI assistant. This approach lets us decide the models we want to use, thereby optimizing for specific needs, whether it be accuracy, speed, or preference in tone of response. This way, our voice assistant becomes not only a powerful tool but also a highly tailored one, capable of fitting unique project requirements.

All the code we develop here is available in this GitHub repository.

Python Setup

To get started, we'll first set up a new Anaconda environment named audio-demo. Anaconda's environments enable us to create isolated spaces for each project where we can install specific versions of packages without conflicts. Execute the following commands in your command-line interface:

conda create -n audio-demo -y python=3.9
conda activate audio-demo
pip install openai
pip install numpy
pip install dotenv
pip install sounddevice
pip install scipy

Let's break down what each command and package does:

  1. Creating the environment:
  • conda create -n audio-demo -y python=3.9: This command creates a new environment called audio-demo with Python version 3.9. The -y flag automatically agrees to the package installations without prompting.
  2. Activating the environment:
  • conda activate audio-demo: Activates the newly created audio-demo environment, so we can work within it.
  3. Installing packages:
  • pip install openai: OpenAI is a library that provides easy access to OpenAI's models and APIs.
  • pip install numpy: NumPy is a library essential for numerical computing.
  • pip install dotenv: Dotenv helps load environment variables from a .env file, making configuration management easier and safer.
  • pip install sounddevice: Sounddevice allows us to record and play sound using simple functions, which is ideal for handling audio input and output in Python.
  • pip install scipy: SciPy builds on NumPy and provides additional functionality for scientific and technical computing, such as signal processing. In our case, we'll use it to store the audio file.

With our audio-demo environment set up, we are ready to start working on our AI assistant that can process audio inputs. This structured setup helps us maintain a clean development space, ensuring all dependencies are in place for our project.

OpenAI API Key Setup

To use the OpenAI API, we need an API key. Go to the API keys page and generate a key by clicking the "Create new secret key" button. Copy the key, create a file named .env, and paste it there in the following format:

OPENAI_API_KEY=<paste_your_api_key_here>
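
Optionally, you can verify that the key is picked up before making any API calls. Here's a quick sanity check (it assumes the .env file sits in the directory you run the script from):

import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

# The OpenAI client reads OPENAI_API_KEY from the environment,
# so we only check that the variable is set.
if not os.getenv("OPENAI_API_KEY"):
  raise RuntimeError("OPENAI_API_KEY not found. Check your .env file.")
print("API key loaded.")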

Text to Audio Example

Let's walk through the steps to create a Python script that uses OpenAI's text-to-audio capabilities, transforming text into speech with a personalized touch. We write our code in a file named text_to_audio.py in the same folder as the .env file.

Import required libraries

First, we need to import the necessary libraries that will make up our script:

import asyncio
from openai import AsyncOpenAI
from openai.helpers import LocalAudioPlayer
from dotenv import load_dotenv

Let’s quickly walk through what each of these imports does:

  • asyncio: This library is required for writing asynchronous code in Python, which is essential for working with streaming APIs.
  • AsyncOpenAI: A part of the OpenAI library, this provides tools to interact with OpenAI's APIs asynchronously.
  • LocalAudioPlayer: This helper from OpenAI allows us to play audio locally on our machine.
  • load_dotenv: Loads environment variables from the .env file, which is where we store sensitive information like our API keys.

Load environment variables

Next, we load our API key from the .env file using the load_dotenv function:

load_dotenv()

This ensures that our script has secure access to the API key.

Initialize OpenAI

We create an instance of AsyncOpenAI to start interacting with the OpenAI API:

openai = AsyncOpenAI()

Write the main function

Now we define our core function, text_to_audio(), which uses OpenAI's text-to-audio feature to synthesize the input text and play the resulting audio:

async def text_to_audio(text, tone_and_style_instructions):
  async with openai.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input=text,
    instructions=tone_and_style_instructions,
    response_format="pcm",
  ) as response:
    await LocalAudioPlayer().play(response)

Let’s quickly explain what we did above:

  • We specify the model and voice parameters to control the speech synthesis. The model used is gpt-4o-mini-tts and the voice selected is "coral".
  • The response_format is set to "pcm", suitable for audio processing.
  • The LocalAudioPlayer then plays the audio response generated by the API.

Execute the function

We complete the script with the following lines to ensure the text_to_audio() function runs when we execute the script:

if __name__ == "__main__":
  asyncio.run(text_to_audio("Hello world!", "Enthusiastic voice."))

This code block checks if the script is the main module being run and executes the text_to_audio() function using asyncio.run() to handle the asynchronous logic.

With these steps, our script is ready to convert text input into speech using OpenAI's text-to-audio service. This setup allows us to experiment with different inputs and styles, bringing text to life through sound.

We can run the script using the command:

python text_to_audio.py
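
Once this works, it's worth experimenting with richer instructions. For example, here's a small sketch that swaps in a more detailed style prompt (the wording below is just an illustration) by replacing the final block of text_to_audio.py:

# Illustrative only: a more detailed set of tone and style instructions.
calm_instructions = """
Tone: Calm and reassuring.
Pacing: Slow and deliberate, with brief pauses between sentences.
"""

if __name__ == "__main__":
  asyncio.run(text_to_audio(
    "Take a deep breath. Everything is going to be fine.",
    calm_instructions,
  ))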

The complete code can be found here.

Audio Transcription From a File

In this section, let's explore how to transcribe an audio file into text using OpenAI’s audio transcription tool. Our script is designed to handle audio files asynchronously to make the process efficient and quick. We'll implement this script in a file named audio_to_text.py.

The imports and initial setup are the same as before, except that we don't need to import the LocalAudioPlayer here. Here's how we can write a function that transcribes an audio file:

async def transcribe_audio(audio_filename = "audio.wav"):
  audio_file = await asyncio.to_thread(open, audio_filename, "rb")
  stream = await openai.audio.transcriptions.create(
    model="gpt-4o-mini-transcribe",
    file=audio_file,
    response_format="text",
    stream=True,
  )
  transcript = ""
  async for event in stream:
    if event.type == "transcript.text.delta":
      print(event.delta, end="", flush=True)
      transcript += event.delta
  print()
  audio_file.close()
  return transcript

Let’s break down what happens here:

  1. Opening the audio file:
  • audio_file = await asyncio.to_thread(open, audio_filename, "rb"): This line opens the audio file in binary read mode ("rb"). The method asyncio.to_thread() runs the file-opening operation in a separate thread, preventing it from blocking other parts of the program.
  2. Creating a transcription stream:
  • stream = await openai.audio.transcriptions.create(...): This line calls the transcription API.
  • We specify the model parameter as gpt-4o-mini-transcribe, designed specifically for transcription tasks.
  • The file parameter holds our opened audio file.
  • response_format="text" tells the API to return the transcription as text.
  • stream=True is used to stream the transcription in real time, which means as soon as a part of the audio is processed, it is immediately returned, speeding up the response.
  3. Processing the transcription stream:
  • async for event in stream: starts a loop that reads events from the transcription stream as they occur.
  • if event.type == "transcript.text.delta": checks each event type and processes it if it is of type transcript.text.delta, which indicates that a part of the transcription is ready.
  • print(event.delta, end="", flush=True): prints the incremental transcription as it becomes available, ensuring our output is real-time.
  4. Closing the audio file:
  • audio_file.close(): After we complete the transcription, it's good practice to close the audio file to free up system resources.

By calling the transcribe_audio() function, we can convert an audio file into text efficiently and process it in a streamed fashion to get immediate feedback. This setup is ideal for applications that need fast transcription or involve long audio files.

You can try it by placing an audio file in the same folder as the script, replacing audio.wav with the name of your audio file, and running the command:

python audio_to_text.py

The complete code can be found here.
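
As a side note, if you don't need the incremental output, the transcription endpoint can also be called without streaming, in which case it returns the full transcript in a single response. Here's a minimal sketch of that variant (same model and file assumptions as above):

async def transcribe_audio_simple(audio_filename="audio.wav"):
  # Non-streaming variant: the API returns the whole transcript at once.
  with open(audio_filename, "rb") as audio_file:
    transcript = await openai.audio.transcriptions.create(
      model="gpt-4o-mini-transcribe",
      file=audio_file,
      response_format="text",
    )
  return transcript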

Audio Transcription from Microphone

Because our goal is to create a voice assistant, we need to record the user's audio prompt into an audio file.

We will create a new file named audio_recorder.py with a function called record_audio(). This function captures sound from the microphone and saves it to an audio file. We won't go into much detail about how it works because it is not the main focus of this article:

import sounddevice as sd
import numpy as np
import scipy.io.wavfile as wavfile

SAMPLE_RATE = 44100  # Sample rate in Hz

def record_audio(filename="output.wav"):
    print("[INFO: Recording... Press <Enter> to stop]")
    audio_data = []  # Initialize a list to store audio frames
    def callback(indata, frames, time, status):
        audio_data.append(indata.copy())
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=callback, dtype='int16'):
        input()  # Wait for the user to press Enter to stop recording
    print("[INFO: Recording complete]")
    print()
    audio_data = np.concatenate(audio_data)  # Concatenate the list into a single array
    wavfile.write(filename, SAMPLE_RATE, audio_data)
    return audio_data

When we call this function, it will start recording from the user's microphone. It waits until the user presses "Enter" and then saves the audio into a file with the given filename.

To test this, we can combine this function with the transcription function above to transcribe a message spoken by the user. Here's how we can create a new file named record_and_transcribe.py to implement this:

import asyncio
from audio_to_text import transcribe_audio
from audio_recorder import record_audio

async def main():
  record_audio("prompt.wav")
  await transcribe_audio("prompt.wav")

if __name__ == "__main__":
  asyncio.run(main())

You can try running it using the command python record_and_transcribe.py. The script will record what you say until you press "Enter" and then transcribe what you said.

Building an Audio Assistant

In this section, we put it all together to build an audio assistant. We implement it in a new file called audio_assistant.py by following these steps:

  • Record the user's audio prompt using the record_audio() function.
  • Convert the audio prompt to text with the transcribe_audio() function.
  • Use a regular text-to-text model like gpt-4o to generate an answer.
  • Finally, convert the textual answer to audio using the text_to_audio() function.
  • Repeat this until the user exits.

The following diagram illustrates this:

Audio-to-audio chain architecture with loop

I encourage you to try building it yourself before reading further.

First, we import the functions we implemented before and initialize the OpenAI client.

# Import the functions we created
from text_to_audio import text_to_audio
from audio_to_text import transcribe_audio
from audio_recorder import record_audio

# Import other dependencies and initialize OpenAI
import asyncio
from openai import AsyncOpenAI
from dotenv import load_dotenv
load_dotenv()
openai = AsyncOpenAI()

Then, we need a function to generate the answer. This uses the regular OpenAI chat completions API with a model like gpt-4o or any other text-to-text model. If you're new to this, you might want to check out this GPT-4o API tutorial.

Here's an async implementation of this function:

async def get_answer(prompt):
  stream = await openai.chat.completions.create(
    model="gpt-4o",
    messages=[
      {"role": "user", "content": prompt}
    ],
    stream=True,
  )
  answer = ""
  async for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
      answer += content
      print(content, end="", flush=True)
  print("\n\n")
  return answer

To implement the main loop, we follow the steps outlined above:

async def main(tone_and_style_instructions):
  await text_to_audio("Hello, how can I help you today?", tone_and_style_instructions)
  while True:
    record_audio("prompt.wav")
    prompt = await transcribe_audio("prompt.wav")
    print()
    answer = await get_answer(prompt)
    await text_to_audio(answer, tone_and_style_instructions)

Finally, we run the main loop when the script is executed:

if __name__ == "__main__":
  tone_and_style_instructions = "Enthusiastic voice."
  asyncio.run(main(tone_and_style_instructions))

Here's a demo of it in action:

The complete code can be found here.

Further Improvements

If we try using the assistant with complex tone and style instructions, we may notice a mismatch between the content of the words and the tone of the voice. For example, consider the following instructions for an "Emo Teenager" voice taken from the OpenAI website:

tone_and_style_instructions = """
Tone: Sarcastic, disinterested, and melancholic, with a hint of passive-aggressiveness.
Emotion: Apathy mixed with reluctant engagement.
Delivery: Monotone with occasional sighs, drawn-out words, and subtle disdain, evoking a classic emo teenager attitude.
"""

The voice tone of the audio will match this style, but the text content of the answer generated by the get_answer() function won't take this into account, which can lead to a bit of inconsistency. Here's an example:

To overcome this, we can pair the user message in the get_answer() function with a system prompt indicating that the generated text should follow the tone and style instructions.

For this, we provide the tone_and_style_instructions as the second argument of the get_answer() function and modify the chat request by adding a system message:

async def get_answer(prompt, tone_and_style_instructions):
  stream = await openai.chat.completions.create(
    model="gpt-4o",
    messages=[
      {
        "role": "system", 
        "content": 
          f"""
          The text you generate is being used in a text-to-voice model.
          Make sure your answer matches the guidelines {tone_and_style_instructions}
          """
      },
      {"role": "user", "content": prompt}
    ],
    stream=True,
  )
  answer = ""
  async for chunk in stream:
    content = chunk.choices[0].delta.content
    if content is not None:
      answer += content
      print(content, end="", flush=True)
  print("\n\n")
  return answer

Here’s how the model answers now:

As you can see, the text that it generates now matches the tone instructions. The complete code can be found here.

Building an Assistant with the Agents API

In the previous example, we manually called the speech-to-text and text-to-speech models to build a voice assistant. By doing so, we learned how to use the new models from OpenAI's voice API explicitly.

However, if the goal is simply to build a voice assistant, there's an easier way: the Agents SDK. It has been updated to automatically handle the voice-to-text-to-voice workflow we've implemented here.

If it's your first time using the Agents SDK, you may want to take a look at this tutorial on the OpenAI Agents SDK.

Before we start, we need to install one more dependency:

pip install 'openai-agents[voice]'

With that out of the way, we start by importing everything we need to run an agent with a voice pipeline:

from dotenv import load_dotenv
load_dotenv()
import asyncio
from agents import Agent
from agents.voice import (
    AudioInput,
    SingleAgentVoiceWorkflow,
    VoicePipeline,
    VoicePipelineConfig,
    TTSModelSettings,
)
from audio_recorder import record_audio
from audio_player import AudioPlayer

The AudioPlayer doesn't belong to any package. It's imported from a local file and contains a simple class that helps us play audio in real time. This is required because the VoicePipeline returns the audio chunk by chunk, and we play back each chunk as we receive it. Here's the content of the audio_player.py file:

import numpy as np
import sounddevice as sd
class AudioPlayer:
  def __enter__(self):
    self.stream = sd.OutputStream(samplerate=24000, channels=1, dtype=np.int16)
    self.stream.start()
    return self
  def __exit__(self, tp, val, tb):
    self.stream.close()
  def add_audio(self, audio_data):
    self.stream.write(audio_data)
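
If you want to confirm that local audio playback works before involving the API, you can feed the player a short generated tone. This is a standalone sanity check, not part of the original project; the 24,000 Hz sample rate matches what the pipeline streams:

import numpy as np
import sounddevice as sd
from audio_player import AudioPlayer

# Generate a 0.5-second, 440 Hz test tone as 16-bit mono samples.
t = np.linspace(0, 0.5, int(24000 * 0.5), endpoint=False)
tone = (0.2 * np.sin(2 * np.pi * 440 * t) * 32767).astype(np.int16)

with AudioPlayer() as player:
  player.add_audio(tone)
  sd.sleep(600)  # give the buffered audio time to finish playing before the stream closes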

The next step is to create an agent:

agent = Agent(
  name="Voice Assistant",
  instructions="You’re a helpful assistant speaking to a human.",
  model="gpt-4o-mini",
)

Here’s a description of the parameters we used:

  • name: This can be anything we want.
  • instructions: These define the agent's behavior and act as its system prompt.
  • model: The model that is used to generate answers.

The agent works as the get_answer() function we implemented previously. Think of it as the part of the pipeline that provides an answer to a text prompt.
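
Before adding voice, you can sanity-check the agent on plain text using the SDK's Runner. Here's a minimal sketch (the prompt is arbitrary):

from agents import Runner

async def test_agent():
  # Run the agent on a text prompt, with no audio involved.
  result = await Runner.run(agent, "In one sentence, what can you help me with?")
  print(result.final_output)

# To try it: asyncio.run(test_agent())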

Next, we define the pipeline. This is where we specify all the configurations related to the voice:

pipeline = VoicePipeline(
  workflow=SingleAgentVoiceWorkflow(agent),
  stt_model="gpt-4o-mini-transcribe",
  tts_model="gpt-4o-mini-tts",
  config=VoicePipelineConfig(
    tts_settings=TTSModelSettings(
      voice="coral",
      instructions="""
        Speak in an enthusiastic voice.
      """
    )
  )
)

Here’s a breakdown of some of the parameters:

  • stt_model: The model to use to convert speech into text.
  • tts_model: The model used to convert text into speech.
  • config: Provides the pipeline configuration. Here we use it to specify the voice we want the tts_model to use, as well as the speech instructions.

Finally, we run the main loop, similarly to what we did before:

async def main():
  while True:
    audio_input = AudioInput(buffer=record_audio())
    result = await pipeline.run(audio_input)
    with AudioPlayer() as player:
      async for event in result.stream():
        if event.type == "voice_stream_event_audio":
          player.add_audio(event.data)

if __name__ == "__main__":
    asyncio.run(main())

You can find a complete implementation here.

Adding tools

Note that this loop will run forever, as we didn't specify a stop condition. When executing it, you'll have to kill the process manually to stop it. One way to implement a stop condition is to provide a tool to the agent.

Tools are functions that we provide to the agent so that it can execute them for us. In this case, we can provide a function that stops the script. The agent uses the function name and docstring to decide whether it should call it.

from agents import function_tool
@function_tool
def stop_conversation():
  """Stop the conversation."""
  exit()

Then, we provide the tool to the agent:

agent = Agent(
  …
  tools=[stop_conversation], # Add this line when creating the agent
)

With this implementation, if we say, “I would like to stop the conversation,” the agent will understand that it needs to call the stop_conversation() function. Note that this implementation will not exit gracefully because of the way the stop_conversation() function is implemented.
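
If you want a cleaner shutdown, one option (a sketch under the same setup, not part of the original script) is to have the tool set a flag and let the main loop check it after each turn:

from agents import function_tool

conversation_over = False

@function_tool
def stop_conversation():
  """Stop the conversation."""
  global conversation_over
  conversation_over = True
  return "Goodbye!"

# In main(), replace `while True:` with:
#   while not conversation_over:
#       ...

This way, the agent's final spoken reply still plays before the loop exits on the next iteration check.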

Check this file if you want a complete script.

Using multiple agents

A neat feature of the agents API is that we can configure multiple agents to work together. Here’s an example taken from the OpenAI documentation:

from agents.extensions.handoff_prompt import prompt_with_handoff_instructions

spanish_agent = Agent(
  name="Spanish voice assistant",
  handoff_description="A spanish speaking agent.",
  instructions=prompt_with_handoff_instructions(
    "You're speaking to a human, so be polite and concise. Speak in Spanish.",
  ),
  model="gpt-4o-mini",
)

agent = Agent(
  name="Voice Assistant",
  instructions=prompt_with_handoff_instructions("""
    You're speaking to a human, so be polite and concise.
    If the user speaks in Spanish, hand off to the Spanish agent.
  """),
  model="gpt-4o-mini",
  handoffs=[spanish_agent],  # allow the main agent to hand the conversation off
)

In this example, we define a second agent designed to speak Spanish and modify the first agent with handoff instructions. When the handoff condition is met (the user speaks Spanish), the first agent hands the conversation off to the Spanish agent, which continues it.

The full script is available here.

Conclusion

By using the advanced capabilities of OpenAI's latest audio models, we have created a system that can effectively transcribe spoken language into text and generate human-like speech from textual responses. This project demonstrates not only the potential of current technology but also how accessible these tools have become for developers interested in creating custom AI solutions.


Author: François Aubry
Full-stack engineer & founder at CheapGPT.