
A Beginner’s Guide to the ElevenLabs API: Transform Text and Voice into Dynamic Audio Experiences

Harness the capabilities of the ElevenLabs API, a powerful AI voice generator. Learn how to transform text into speech and clone voices with this technology.
May 2024 · 9 min read

In an era of technological innovation, the power of voice-driven applications is transforming how we interact with the world. From enhancing accessibility for those with visual impairments to creating more dynamic and engaging user experiences, the versatility of voice technology is broad and impactful.

ElevenLabs, a leading AI voice generator, offers cutting-edge capabilities in text-to-speech and speech-to-speech synthesis.

In this guide, we will explore these features and demonstrate how you can harness their potential to transform written text and voice recordings into lifelike speech.

Whether you are a developer looking to integrate voice functionality into your applications, a content creator seeking to produce multilingual voiceovers, or a business aiming to improve customer interactions through automated systems, this article will provide you with the essential knowledge to get started.

What are Text-to-Speech Engines?

A text-to-speech (TTS) engine is a sophisticated technology that receives written text as input and transforms it into spoken audio. This allows users to hear rather than read the text, making digital content more accessible and interactive. If you’re interested in exploring TTS engines further, detailed information can be found in this guide on the best text-to-speech engines.

What is ElevenLabs?

ElevenLabs is one of the leading platforms offering both text-to-speech (TTS) and speech-to-speech (STS) synthesis. This technology harnesses advanced machine learning models to generate realistic and natural-sounding audio from written text or even from one voice to another.

Its capability to deliver high-quality audio makes it a top choice among developers, content creators, and businesses aiming to enhance user engagement through dynamic audio experiences.

In this tutorial, we will focus on how to use the ElevenLabs API in Python. We’ll cover everything from obtaining your API key and setting up your development environment to initializing the library and using its functions to produce speech. To get a sense of what you can achieve, listen to this audio clip that was generated using its functionality. We're going to learn how to create something similar, step by step.

Whether you’re looking to integrate voiceover into multimedia content or develop accessible applications for visually impaired users, ElevenLabs provides the tools necessary to bring your audio to life. By the end of this guide, you will have a thorough understanding of how to use its API to transform text into speech with unparalleled clarity and realism.

Text-to-speech Using the ElevenLabs API in Python

1. Create an API key

The first step is to sign up for a free ElevenLabs account. Once we’re logged in, we can click on the profile icon and select the “Profile + API key” option. Here, our API key will already be generated for us. We need to make sure to save this key as we will need it to authenticate our requests.
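Hard-coding the key directly in source files risks leaking it when code is shared. A common alternative is to read it from an environment variable instead. Here is a minimal sketch; the variable name ELEVENLABS_API_KEY is our own choice for this example, not something the library requires:

```python
import os

# Read the key from an environment variable instead of hard-coding it.
# "ELEVENLABS_API_KEY" is a name we chose for this example.
api_key = os.environ.get("ELEVENLABS_API_KEY", "")

if not api_key:
    print("Warning: ELEVENLABS_API_KEY is not set")
```

The resulting api_key string can then be passed to the client when we initialize it in the next step.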

2. Install and import the ElevenLabs Python package

To interact with the ElevenLabs API using Python, we need to install their official package. We can do this using pip, the Python package installer:

$ pip install elevenlabs

Now, we can import the necessary components from the package into a new Python file.

from elevenlabs.client import ElevenLabs
from elevenlabs import play, save, stream, Voice, VoiceSettings

3. Generating audio

Once our environment is set up, we can generate our first audio clip by creating an instance of the ElevenLabs client using our API key.

client = ElevenLabs(api_key="YOUR_API_KEY")

Then, we use the .generate method to convert the text into audio.

audio = client.generate(
   text="Welcome to Datacamp's beginner's guide to the ElevenLabs API",
   voice="Brian"
)

We can immediately play the generated audio:

play(audio)

Or save it as a file:

save(audio, "output.mp3")

4. Customizing voices

ElevenLabs provides numerous customization options to tailor the voice to your preferences. We can adjust settings like stability, similarity boost, and style. For example:

audio = client.generate(
   text="Welcome to Datacamp's beginner's guide to the ElevenLabs API.",
   voice=Voice(
       voice_id='nPczCjzI2devNBz1zQrb',
       settings=VoiceSettings(
           stability=0.8, similarity_boost=0.6, style=0.2, use_speaker_boost=True)
   )
)

The voice_id corresponds to specific pre-made voices by ElevenLabs; a complete list, along with details such as use case, accent, and descriptions, can be found on the ElevenLabs voices page.

5. Multilingual speech generation

ElevenLabs offers two key models: eleven_multilingual_v2, capable of generating speech in 29 languages, and eleven_monolingual_v1, which is optimized specifically for English speech. Here’s how we can use the multilingual model to produce audio containing several languages in a single clip:

audio = client.generate(
   text="Hello! Hola! Hallo 你好! नमस्ते! Bonjour! こんにちは! مرحبا! 안녕하세요! Ciao!",
   voice="Arnold",
   model="eleven_multilingual_v2"
)

Streaming Speech Generation

While the .generate method we discussed processes and returns the entire speech output once all the text has been converted, the ElevenLabs API also offers a powerful streaming feature. This is particularly useful for applications requiring real-time audio generation, as it allows audio to be played back almost immediately while the rest of the text is still being processed.

NOTE: For streaming audio, we need to have the mpv media player installed. On macOS, you can install it with the command brew install mpv; for Linux and Windows, installers are available from the mpv homepage.

How streaming works

To utilize the streaming feature, we need to set the stream parameter to True in the .generate method. This signals the API to begin delivering the audio in chunks as soon as they are ready:

audio_stream = client.generate(
   text="Welcome... I am speaking to you in real-time. Let’s get started!",
   stream=True
)
stream(audio_stream)
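If mpv is not available, the chunks can still be consumed directly in Python, for instance, written to a file as they arrive. This sketch assumes the stream yields raw byte chunks, which is how the client delivers streamed audio:

```python
def stream_to_file(audio_stream, path):
    """Write streamed audio chunks to a file as they arrive."""
    with open(path, "wb") as f:
        for chunk in audio_stream:
            if isinstance(chunk, bytes):
                f.write(chunk)

# Example with dummy chunks standing in for real MP3 data:
stream_to_file([b"ID3", b"\x00\x01"], "streamed_output.mp3")
```

Calling stream_to_file(audio_stream, "output.mp3") with a real stream would persist the generated speech as it is produced, rather than waiting for the full clip.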

Streaming with dynamic text input

Streaming is not limited to static text. It can also handle dynamic input, where the text chunks are fed into the API as they become available. This is especially useful for interactive applications, such as live broadcasting or creating responsive AI-driven dialogues. Here’s how we can stream dynamic text:

def text_stream():
   yield "Hi! I'm Brian "
   yield "I'm an artificial voice made by ElevenLabs "


audio_stream = client.generate(
   text=text_stream(),
   voice="Brian",
   model="eleven_monolingual_v1",
   stream=True
)


stream(audio_stream)

In this setup, text_stream() acts as a generator function that yields text snippets to the API. Each snippet is processed immediately, which helps to maintain a natural flow of speech without awkward pauses or delays. This feature mirrors the conversational abilities demonstrated by advanced voice assistants and real-time translation devices.
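In practice, the snippets often come from a larger source, such as a document or an LLM response. One simple approach is a generator that splits the text into sentences; the splitting regex below is our own illustrative choice, not part of the ElevenLabs API:

```python
import re

def sentence_stream(document):
    """Yield one sentence at a time so synthesis can start
    before the whole document has been processed."""
    for sentence in re.split(r"(?<=[.!?])\s+", document.strip()):
        if sentence:
            yield sentence + " "

chunks = list(sentence_stream("Hello there. How are you today? Great!"))
# chunks -> ["Hello there. ", "How are you today? ", "Great! "]
```

The resulting generator can be passed to client.generate as the text argument, exactly like text_stream() in the example above.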

Speech-to-Speech Generation

Another powerful feature of the ElevenLabs platform is STS synthesis, which is essentially voice cloning.

The workflow is similar to what we’ve discussed so far for TTS, with the main difference being the input type. Instead of text, we provide audio files of the voice we want to clone.

For instant voice cloning, ElevenLabs suggests using around 60 seconds of audio free from background noise and effects; for professional voice cloning, a minimum of 30 minutes of clean audio is recommended.

Adding a description is not necessary but can be useful for organizing and distinguishing between multiple projects or voice models.

Here’s an example of how we can clone a voice.

voice = client.clone(
   name="Emily",
   description="A young British female voice with a clear, melodic tone, ideal for storytelling or educational content", 
   files=["./sample_1.mp3", "./sample_2.mp3", "./sample_3.mp3"],
)
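A missing or misspelled sample path will cause the request to fail, so it can be worth verifying the files before uploading. A small stdlib check, using the file names from the example above:

```python
from pathlib import Path

def find_missing(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not Path(p).is_file()]

samples = ["./sample_1.mp3", "./sample_2.mp3", "./sample_3.mp3"]
missing = find_missing(samples)
if missing:
    print("Missing sample files:", missing)
```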

Applications of the ElevenLabs API in Python

The ElevenLabs API offers a range of powerful capabilities in speech synthesis that can transform the way organizations and individuals interact with their audience. By converting text to speech or cloning voices through speech-to-speech, this technology provides innovative solutions for various fields.

Here are three key applications that demonstrate the potential of this tool:

Interactive voice response (IVR) systems

  • Multi-language support: Organizations operating in multilingual markets can use the system to offer customer support in various languages without needing multilingual staff.
  • Customer service: Businesses can implement the API in their customer service operations to provide more human-like interactions in automated phone systems, improving customer experience with personalized voice responses.

Accessibility features in digital content

  • Enhanced reading tools: The API can be used to create audiobooks from written material, making literature more accessible to people with visual impairments or reading disabilities.
  • Voice navigation: Integration into websites and apps for voice-guided navigation can help users who need auditory assistance, improving the usability of digital platforms.

Content creation

  • Automated voiceovers for videos: Producers of digital content, such as YouTube creators and filmmakers, can leverage the technology to generate natural-sounding voiceovers in multiple languages, significantly reducing production costs and time.
  • Educational tutorials and e-learning modules: With TTS and STS, we can voice educational content, making learning more interactive and accessible, especially in remote learning environments.

Conclusion

The ElevenLabs API is a robust and versatile tool for anyone looking to incorporate advanced speech synthesis into their applications.

Whether for accessibility, content creation, or enhancing customer interactions, this API provides an array of solutions to meet diverse needs.

With the option of both text-to-speech and speech-to-speech synthesis, it facilitates the creation of more inclusive and engaging user experiences.

If you’re excited about the possibilities of speech technology and want to deepen your expertise, consider taking the Spoken Language Processing in Python course on DataCamp. This comprehensive course covers everything from the basics of audio preprocessing and manipulation to converting speech-to-text and analyzing the transcribed data. By enrolling, you’ll gain hands-on experience with real-world applications, equipping you with the skills necessary to build sophisticated speech-enabled systems and create innovative solutions in this dynamic field.

FAQs

Are there any usage costs associated with the ElevenLabs API?

ElevenLabs offers different pricing tiers, including a free tier for basic usage, which is great for experimenting and small projects. For extensive use, particularly in commercial applications, there are paid plans that provide higher usage limits and additional features.

What are the limitations of the ElevenLabs API?

The main limitations include the dependency on the quality of input data (text or audio), the need for internet connectivity for API calls, and the nuances of synthesized speech, which may not perfectly capture the emotional inflections of human speech.

Can the ElevenLabs API handle different accents and dialects?

Yes, the ElevenLabs API supports a range of accents and dialects within its TTS and STS features, making it versatile for global applications.

Can I use the ElevenLabs API for commercial purposes?

Yes, the ElevenLabs API is designed for both personal and commercial use. It can be integrated into products and services as long as you comply with the terms of service and any relevant licensing agreements.

How can I troubleshoot issues with the ElevenLabs API?

ElevenLabs provides comprehensive documentation and support. For troubleshooting, check their official documentation and FAQ for common issues. If further help is needed, their support team can be contacted through their website.


Author
Stanislav Karzhev

Recent MSc graduate who specialises in Artificial Intelligence
