Until March 2026, Mistral had built half a voice pipeline. Its Voxtral Transcribe models handled speech-to-text, but there was no way to go the other direction. If you wanted your application to speak, you had to reach for a different provider, typically ElevenLabs for expressive voice cloning, OpenAI's TTS-1 for a minimal integration, or Google Cloud's Neural2 if you were already in that ecosystem.
Voxtral TTS fills that gap. It is Mistral's first text-to-speech model, a 4.1-billion-parameter system that generates speech in nine languages from a short audio reference. The model weights are available on Hugging Face for self-hosting, and the cloud API is live at $0.016 per 1,000 characters.
One naming note: the earlier Voxtral models from July 2025 are speech-to-text models. Voxtral TTS, released March 26, 2026, goes the other direction.
What Is Voxtral TTS?
Voxtral TTS is a text-to-speech model that converts written text into spoken audio in nine languages: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. You can use one of the 20 built-in preset voices, or supply a reference audio clip to clone a specific speaker's voice. The recommended clip length for cloning is 5 to 25 seconds, though the model accepts as little as 3 seconds.
Mistral's main claim is a small footprint combined with what it describes as frontier-level quality. The default BF16 weights on Hugging Face are about 8 GB, and self-hosting requires a GPU with at least 16 GB of VRAM to cover the inference overhead. The research paper notes that quantized versions can bring the weights down to around 3 GB, though those are not the default release. Pierre Stock, Mistral's VP of Science, told VentureBeat the model can even run on a smartphone, though that depends on quantization and I could not verify it independently.
The model also supports what Mistral calls "Voice-as-an-instruction." Instead of relying on SSML tags or explicit emotion labels to control how the generated speech sounds, Voxtral TTS infers the tone, rhythm, and emotional delivery directly from the voice reference you provide. Give it a reference clip of someone speaking excitedly, and the generated speech tends to reflect that delivery. This is a different approach from ElevenLabs, which uses explicit emotion tags to steer generation.
Architecture overview
Voxtral TTS is a transformer-based, autoregressive, flow-matching model built on Ministral 3B. It has three components: a 3.4B-parameter transformer decoder backbone that predicts semantic tokens from text and voice input; a 390M-parameter flow-matching acoustic transformer that converts those tokens into audio representations; and a 300M-parameter neural audio codec Mistral built from scratch, operating at 12.5 Hz with 80-millisecond frames. The codec is what keeps the audio representation efficient: it operates in a tightly compressed latent space, which is why the full 4.1B-parameter model can generate high-quality audio while keeping its BF16 weights at ~8 GB.

Architecture overview of Voxtral TTS. Source: Mistral AI.
This three-stage pipeline is also what makes voice cloning possible: the codec captures speaker characteristics in the latent space, which the backbone and acoustic transformer then use to reproduce that voice on new text.
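The component sizes in the paper also add up cleanly to the headline figures, which is a useful sanity check. A quick back-of-envelope calculation, using the parameter counts from the description above and 2 bytes per BF16 parameter:

```python
# Parameter counts from the architecture description (billions)
backbone = 3.4   # transformer decoder backbone
acoustic = 0.39  # flow-matching acoustic transformer
codec = 0.30     # neural audio codec

total_params = backbone + acoustic + codec  # ~4.09B, the "4.1B" headline
bf16_gb = total_params * 2                  # BF16 stores 2 bytes per parameter

print(f"Total parameters: {total_params:.2f}B")
print(f"BF16 weights: ~{bf16_gb:.1f} GB")   # ~8.2 GB, matching the ~8 GB download
```

The ~8 GB of weights plus activation and KV-cache overhead during inference is also why the self-hosting guidance lands at 16 GB of VRAM rather than 8.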
Zero-shot voice cloning
With voice cloning, you provide a short reference clip and the model generates speech that captures the speaker's accent, intonation, and rhythm, including natural pauses and pacing.
What surprised me in the research is the cross-lingual capability. If you provide a French voice reference and type your prompt in German, the model tends to generate German speech that sounds similar to the French speaker, picking up much of their accent and vocal characteristics. This is not something the model was explicitly trained to do. It is emergent behavior, and it could be useful for speech-to-speech translation where keeping the original speaker's voice across languages matters.
One practical detail: custom voice cloning requires the Mistral API. The open-weight release is limited to those same 20 preset voices. If you want to clone a specific voice on the self-hosted version, you need the API's voice creation endpoint.
Voxtral TTS Benchmarks
All benchmark data here comes from Mistral's own internal evaluations. The model is recent enough that, as of this writing, no independent third-party benchmarks have been published. The Artificial Analysis Speech Arena Leaderboard, an independent TTS ranking, has not yet added Voxtral TTS.
Mistral used human preference evaluations rather than automated metrics like Mean Opinion Score (MOS), arguing in the research paper that automated scores don't reliably capture naturalness across languages and cultures. The tests were blind listening comparisons run by native speaker annotators across all nine supported languages.
In head-to-head tests against ElevenLabs Flash v2.5 using each model's built-in flagship voices, Voxtral TTS was preferred by human annotators in 58.3% of comparisons. In zero-shot voice cloning tests, where both models received a short reference clip and generated speech from the same text, that rate rose to 68.4%. The gap was widest in Hindi (roughly 80% preference) and Spanish (roughly 88% preference). Dutch was a weak spot, though, at 49.4%, meaning ElevenLabs Flash v2.5 held the edge in that language.

Human preference results from Mistral evaluations. Source: Mistral AI.
Mistral also ran a separate set of emotion-steering tests, this time against ElevenLabs v3 and Gemini 2.5 Flash TTS rather than Flash v2.5. Against ElevenLabs v3, the two models were roughly even on explicit steering, with Voxtral holding a slight edge on implicit steering. Against Gemini 2.5 Flash TTS, Gemini came out ahead at about 65%, the one matchup Voxtral clearly lost. These numbers, like all in this section, are from Mistral's own paper.
On latency, Mistral reports 70 milliseconds on a single NVIDIA H200. End-to-end time-to-first-audio from the API is about 0.8 seconds with PCM and roughly 1.5 to 2 seconds with MP3.
Testing Voxtral TTS: Hands-On Examples
I focused on three scenarios. First, whether a preset voice holds up for general narration, specifically whether naturalness breaks down where TTS systems typically fail. Second, how well voice cloning captures a real speaker from a short clip, and how much clip length actually matters. Third, how much the output format affects latency in practice.
Setting up
First, install the official Python SDK and set your API key. You will need a Mistral account with billing enabled.
pip install mistralai
Set your API key as an environment variable:
export MISTRAL_API_KEY="your-api-key-here"
Example 1: Basic speech generation
The preset voices cover American English, British English, and French dialects and are a reasonable starting point for general narration where speaker identity does not matter.

Mistral Studio playground for Voxtral TTS. Image by Author.
Here is the simplest version, using a preset voice:
import base64
import os
from pathlib import Path

from mistralai import Mistral

client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))

# Generate speech from text
response = client.audio.speech.complete(
    model="voxtral-mini-tts-2603",
    input="Welcome to Voxtral TTS. This is a basic speech generation test.",
    voice_id="your-voice-id",  # Use a voice ID from your account
    response_format="mp3",
)

# Save the audio file
Path("basic_output.mp3").write_bytes(base64.b64decode(response.audio_data))
print("Saved to basic_output.mp3")
A couple of things to note. The method is .complete(), not .create() as you might expect if you are coming from OpenAI's TTS API. The response returns base64-encoded audio in response.audio_data, so you need to decode it before writing to disk.
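Since every speech response comes back base64-encoded, it is worth wrapping the decode-and-save step once and reusing it across examples. The `save_audio` helper below is my own, not part of the SDK:

```python
import base64
from pathlib import Path

def save_audio(audio_b64: str, path: str) -> int:
    """Decode base64-encoded audio from the API and write it to disk.

    Returns the number of bytes written, which is handy for spotting
    empty or truncated responses early.
    """
    audio_bytes = base64.b64decode(audio_b64)
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)

# Usage with a speech response object:
# n_bytes = save_audio(response.audio_data, "output.mp3")
```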
The first thing I checked was sentence-ending rhythm. Many TTS models drop mechanically at the final period, a flat cadence that immediately marks something as synthetic. That held up here. What surprised me was the comma pauses mid-sentence: cheaper systems tend to treat them as hard stops, but the pacing stayed continuous. The pronunciation was accurate throughout, including on the word "Voxtral" itself, which I expected to be a stumble.
Example 2: Creating a custom voice
When speaker identity matters, the API lets you create a reusable voice profile from a short reference clip.
import base64
import os
from pathlib import Path

from mistralai import Mistral

client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))

# Encode your reference audio as base64
sample_audio_b64 = base64.b64encode(
    Path("reference_voice.mp3").read_bytes()
).decode()

# Create a reusable voice profile
voice = client.audio.voices.create(
    name="my-custom-voice",
    sample_audio=sample_audio_b64,
    sample_filename="reference_voice.mp3",
    languages=["en"],
    gender="male",
)
print(f"Created voice with ID: {voice.id}")

# Generate speech using the cloned voice
response = client.audio.speech.complete(
    model="voxtral-mini-tts-2603",
    input="This sentence was generated using a cloned voice from just a few seconds of reference audio.",
    voice_id=voice.id,
    response_format="mp3",
)
Path("cloned_output.mp3").write_bytes(base64.b64decode(response.audio_data))
print("Saved to cloned_output.mp3")

Terminal output from voice cloning script. Image by Author.
My first attempt used a clip that was about three seconds long. The output sounded plausible but generic: right vocal range, but missing the specific inflections that make a voice identifiable. It could have been anyone. When I switched to an eight-second clip, the difference was clear: the cloned output captured the speaker's accent, rhythm, and the slight rise at the ends of questions that the shorter clip had completely flattened.
The cloned voice did not sound identical to the source, but it was recognizably in the same register with similar pacing. From my testing, 8 to 15 seconds is a good middle ground between effort and result quality.
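To test clip length systematically, I found it easiest to trim one longer recording down to different durations and clone from each. This stdlib-only helper trims a WAV file (the API also accepts MP3, but WAV is trivial to slice without extra dependencies):

```python
import wave

def trim_wav(src: str, dst: str, seconds: float) -> float:
    """Copy the first `seconds` of a WAV file to a new file.

    Returns the duration actually written, which will be shorter
    than requested if the source clip runs out first.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames_wanted = int(seconds * reader.getframerate())
        frames = reader.readframes(frames_wanted)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(frames)
    written_frames = len(frames) // (params.nchannels * params.sampwidth)
    return written_frames / params.framerate

# Build 3s, 8s, and 15s reference clips from one recording:
# for sec in (3, 8, 15):
#     trim_wav("reference_voice.wav", f"ref_{sec}s.wav", sec)
```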
Example 3: Format comparison for latency
The output format affects how quickly the first audio chunk arrives. I tested two formats.
import os
import time

from mistralai import Mistral

client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))

text = "This is a latency test comparing PCM and MP3 output formats from Voxtral TTS."
voice_id = "your-voice-id"

def time_to_first_chunk(response_format):
    start = time.time()
    stream = client.audio.speech.complete(
        model="voxtral-mini-tts-2603",
        input=text,
        voice_id=voice_id,
        response_format=response_format,
        stream=True,
    )
    for _ in stream:
        return time.time() - start  # return on first chunk

pcm_time = time_to_first_chunk("pcm")
mp3_time = time_to_first_chunk("mp3")
print(f"PCM latency: {pcm_time:.2f}s")
print(f"MP3 latency: {mp3_time:.2f}s")

Latency comparison across two output formats. Image by Author.
I ran the MP3 test first, since that is the natural default for most audio workflows. The result was functional at around 1.5 to 2 seconds, but for a voice agent, that is a noticeable pause before any audio starts playing. Mistral's documentation cites up to 3 seconds for MP3; I did not see that in practice, though network conditions will vary. PCM came in around 0.6 to 0.9 seconds, consistent with Mistral's reported ~0.8 second figure.
The tradeoff is that PCM is raw, uncompressed audio, so your application needs to handle or convert it before standard playback. If you are saving to a file or feeding a standard audio player, MP3 is simpler. If you control the audio stack directly, as in a voice agent pipeline, PCM is the practical choice on latency.
The API supports five output formats: MP3, WAV, PCM (raw float32), FLAC, and Opus. The model generates up to two minutes of audio in a single pass. For longer content, the API handles it automatically using what Mistral calls "smart interleaving": it splits the text into chunks, synthesizes each one, and joins them without audible gaps.
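If you take the PCM route, the raw float32 samples still need converting before most players will touch them. A minimal stdlib conversion to a standard 16-bit WAV looks like this; note that the 24 kHz mono default is my assumption, so check the API docs for the actual output sample rate and channel count:

```python
import struct
import wave

def pcm_f32_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 24000) -> None:
    """Convert raw float32 PCM (response_format="pcm") to a 16-bit WAV file.

    The 24 kHz mono default is an assumption; verify the real output
    sample rate in the API documentation.
    """
    n = len(pcm_bytes) // 4
    floats = struct.unpack(f"<{n}f", pcm_bytes[: n * 4])
    # Clamp to [-1.0, 1.0] and scale to the signed 16-bit range
    ints = [max(-32768, min(32767, int(x * 32767))) for x in floats]
    with wave.open(path, "wb") as w:
        w.setnchannels(1)          # assuming mono output
        w.setsampwidth(2)          # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{n}h", *ints))
```

In a real voice agent you would feed the float32 samples straight to the audio device instead, which is exactly why PCM wins on latency: there is no encode/decode step at all.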
I also put together a short Streamlit app that combines all three examples into one interface.
All the code from this tutorial is available in this GitHub repository.
Voxtral TTS Pricing
Voxtral TTS is $16 per million characters through the cloud API. A free tier is available for testing, and the open weights on Hugging Face are free for non-commercial use under CC BY-NC 4.0.
Here is how that compares to the main alternatives as of early 2026.
| Provider | Model | Price per 1M Characters | Open Weights |
| --- | --- | --- | --- |
| Mistral | Voxtral TTS | $16 | Yes (non-commercial) |
| OpenAI | TTS-1 | $15 | No |
| OpenAI | TTS-1-HD | $30 | No |
| Google Cloud | Neural2 | $16 | No |
| Google Cloud | Chirp 3 HD | $30 | No |
| Amazon Polly | Neural | $16 | No |
| Azure Speech | Neural TTS | $15-16 | No |
| Deepgram | Aura-2 | $30 | No |
| ElevenLabs | Flash v2.5 | ~$25-110 (varies by plan) | No |
Voxtral TTS matches the mid-tier cloud providers on price, not the budget tier. Mistral has called it "a fraction of anything else on the market," but that holds mainly against ElevenLabs. Against OpenAI TTS-1, Google Neural2, and Amazon Polly, the price is essentially the same.
The open weights are the main distinction: no other provider in the table offers self-hostable weights. For non-commercial projects, that means free local deployment via vLLM-Omni. For enterprises, Mistral offers on-premise options through its Forge platform.
The tradeoff is language coverage. Nine languages versus 70-plus for ElevenLabs. If your use case goes beyond the supported list, that gap matters.
Conclusion
In this tutorial, we ran three API examples: basic speech generation, voice cloning, and output format latency. But there's much more you can do.
Whether you are building a voice agent, a multilingual narration tool, or anything where audio data needs to stay on your own infrastructure, Voxtral TTS is a practical option to evaluate. Before going to production, note that the benchmarks are self-reported, the CC BY-NC license restricts commercial self-hosting, and self-hosting currently requires vLLM-Omni. The Mistral documentation and the research paper are good starting points.
I’m a data engineer and community builder who works across data pipelines, cloud, and AI tooling while writing practical, high-impact tutorials for DataCamp and emerging developers.
FAQs
Does Voxtral TTS require GPUs?
For the cloud API, no. GPU requirements only come up if you self-host the open weights. The default BF16 weights are about 8 GB and need at least 16 GB of VRAM for inference. The research paper mentions a quantized version around 3 GB, but that is not the default Hugging Face release.
Can I use the open weights commercially?
The weights are CC BY-NC 4.0, so non-commercial only. The cloud API has no such restriction. If you need on-premise commercial deployment, Mistral offers enterprise licensing. Note that their speech-to-text models use the more permissive Apache 2.0 license, which is a different situation.
How does Voxtral TTS handle languages it does not support?
It does not throw an error. Based on reports, unsupported language input tends to produce degraded output without warning. Verify the language on your end before sending the request.
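One way to do that client-side verification is a simple allowlist check before any request goes out. This guard is my own sketch, not part of the SDK; pair it with any language-detection library (langdetect, fasttext, etc.) to get a code from raw text:

```python
# ISO 639-1 codes for the nine languages Voxtral TTS supports
SUPPORTED_LANGS = {"en", "fr", "de", "es", "nl", "pt", "it", "hi", "ar"}

def assert_supported(lang_code: str) -> str:
    """Validate a language code against Voxtral TTS's supported set,
    raising before any API call is made."""
    code = lang_code.lower()
    if code not in SUPPORTED_LANGS:
        raise ValueError(
            f"Voxtral TTS does not support '{lang_code}'; "
            f"supported: {sorted(SUPPORTED_LANGS)}"
        )
    return code

assert_supported("en")    # fine
# assert_supported("ja")  # raises ValueError before you pay for a bad request
```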
Can I use Voxtral TTS for real-time voice agents?
Yes. Time-to-first-audio runs roughly 0.8 seconds with PCM and 1.5 to 2 seconds with MP3, based on Mistral's figures and my own testing. That is workable for voice agents, though PCM is the better choice if response speed matters.
Does voice cloning quality vary by language?
Yes. According to Mistral's own evaluations, Hindi and Spanish produced the strongest results, while Dutch was the weak spot. That data is self-reported, so treat it as directional, but worth knowing if you are building for a specific language.

