
NVIDIA PersonaPlex Tutorial: Run a Natural, Real-Time Local Voice Assistant

What if real-time AI voice conversations felt natural, interruptible, and genuinely human? Learn how to run NVIDIA PersonaPlex locally and experience true full-duplex conversational AI.
Feb 4, 2026 · 8 min read

If you have ever used ChatGPT with voice mode, you already know how impressive it feels. Talking to an AI that listens, understands, and responds almost like a real person is surprisingly natural. But there is still a gap: small delays, limited interruption handling, and the sense that you are speaking to a system rather than a person.

Now imagine something better: a real-time voice assistant that responds instantly, lets you interrupt naturally, adapts mid-sentence, and feels far closer to a real human conversation. No noticeable latency. No awkward pauses. No cloud dependency. Just a smooth, local, real-time dialogue that feels alive. This is exactly where PersonaPlex comes in.

PersonaPlex is a real-time, local voice interaction system developed by NVIDIA. It is designed to push voice-based AI beyond simple speech-to-text and text-to-speech pipelines. Instead, it enables low-latency, streaming conversations where the assistant can speak, listen, and adapt continuously, much like a real person would in a face-to-face conversation.

In this tutorial, I will walk you through PersonaPlex step by step. We will start by understanding what PersonaPlex is and why it feels fundamentally different from typical voice assistants. Then, we will set up the local environment, install PersonaPlex from source, and launch the PersonaPlex WebUI server. After that, we will interact with the system through the web interface and finally test real-time voice conversations using a Python script.

I recommend checking out the Spoken Language Processing in Python course to learn about some of the fundamentals behind PersonaPlex. 

What is PersonaPlex?

PersonaPlex is a new conversational AI system that makes voice interactions feel genuinely natural while still letting you fully customize the voice and persona. 

Instead of sounding like a typical assistant with pauses and rigid turn-taking, it enables smooth, real-time conversations where interruptions, timing, and tone feel human.

At its core, PersonaPlex uses a full-duplex architecture, meaning it can listen and speak at the same time. 

Rather than chaining together separate models for speech recognition, language understanding, and speech generation, it relies on a single unified model that updates continuously as the user speaks. 

Voice prompts define how the assistant sounds, while text prompts define who it is and how it should behave. This combination allows PersonaPlex to maintain a consistent persona while responding instantly and naturally.
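To make this split concrete, here is what the two prompt types look like in practice. The values below are taken from the offline example at the end of this tutorial and are purely illustrative:

# Illustrative only: the two inputs that define a PersonaPlex persona.
persona = {
    # how the assistant sounds: a pre-packaged voice embedding file
    "voice_prompt": "NATF2.pt",
    # who the assistant is and how it should behave
    "text_prompt": "You are a wise and friendly teacher. "
                   "Answer questions in a clear, engaging way.",
}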

PersonaPlex Architecture | Source: NVIDIA PersonaPlex

PersonaPlex is evaluated using FullDuplexBench and an extended customer service benchmark called ServiceDuplexBench. 

It outperforms other open-source and commercial systems in conversational dynamics, low-latency responses, interruption handling, and task adherence across both assistant and customer service scenarios.

PersonaPlex benchmark results for Conversation Dynamics (higher is better) | Source: NVIDIA PersonaPlex

In the video demo below, you can clearly see the person having a fluid conversation with the model, exchanging banter and jokes in real time.

Setting Up the Environment 

Before we start, visit the Hugging Face model page for nvidia/personaplex-7b-v1 and accept the model usage conditions. 

PersonaPlex is a gated model, so you will need a Hugging Face API token. Generate the token from your Hugging Face account and keep it handy, as we will add it to the environment variables later to allow model access.

PersonaPlex agreement on Hugging Face: nvidia/personaplex-7b-v1
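Once HF_TOKEN is set in your environment (we will add it to the pod configuration in a moment), you can optionally verify that the token is valid with the huggingface_hub Python package; install it with pip first if it is not already available:

import os
from huggingface_hub import whoami

# HF_TOKEN is the environment variable we add to the pod template below.
info = whoami(token=os.environ["HF_TOKEN"])
print("Authenticated to Hugging Face as:", info["name"])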

Next, go to RunPod and launch a new A40 GPU pod. Select the latest PyTorch image, then click Edit to customize the environment. 

Configuring a RunPod A40 machine

Increase the container disk size to 50 GB, since the model itself is around 20 GB, and additional dependencies will also be downloaded. In the exposed HTTP ports section, add port 8998. Under environment variables, add HF_TOKEN and paste your Hugging Face API token.

Setting up the environment for PersonaPlex in RunPod

Once everything is configured, save the overrides and deploy the pod.

Deploy the pod

After the pod starts, you will see a link to the JupyterLab instance. Open it, then launch a terminal. You can access the machine via SSH or the web terminal, but using the Jupyter terminal is the simplest option.

JupyterLab instance on RunPod

In the terminal, install the Opus audio codec development library, which is required for audio processing:

sudo apt update
sudo apt install libopus-dev

Installing the Opus audio codec in RunPod
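Before moving on, you can optionally confirm that the shared library is now visible to the system. The snippet below simply looks the library up by name, so the exact filename it reports may vary by distribution:

import ctypes.util

# libopus-dev installs the shared library that the audio pipeline links against.
lib = ctypes.util.find_library("opus")
if lib:
    print(f"Found Opus library: {lib}")
else:
    print("Opus library not found; re-run the apt install step above")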

Installing PersonaPlex From Source

The PersonaPlex codebase is open-sourced by NVIDIA and available on GitHub (NVIDIA/personaplex: PersonaPlex code), making it easy to explore, customize, and run locally. 

Installing from source gives you full control over the setup and ensures compatibility with the latest updates in the repository.

NVIDIA/personaplex GitHub Repository

To get started, clone the PersonaPlex repository:

git clone https://github.com/NVIDIA/personaplex.git
cd personaplex

Cloning the PersonaPlex repository

Then install Moshi, the core full-duplex speech model that PersonaPlex is built on. 

Moshi is responsible for real-time listening and speaking, enabling PersonaPlex to handle interruptions, pauses, and natural conversational timing without relying on a traditional ASR → LLM → TTS pipeline.

Installing it from source ensures that all audio, streaming, and conversational components are correctly set up for local execution.

pip install moshi/.

Once Moshi is installed, your environment is fully prepared to start the PersonaPlex server and begin interacting with the model in real time.
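As a quick sanity check, you can import the package from a Python session or notebook cell and print where it was installed:

import moshi

# If the source install succeeded, this prints the installed package location.
print("moshi package loaded from:", moshi.__file__)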

Starting the PersonaPlex WebUI Server

With the environment and dependencies in place, you can now start the PersonaPlex WebUI server. In the terminal, run the following command to launch the Moshi server that powers PersonaPlex:

python -m moshi.server --host 0.0.0.0 --port 8998

The first time you run this command, the server will automatically download the PersonaPlex model and other required files. This step can take a few minutes, depending on your network speed, as the model is fairly large.

Downloading the PersonaPlex model files

Once the download is complete, the server will start listening on port 8998. 

Running the Moshi server
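If you want to confirm from another terminal or notebook cell that the server is accepting connections, a minimal TCP probe against the same host and port is enough. This is just a connectivity check, not part of PersonaPlex itself:

import socket

# Open a TCP connection to the Moshi server started above (same host and port).
with socket.create_connection(("127.0.0.1", 8998), timeout=5):
    print("PersonaPlex server is accepting connections on port 8998")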

To access the WebUI, go back to your RunPod dashboard. 

In the Connect section, find the exposed port 8998 and click the link provided. This will open the PersonaPlex WebUI in your browser, where you can start interacting with the model through real-time voice conversations.

Accessing the PersonaPlex WebUI

Testing PersonaPlex Using the WebUI

The PersonaPlex WebUI comes with several example prompts to help you get started, and you can also create your own custom prompt to define the assistant’s role and behavior. 

You can select a voice before connecting, which determines how the persona sounds during the conversation. 

For this tutorial, it is best to start with the default settings to get a feel for the system.

PersonaPlex WebUI

After clicking the Connect button, you will be taken to the interactive session. Here, you can speak directly to the model, hear its responses in real time, and see a live transcript of the conversation as it happens. 

The experience feels fluid, with natural timing and the ability to interrupt or respond quickly.

During my testing, the interaction felt surprisingly natural and engaging. 

I did notice some occasional stuttering, which was likely due to browser or system load rather than the model itself. Aside from that, the conversation felt very close to speaking with a real person.

Interacting with the PersonaPlex WebUI

Once you are comfortable, try experimenting with different prompts and voices. PersonaPlex supports a wide range of pre-packaged voice embeddings, including more natural conversational voices and more expressive variants:

  • Natural (female): NATF0, NATF1, NATF2, NATF3
  • Natural (male): NATM0, NATM1, NATM2, NATM3
  • Variety (female): VARF0, VARF1, VARF2, VARF3, VARF4
  • Variety (male): VARM0, VARM1, VARM2, VARM3, VARM4

PersonaPlex WebUI voice options

Switching voices and prompts is a great way to explore how PersonaPlex maintains personality, tone, and conversational flow across different roles and speaking styles.

Running PersonaPlex Offline With Python

The WebUI is great for real-time conversations, but you can also run PersonaPlex offline from Python. This is useful when you want repeatable outputs, want to test different prompts quickly, or want to generate audio files you can reuse in demos and experiments. 

The offline flow is simple: you provide (1) a voice prompt, (2) a text prompt, and (3) an input WAV file. PersonaPlex then generates an output audio response and a JSON file containing the model’s transcript.

Start a new notebook in Jupyter and change the directory to the cloned repository:

%cd personaplex

Run the following in a new cell. This calls the offline runner and writes both the generated audio and the transcript to disk:

%%capture
!python -m moshi.offline \
  --voice-prompt "NATF2.pt" \
  --text-prompt "You are a wise and friendly teacher. Answer questions in a clear, engaging way." \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "out_teacher.wav" \
  --output-text "out_teacher.json"

What each flag does:

  • --voice-prompt selects the voice embedding (here, a natural female voice).
  • --text-prompt defines the role and behavior of the assistant.
  • --input-wav is the recorded user audio the model will respond to.
  • --seed makes the output more reproducible between runs.
  • --output-wav is the generated spoken response.
  • --output-text saves the transcript output as JSON.

Once the command finishes, you can play the response audio directly:

from IPython.display import Audio

Audio("out_teacher.wav")

You should hear a clear, natural response in the selected voice, matching the persona defined in your text prompt.

Listening to the generated audio in the Jupyter notebook
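If you want to hear the same persona with a different voice, re-run the offline command with one of the other pre-packaged embeddings listed earlier, for example VARM1. I am assuming here that the voice file follows the same .pt naming convention as NATF2.pt above, so adjust the name if your checkout stores the voices differently:

%%capture
!python -m moshi.offline \
  --voice-prompt "VARM1.pt" \
  --text-prompt "You are a wise and friendly teacher. Answer questions in a clear, engaging way." \
  --input-wav "assets/test/input_assistant.wav" \
  --seed 42424242 \
  --output-wav "out_teacher_varm1.wav" \
  --output-text "out_teacher_varm1.json"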

The JSON transcript can sometimes contain tokenization artifacts such as extra spaces around punctuation or split word pieces. The helper below loads the JSON and cleans it into readable text.

import json
import re

with open("out_teacher.json") as f:
    data = json.load(f)

def detokenize(tokens):
    # 1) drop padding-like tokens
    drop = {"PAD", "EPAD", "<pad>", "</s>", "<s>"}
    toks = [t for t in tokens if t not in drop]

    # 2) join with spaces first
    s = " ".join(toks)

    # 3) fix spacing around punctuation
    s = re.sub(r"\s+([.,!?;:])", r"\1", s)

    # 4) fix common split contractions: "it ' s" -> "it's"
    s = s.replace(" ' s", "'s").replace(" n't", "n't").replace(" 're", "'re").replace(" 'm", "'m").replace(" 've", "'ve").replace(" 'd", "'d")

    # 5) fix stray spaces around apostrophes
    s = re.sub(r"\s+'\s+", "'", s)

    # 6) merge a stray single letter into the following short fragment
    #    (e.g. "a nd" -> "and"); this heuristic is aggressive and may also
    #    merge legitimate short words, while longer splits such as "fl uff"
    #    remain untouched (see the note below)
    s = re.sub(r"\b([A-Za-z])\s+([A-Za-z]{1,3})\b", r"\1\2", s)

    # 7) collapse multiple spaces
    s = re.sub(r"\s{2,}", " ", s).strip()

    return s

# `data` should be the list of text tokens from the JSON file; if your JSON
# wraps the tokens in a dict, index into the appropriate field first.
clean_text = detokenize(data)
clean_text

You should now see a readable transcript string that matches the generated audio. If your output still contains odd splits (for example, “fl uff” or “afor k”), that is normal for some runs and can be cleaned further with additional rules, but the main content should already be understandable.

"Hey, let me know if you have any questions. 
Hmm, first rinse the rice a couple of times until the water runs clear, that cuts down on starch, then use apot with a tight fitting lid, bring to a boil, 
give it a quick stir, then turn the heat down low and cover, let it s immer without lifting it, and when it'done fl uff it with afor k, that usually 
You could to ss the hot rice with a nice handful of chopped fresh herbs like basil or par sley, or you could sprinkle a little g rated cheese, 
a squeeze of lemon or lime, adr izzle of olive oil, some chopped fresh herbs, or even some to ast ed nuts, that adds color and flavor."

Final Thoughts

Testing PersonaPlex genuinely surprised me. From the very first interaction, it felt less like experimenting with a model and more like having a real conversation. Being able to interrupt naturally, get instant responses, and maintain a consistent personality throughout made it feel far ahead of most voice systems I have tried. Running everything locally made the experience even more impressive, with no noticeable delay or loss of control.

There are still some drawbacks. I noticed occasional stuttering, and once the conversation moves on, it does not always handle returning to an earlier topic smoothly. 

It also does not fully understand non-English accents yet, which can lead to mispronounced names or imperfect transcriptions. These feel like edge cases rather than fundamental issues, and I am confident they will improve quickly.

I am sure that in the future these issues will be resolved, and we will get an even better, fully local and real-time conversational AI that combines the reasoning of GPT-5.2 with the voice quality of ElevenLabs.

PersonaPlex FAQs

What hardware do I need to run PersonaPlex locally?

NVIDIA recommends a GPU with at least 24 GB of VRAM (like an A10G, A40, or RTX 3090/4090) to run the 7B model smoothly with low latency. You also need a Linux environment with CUDA support. While it is possible to offload layers to the CPU, this will significantly degrade the real-time performance that makes the model special.

Can PersonaPlex speak languages other than English?

Currently, the v1 release is English-only. The architecture supports multiple languages, and NVIDIA has indicated that support for other languages (such as Spanish) is on the roadmap for future updates, but for now, it is optimized for English conversation.

Can I use this for commercial applications?

Yes. The model weights are released under the NVIDIA Open Model License, which generally permits commercial use. However, you should review the specific license agreement on the Hugging Face model card to ensure your use case (e.g., hosting a paid service) complies with their terms.

How is this different from using faster-whisper and a fast LLM?

Traditional systems are "half-duplex"—they wait for you to finish speaking, transcribe the audio, think, and then speak. PersonaPlex is "full-duplex," meaning it processes audio and generates tokens continuously. This allows it to listen while speaking, handle interruptions naturally, and generate backchannels ("uh-huh", "right") without the awkward latency of turn-based systems.
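To make the contrast concrete, here is a sketch of that traditional turn-based loop, written with hypothetical helper functions purely for illustration. Every stage must finish before the next one starts, which is exactly the latency that PersonaPlex's single streaming model avoids:

# Sketch of a half-duplex voice loop; the helper functions are hypothetical.
def half_duplex_turn(record_until_silence, transcribe, generate_reply,
                     synthesize_speech, play_audio):
    audio = record_until_silence()     # 1) wait for the user to finish speaking
    text = transcribe(audio)           # 2) ASR on the complete utterance
    reply = generate_reply(text)       # 3) the LLM produces a response
    speech = synthesize_speech(reply)  # 4) TTS renders it to audio
    play_audio(speech)                 # 5) only now does the assistant speak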

Why does the model sometimes mispronounce names or split words in the transcript?

Because PersonaPlex streams audio in real-time frames (at 24 kHz), text tokens are sometimes generated across frame boundaries. This can result in artifacts like "fl uff" or "afor k" in the raw transcript. These are normal side effects of streaming tokenization and can be cleaned up with simple post-processing, such as the detokenize helper shown earlier.


Author
Abid Ali Awan

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.
