मुख्य सामग्री पर जाएं

Multi-Token Prediction Tutorial: How To Speed Up LLMs

Run Qwen3.6 27B on an RTX 3090 and learn how Multi-Token Prediction (MTP) with llama.cpp can boost local LLM inference by almost 2x without upgrading your GPU.
14 मई 2026  · 9 मि॰ पढ़ना

What if you could make large language models run faster without upgrading your GPU, changing your machine, or switching to a smaller model?

That is what we will test in this guide using Multi-Token Prediction, or MTP. In my benchmark, the same Qwen3.6 27B model on the same RunPod RTX 3090 setup improved from 38 tokens/sec to 65 tokens/sec after enabling MTP. That is a 1.71x speedup, or around 71% higher throughput, with no visible loss in output quality.

In this guide, we will:

  • Set up a RunPod RTX 3090 machine
  • Clone and switch to the MTP branch
  • Build llama.cpp with CUDA support
  • Download the Qwen3.6 27B MTP GGUF model
  • Run the model without MTP to get a baseline speed
  • Enable MTP and test the model again
  • Compare the token generation speed with and without MTP
  • Look at TurboQuant as a possible next step for further optimization

Associate AI Engineer

Train and fine-tune the latest AI models for production, including LLMs like Llama 4. Start your journey to becoming an AI Engineer today!
Explore Track

What Is Multi-Token Prediction?

Most LLMs generate text one token at a time. The model predicts the next token, adds it to the context, and repeats the same process again. This is reliable, but it can be slow because each new token usually requires another decoding step.

Multi-Token Prediction changes this by allowing the model to look ahead and propose multiple future tokens instead of only one. These proposed tokens are then checked by the main decoding process. If the predictions are correct, the model can accept several tokens in one go. If one token is wrong, the model falls back to the normal path from that point.

In practice, MTP works like a built-in drafting mechanism. The model drafts a few likely next tokens, verifies them, and keeps the valid ones. The more draft tokens that get accepted, the fewer full decoding steps are needed, which can increase tokens per second without changing the final output quality.

In simple terms:

  • Without MTP: Generate token 1 → generate token 2 → generate token 3
  • With MTP: Draft multiple tokens → verify them → accept valid tokens together

This is why MTP can make local LLM inference feel much faster. Instead of forcing the model to move forward one tiny step at a time, it lets the model safely jump ahead whenever its draft predictions are correct. 

Without MTP vs with MTP

In tools like llama.cpp and vLLM-style implementations, this is closely related to speculative decoding, where draft tokens are accepted only when they match the verifier’s output. 

1. Set Up a RunPod RTX 3090 Machine

For this guide, I used a RunPod GPU instance with an RTX 3090. You can use another CUDA-enabled GPU, but the benchmark results in this tutorial are based on an RTX 3090 setup.

First, create a new RunPod pod and select an RTX 3090 GPU.

Before deploying the pod, edit the template settings:

  • Increase the volume disk size to 100 GB

  • Add an additional HTTP port: 8910

  • Add an environment variable called HF_TOKEN and set its value to your Hugging Face access token. 

The extra HTTP port will let you access the llama.cpp server and web UI from your browser. The Hugging Face token helps authenticate the download request and can improve model download speed, especially for large GGUF files.

RunPod configuration

After updating the template, deploy the pod. Once it is running, wait for RunPod to give you access to the JupyterLab instance. Open JupyterLab, then launch a new terminal.

Inside the terminal, install the required system packages:

apt update
apt install -y git cmake build-essential curl wget python3-pip

Starting the RunPod

2. Clone and Switch to the MTP Branch

Next, move into the workspace directory where we will install and build llama.cpp:

cd /workspace

Clone the llama.cpp repository:

git clone --depth 1 https://github.com/ggml-org/llama.cpp.git
cd llama.cpp

The MTP changes are still being tested through a dedicated llama.cpp pull request, so we fetch and switch to that branch to use the latest MTP implementation before it becomes part of the standard main build.

Fetch the MTP branch locally:

git fetch origin pull/22673/head:mtp-pr
git checkout mtp-pr

This switches your local llama.cpp build to the MTP-enabled version, which we will use for the rest of the guide.

Switching llama.cpp build to the MTP-enabled version

3. Build llama.cpp with CUDA Support

Now that you are on the MTP-enabled branch, build llama.cpp with CUDA support. This allows the model to use the RTX 3090 GPU instead of running inference on the CPU.

Run the CMake build configuration:

cmake -B build -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release

Then compile the two targets we need for this guide:

cmake --build build --target llama-cli llama-server -j

Build llama.cpp with CUDA support

This builds:

  • llama-cli for running quick command-line tests

  • llama-server for launching an OpenAI-compatible server with browser access

Once the build is complete, copy the llama-server binary into the main llama.cpp directory:

cp ./build/bin/llama-server ./llama-server

This makes it easier to run the server from the project root in the next steps.

4. Download the Qwen3.6-27B-MTP Model

Next, download the Qwen3.6 27B MTP GGUF model that we will use for testing. This is the model we will run first without MTP, and then again with MTP enabled to compare the speed difference.

First, install the Hugging Face download tools:

pip install -U "huggingface_hub[hf_xet]" hf-xet hf_transfer

Then enable faster Hugging Face downloads:

export HF_HUB_ENABLE_HF_TRANSFER=1

This helps speed up large model downloads, especially when working with GGUF files.

Now create a dedicated directory for the model:

mkdir -p /workspace/models/qwen3.6-mtp

Download the Qwen3.6 27B MTP GGUF model:

hf download froggeric/Qwen3.6-27B-MTP-GGUF \
 Qwen3.6-27B-Q4_K_M-mtp.gguf \
 --local-dir /workspace/models/qwen3.6-mtp

Downloaded Qwen3.6-27B successfully

If you’re interested in fine-tuning LLMs, check out my tutorial on fine-tuning Qwen3.6 on a medical Q&A dataset.

5. Run Qwen3.6-27B Without MTP Enabled

Now we are at the main part of the guide: testing the model speed before and after enabling MTP.

First, we will run the model without MTP. This gives us a clean baseline so we can compare the speed difference later. We are using the same model, same GPU, same context size, and same server settings. The only major change in the next step will be enabling MTP.

Move back into the llama.cpp directory:

cd /workspace/llama.cpp

Start the server without MTP:

./llama-server \
-m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf" \
--alias qwen3.6-27b-no-mtp \
--host 0.0.0.0 \
--port 8910 \
-ngl 99 \
-c 100000 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-np 1 \
-b 2048 \
-ub 512 \
-t 8 \
-fa on \
--temp 0.7 \
--top-k 20 \
--top-p 0.95 \
--repeat-penalty 1.1 \
--metrics

This starts an OpenAI-compatible llama.cpp server on port 8910.

The model may take a short time to load because the server needs to load the model weights into GPU memory. Once everything is ready, the terminal will show that the server is available on port 8910.

server available

Because we exposed this port when setting up the RunPod template, you do not need to configure anything else. Go back to your RunPod dashboard and click the link associated with port 8910. This will open the llama.cpp web UI in your browser, with the local model already loaded.

llama.cpp web UI

From there, you can start testing prompts directly in the browser, similar to how you would use a chat interface.

Chat interface

In my baseline test, the model generated responses at around 38.86 tokens/sec without MTP. Even with a more complex prompt, the speed stayed around the same range. 

For a 27B model running on an RTX 3090, this is already a usable result, especially considering the GPU is slower and has limited memory compared to newer data center cards. 

6. Run Qwen3.6-27B With MTP Enabled

Now we will run the same model again, but this time with MTP enabled.

Go back to the terminal where the server is running and stop it with:

CTRL + C

The important thing here is that we are not changing the model, GPU, quantization, or most runtime settings. We are only adding two MTP-related flags:

--spec-type mtp
--spec-draft-n-max 3

The first flag tells llama.cpp to use MTP-style speculative decoding. The second flag sets the maximum number of draft tokens to 3. This means the model can attempt to draft up to three future tokens before verification.

Now start the server again with MTP enabled:

./llama-server \
-m "/workspace/models/qwen3.6-mtp/Qwen3.6-27B-Q4_K_M-mtp.gguf" \
--alias qwen3.6-27b-mtp \
--host 0.0.0.0 \
--port 8910 \
-ngl 99 \
-c 100000 \
--spec-type mtp \
--spec-draft-n-max 3 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
-np 1 \
-b 2048 \
-ub 512 \
-t 8 \
-fa on \
--temp 0.7 \
--top-k 20 \
--top-p 0.95 \
--repeat-penalty 1.1 \
--metrics

Once the server is ready, refresh the browser page. If the page does not reconnect automatically, close it and open the port 8910 link again from your RunPod dashboard. 

Now test the model again using the same type of prompts.

Testing Qwen3.6 with MTP

With MTP enabled, the speed increased noticeably. For a simple greeting prompt, the model reached around 65–67 tokens/sec. Compared to the baseline speed of around 38.86 tokens/sec, this is a major improvement from adding only two command-line flags. 

Model output

For a more complex prompt, such as asking the model to build a simple game in Python, the speed was slightly lower but still much faster than the non-MTP baseline. In that test, the model generated at around 56–61 tokens/sec, which is still a strong result for a 27B model on an RTX 3090. 

Overall, enabling MTP improved Qwen3.6 27B from around 38 tokens/sec to 65 tokens/sec on the RunPod RTX 3090 setup. That gives a 1.71x speedup, or around 71% higher throughput, without changing the hardware or switching to a smaller model. 

7. Recommendation: Further Speed Optimization with TurboQuant

The benchmark in this guide uses the original llama.cpp MTP setup, without adding TurboQuant, custom patches, or other runtime-level optimizations. This keeps the test simple, reproducible, and focused on the speed gain from enabling MTP alone.

To push performance further, the next optimization to explore is MTP and TurboQuant together. MTP improves throughput by allowing the model to accept multiple predicted tokens, while TurboQuant helps reduce KV-cache memory pressure during inference. 

This can be especially useful for larger models, long-context prompts, and GPUs like the RTX 3090, where memory bandwidth and VRAM can become limiting factors.

This is why some r/LocalLLaMA community results report higher tokens/sec than this guide. Those setups often combine MTP with TurboQuant, patched builds, different KV-cache settings, or faster GPUs. Since this tutorial focuses on a clean MTP-only benchmark, TurboQuant should be treated as the recommended next experiment rather than part of the current setup.

Final Thoughts 

Recently, I have been following posts in the LocalLLaMA Reddit community, and it is amazing to see how far local LLM inference has come. People are now running models like Qwen3.6 27B as local coding agents, even on older GPUs with limited VRAM. Some are also running similar setups on Mac systems, and the results are genuinely impressive.

After testing MTP myself, I can see why there is so much excitement. With the same model and the same RTX 3090 setup, enabling Multi-Token Prediction improved generation speed from around 38 tokens/sec to 65 tokens/sec. That is almost a 2x speedup without upgrading the GPU or switching to a smaller model.

This guide focused on a simple and reproducible MTP setup using llama.cpp, but this feels like only the beginning. The next step is to experiment with better GGUF quantization, MTP, TurboQuant, and more tuned runtime settings to see how much further local inference speed can be pushed.

For me, the most exciting part is what this means for local coding agents. You can run powerful models on your own hardware, reduce cost per query, keep your code private, and use an AI coding assistant without depending entirely on internet-based APIs. Local LLMs are becoming faster, more practical, and much more useful than they were even a short time ago.

Multi-Token Prediction FAQs

Do I need a separate draft model for MTP?

No. With Qwen3.6-27B, MTP is built into the model itself, so no second model is required.

How much faster does MTP make the model?

In our RunPod RTX 3090 setup, enabling MTP improved generation speed from ~38 tokens/sec to ~65 tokens/sec, a 1.71x speedup, or ~71% higher throughput.

What is the difference between MTP and speculative decoding?

MTP in llama.cpp is a form of speculative decoding. Draft tokens from the model's own MTP heads are accepted only if they pass verification. The key difference from traditional speculative decoding is that no external draft model is needed.

Can I get even faster speeds beyond what MTP offers?

Yes. Combining MTP with TurboQuant, which reduces KV-cache memory pressure during inference, is the recommended next step for further speed gains, especially on memory-constrained GPUs like the RTX 3090.

विषय

Learn AI Engineering with DataCamp!

Track

डेवलपर्स के लिए एसोसिएट AI इंजीनियर

26 घंटा
एपीआई और ओपन-सोर्स लाइब्रेरी का उपयोग करके सॉफ़्टवेयर अनुप्रयोगों में AI को एकीकृत करना सीखें। आज ही AI इंजीनियर बनने की अपनी यात्रा शुरू करें!
विस्तृत जानकारी देखेंRight Arrow
कोर्स शुरू करें

Track

LangChain के साथ AI इंजीनियरिंग

21 घंटा
प्रॉम्प्ट इंजीनियरिंग से लेकर एजेंटिक सिस्टम्स तक—स्केल करने वाले AI एप्लिकेशन बनाने के लिए पूरा कौशल सेट विकसित करें, आपके साथ एक AI ट्यूटर के साथ।
और देखेंRight Arrow
संबंधित

tutorial

How to Run GLM-4.7 Locally with llama.cpp: A High-Performance Guide

Setting up llama.cpp to run the GLM-4.7 model on a single NVIDIA H100 80GB GPU, achieving up to 20 tokens per second using GPU offloading, Flash Attention, optimized context size, efficient batching, and tuned CPU threading.
Abid Ali Awan's photo

Abid Ali Awan

tutorial

How to Run Qwen3.5-27B Locally for Agentic Coding

Set up vLLM on a H100 GPU, serve Qwen3.5-27B, connect OpenCode, and test fast agentic coding with long context support.
Abid Ali Awan's photo

Abid Ali Awan

tutorial

Unsloth Guide: Optimize and Speed Up LLM Fine-Tuning

Fine-tuning the Llama 3.1 model to solve specialized algebra problems with high accuracy and detailed results using Unsloth.
Abid Ali Awan's photo

Abid Ali Awan

tutorial

Fine-Tune and Run Inference on Google's Gemma Model Using TPUs for Enhanced Speed and Performance

Learn to infer and fine-tune LLMs with TPUs and implement model parallelism for distributed training on 8 TPU devices.
Abid Ali Awan's photo

Abid Ali Awan

tutorial

Unsloth Studio Fine-Tuning LLMs Guide

Learn how to fine-tune Qwen3.5-9B on the Latex OCR dataset using QLoRA in Unsloth Studio, then test the model in chat and export it in GGUF format for local inference.
Abid Ali Awan's photo

Abid Ali Awan

llama 4 fine tuning

tutorial

Fine-Tuning Llama 4: A Guide With Demo Project

Learn how to fine-tune the Llama 4 Scout Instruct model on a medical reasoning dataset using RunPod GPUs.
Abid Ali Awan's photo

Abid Ali Awan

और देखेंऔर देखें