How to Speed Up Local LLMs with DFlash Speculative Decoding

Learn how to accelerate local Gemma 4 31B inference on a single RTX 4090 using DFlash speculative decoding and Flash Attention, and how to measure the gain against a baseline setup.

Jun 17, 2026 · 11 min read

Over the past few weeks, I have been seeing a lot of excitement in the r/LocalLLaMA community around speculative decoding, DFlash, better KV cache handling, and optimized llama.cpp builds. The interesting part is that people are getting major speedups without upgrading their hardware.

In this guide, we will run Gemma 4 31B IT locally on an RTX 4090 24GB using BeeLlama.cpp, a fork of llama.cpp that supports DFlash speculative decoding.

We will test the model in two ways. First, we will run it without DFlash to create a baseline. Then, we will run it with DFlash to compare the speed improvement.

What Is DFlash?

In simple terms, DFlash uses a draft model to predict several tokens ahead, while the main model verifies those tokens instead of generating everything one token at a time. When many draft tokens are accepted, generation becomes much faster while keeping the output close to the original model.

In my experiment, DFlash delivered a 3.2x to 3.8x speedup on coding prompts, with outputs that were very similar in quality to the baseline. The goal of this guide is to show the setup, run both versions, and compare the results clearly.

How DFlash Works

Standard LLM generation is slow because most models generate text one token at a time. Each token depends on the previous one, so the model has to move step by step through the response.

DFlash speeds this up using speculative decoding.

Instead of asking the main model to generate every token directly, DFlash uses a separate draft model to guess several upcoming tokens first. The main model then verifies those draft tokens together in one larger step. Draft tokens that match what the main model would have produced are accepted. At the first mismatch, the main model substitutes its own token, the remaining draft tokens are discarded, and drafting resumes from there.

A simple way to think about it:

Without DFlash: the main model writes one token at a time.
With DFlash: the draft model suggests a block of tokens, and the main model quickly checks which ones it can accept.
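
To make the accept-or-correct step concrete, here is a minimal shell sketch. It is a toy and not BeeLlama.cpp's implementation: tokens are plain words, the function name accepted_prefix is made up for illustration, and I assume simple greedy matching, where a draft token is accepted only if it equals the token the main model would have picked at that position.

```shell
# Toy greedy verification (illustrative only, not the BeeLlama.cpp code path).
# Usage: accepted_prefix "<draft tokens>" "<main model tokens>"
# Prints how many leading draft tokens match before the first mismatch.
accepted_prefix() {
  awk -v d="$1" -v t="$2" 'BEGIN {
    nd = split(d, draft, " ")
    nt = split(t, target, " ")
    n = 0
    # Walk both token lists in order and stop at the first disagreement.
    while (n < nd && n < nt && draft[n + 1] == target[n + 1]) n++
    print n
  }'
}

accepted_prefix "def add ( a , b )" "def add ( x , y )"   # prints 3
```

In this example the first three draft tokens are accepted in one step, the fourth is replaced by the main model's own token, and the rest of the draft is thrown away.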

Diagram of the DFlash speculative decoding workflow.

This is especially useful for structured tasks like programming. Code often follows predictable patterns such as imports, function definitions, indentation, loops, and common syntax. Because of this, the draft model can often guess the next tokens correctly, allowing the main model to accept more tokens in each step.
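
A rough way to see why predictability matters: if we assume, purely for illustration, that each draft token is accepted independently with probability a and that each draft block has k tokens, the expected number of tokens produced per verification step is (1 - a^(k+1)) / (1 - a). Real acceptance is neither independent nor constant, and the values of a and k below are not measurements from this setup. The function name expected_tokens is also just a placeholder.

```shell
# Expected tokens per verification step under a simplified model:
# every draft token is accepted independently with probability a,
# and each draft block contains k tokens. Illustrative assumption only.
expected_tokens() {
  awk -v a="$1" -v k="$2" 'BEGIN {
    # If every draft token is accepted, we get the k drafts plus one bonus token.
    if (a >= 1) { printf "%.2f\n", k + 1; exit }
    printf "%.2f\n", (1 - a ^ (k + 1)) / (1 - a)
  }'
}

expected_tokens 0.5 4   # prints 1.94
expected_tokens 0.8 4   # prints 3.36
```

Under this toy model, moving from a 50% to an 80% per-token acceptance rate lifts the yield per step from about 1.9 to about 3.4 tokens, which is the intuition behind the larger gains on code.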

DFlash vs MTP: What Is the Difference?

DFlash and Multi-Token Prediction (MTP) both aim to solve the same problem: they help the model generate more than one token per expensive decoding step.

The difference is how they create the draft tokens.

Method	How It Works	Extra Model Needed?	Main Strength
MTP	Uses built-in multi-token prediction heads to predict future tokens	Usually no separate draft model	Simpler setup when the model already supports MTP
DFlash	Uses a separate DFlash draft model to propose larger blocks of tokens	Yes	Can achieve strong speedups on structured outputs like code

In simple terms, MTP is usually built into the model itself. It predicts multiple future tokens using internal prediction heads, so it can be easier to configure and more memory-efficient when supported.

DFlash, on the other hand, uses a separate draft model. This can make the setup slightly heavier, but it also allows more aggressive drafting. That is why DFlash can deliver large speedups on structured tasks where the next tokens are easier to predict.

1. Setting Up the Environment

I highly recommend running this setup locally if you have an RTX 3090 or RTX 4090 GPU. Otherwise, you can rent a GPU from RunPod, Vast.ai, or any other GPU provider.

For this guide, we will use a RunPod RTX 4090 pod. I started with the latest RunPod PyTorch template and made a few small changes:

Exposed port 8910 for the llama.cpp server
Increased persistent storage to 100 GB
Added my Hugging Face token to improve model download speed

With this setup, the pod costs around $0.70 per hour, depending on current RunPod pricing and availability.

Once the pod is deployed, open JupyterLab from the RunPod dashboard. Then launch a new terminal and install the basic dependencies:

apt update
apt install -y git cmake build-essential curl wget python3-pip jq
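
Before moving on, it can help to confirm that the tools used later in this guide are actually on the PATH. The sketch below is a small helper I wrote for this check; the function name missing_tools is hypothetical, and the tool list is my assumption about what the later steps need (jq is used by the benchmark script, and nvcc is needed for the CUDA build).

```shell
# Print every listed command that is NOT found on PATH.
# Prints nothing when all tools are available.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Assumed tool list for this guide; adjust it to your environment.
missing_tools git cmake curl wget jq nvcc nvidia-smi
```

If the last command prints nothing, everything is in place. Any name it prints needs to be installed before continuing.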

2. Clone BeeLlama.cpp

Next, we need to clone BeeLlama.cpp, the llama.cpp fork we will use for this setup.

BeeLlama.cpp is designed for faster local GGUF inference while keeping the familiar llama.cpp workflow. You still get the same style of tools, including llama-server, but with extra performance-focused features such as DFlash speculative decoding, adaptive draft control, and TurboQuant/TCQ KV-cache compression.

Run the following commands inside your JupyterLab terminal:

git clone https://github.com/Anbeeld/beellama.cpp.git
cd beellama.cpp

This will download the BeeLlama.cpp repository and move you into the project folder. All the build commands in the next step should be run from inside this directory.

3. Build BeeLlama.cpp with CUDA

Now we will build BeeLlama.cpp with CUDA support so it can use the RTX 4090 properly.

For this setup, we will enable CUDA, Flash Attention, native CPU optimizations, and quantized Flash Attention kernels. Since we are using an RTX 4090, we also set the CUDA architecture to 89.

cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
 -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
 -DCMAKE_CUDA_ARCHITECTURES=89 \
 -DCMAKE_BUILD_TYPE=Release

cmake --build build -j

The build can take around 20 minutes. During compilation, you might see warnings related to TurboQuant, TCQ, or DFlash CUDA declarations. In my case, these were just warnings and did not stop the build.
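
The value 89 is specific to the RTX 4090 (compute capability 8.9). If you are on a different card, such as an RTX 3090 (compute capability 8.6), the value changes. Here is a small sketch for deriving it; cap_to_arch is a helper name I made up, and I am assuming your driver is recent enough for nvidia-smi to support the compute_cap query field.

```shell
# Convert a compute capability such as "8.9" into the CMake value "89".
cap_to_arch() {
  printf '%s\n' "$1" | tr -d '. '
}

# Only query the driver when nvidia-smi is available (first GPU only).
if command -v nvidia-smi >/dev/null 2>&1; then
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
  echo "Use -DCMAKE_CUDA_ARCHITECTURES=$(cap_to_arch "$cap")"
fi
```

On an RTX 4090 this should suggest 89, and on an RTX 3090 it should suggest 86.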

Finally, copy the server binary into the main project folder so it is easier to run later:

cp ./build/bin/llama-server ./llama-server

4. Install Hugging Face CLI and Download the Models

Now we need to download two GGUF files: the main model and the DFlash draft model.

The main model is the one that produces the final output. The DFlash draft model is much smaller and is used only to predict tokens ahead of the main model. The main model still verifies every drafted token, so the draft model is there to speed up decoding rather than replace the main model.

First, install the Hugging Face CLI:

pip install -U huggingface_hub

Then create a folder to keep the model files organized:

mkdir -p models

Download the main Gemma 4 31B IT GGUF model:

hf download unsloth/gemma-4-31B-it-GGUF \
gemma-4-31B-it-Q4_K_S.gguf \
--local-dir models

Next, download the DFlash draft model:

hf download Anbeeld/gemma-4-31B-it-DFlash-GGUF \
gemma4-31b-it-dflash-Q5_K_M.gguf \
--local-dir models

The DFlash draft model is listed on Hugging Face as a dflash-draft architecture model, with the Q5_K_M file around 1.09 GB, so it is much smaller than the main 31B model. This is what makes it practical to load alongside the main model for speculative decoding.
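
Large downloads occasionally fail partway, so I like to confirm that both files exist and are not empty before starting the server. This is a minimal sketch; check_models is a hypothetical helper name, and it only checks presence and size, not file integrity.

```shell
# Print "<size in bytes> <path>" for each file.
# Returns non-zero as soon as a file is missing or empty.
check_models() {
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      return 1
    fi
    echo "$(wc -c < "$f" | tr -d ' ') $f"
  done
}

# Example, run from inside the beellama.cpp folder:
# check_models models/gemma-4-31B-it-Q4_K_S.gguf models/gemma4-31b-it-dflash-Q5_K_M.gguf
```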

5. Run Gemma 4 31B Without DFlash

Before enabling DFlash, we first need to run Gemma 4 31B normally. This gives us a baseline for generation speed, VRAM usage, and output quality. Later, we will compare this baseline with the DFlash run to see the actual speedup.

Run the following command from inside the beellama.cpp folder:

./llama-server \
  -m "models/gemma-4-31B-it-Q4_K_S.gguf" \
  --host 0.0.0.0 \
  --port 8910 \
  -np 1 \
  -ngl all \
  -b 2048 -ub 512 \
  --ctx-size 32768 \
  --cache-type-k q5_0 \
  --cache-type-v q4_1 \
  --flash-attn on \
  --jinja \
  --metrics \
  --log-timestamps \
  --log-prefix \
  --reasoning off \
  --temp 0.7 \
  --top-k 64 \
  --top-p 0.95 \
  --min-p 0.0

This command starts the model server on port 8910. Since we exposed port 8910 when creating the RunPod pod, we can access the model directly from the browser.

Once the model is loaded into GPU memory, you should see a message showing that the server is running on: 0.0.0.0:8910.
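
If you would rather not watch the logs, you can poll the server from a second terminal until it is ready. The sketch below is a generic retry helper; wait_until is a name I made up. The commented example assumes BeeLlama.cpp keeps the /health endpoint from upstream llama.cpp, which I have not verified for this fork.

```shell
# Retry a command until it succeeds.
# Usage: wait_until <tries> <delay_seconds> <command...>
# Returns 0 on the first success, 1 if every attempt fails.
wait_until() {
  tries=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example (assumes the server is listening on port 8910 and exposes /health):
# wait_until 60 2 curl -sf http://localhost:8910/health && echo "server ready"
```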

Now go back to your RunPod dashboard and click the port link associated with 8910.

This will open the llama.cpp web interface, where you can test the model in a simple chat-style UI.

At this point, try asking a few longer or more complex questions so you can observe the average token speed. In my baseline run without DFlash, I was getting around 41 tokens per second on average.

6. Evaluating the Baseline Model

Now that the baseline model is running, we need a simple way to measure its generation speed. For this, we will use three coding prompts and send them to the local llama.cpp server through the OpenAI-compatible chat completions endpoint.

The goal is not to create a perfect benchmark suite. We just want a consistent baseline so we can compare the same prompts later with DFlash enabled.

Launch a new Jupyter Terminal tab and create a test script:

cat > test_llm_prompts.sh <<'EOF'
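#!/usr/bin/env bash
# Usage: ./test_llm_prompts.sh [PORT] [MODEL_LABEL] [OUTPUT_PREFIX]
# Requires curl and jq.

PORT="${1:-8910}"          # port the llama-server is listening on
MODEL="${2:-local-gemma}"  # label sent in the request's "model" field
PREFIX="${3:-run}"         # prefix for the saved JSON result files

URL="http://localhost:${PORT}/v1/chat/completions"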

PROMPTS=(
"Write a complete Python task store module. Include a Task dataclass, TaskStatus enum, TaskStore class, add_task, update_task, delete_task, search_tasks, filter_by_status, export_to_json, get_all_tasks, and 5 tests. Return only one complete Python file."

"Write a complete Python key-value report module. Include a KeyValueStore class, set, get, delete, exists, list_keys, filter_by_prefix, export_to_json, load_from_json, and a generate_report function that returns total keys, empty values, prefix counts, and largest value length. Include 5 tests. Return only one complete Python file."

"Write a complete Python doubly linked list module. Include a Node dataclass, DoublyLinkedList class, append, prepend, delete, find, reverse, to_list, from_list, clear, and 5 tests. Return only one complete Python file."
)

echo "Testing server: $URL"
echo "Model: $MODEL"
echo "Output prefix: $PREFIX"

for i in "${!PROMPTS[@]}"; do
  NUM=$((i+1))
  OUT="${PREFIX}_prompt_${NUM}.json"

  echo ""
  echo "Running prompt ${NUM}..."
  echo "Saving to ${OUT}"
  echo "--------------------------------"

  jq -n \
    --arg model "$MODEL" \
    --arg prompt "${PROMPTS[$i]}" \
    '{
      model: $model,
      messages: [
        {
          role: "user",
          content: $prompt
        }
      ],
      max_tokens: 1200,
      temperature: 0.7
    }' | curl -s "$URL" \
      -H "Content-Type: application/json" \
      -d @- | tee "$OUT" | jq '.timings'

  echo "Saved full result to ${OUT}"
done

echo ""
echo "Summary"
echo "--------------------------------"

for f in "${PREFIX}"_prompt_*.json; do
  echo "$f"
  jq '{
    model: .model,
    prompt_tokens: .usage.prompt_tokens,
    completion_tokens: .usage.completion_tokens,
    total_tokens: .usage.total_tokens,
    generation_speed_tok_s: .timings.predicted_per_second,
    generation_time_sec: (.timings.predicted_ms / 1000),
    draft_tokens: .timings.draft_n,
    accepted_draft_tokens: .timings.draft_n_accepted
  }' "$f"
done
EOF

Make the script executable:

chmod +x test_llm_prompts.sh

Then run it against the baseline model:

./test_llm_prompts.sh 8910 local-gemma-baseline baseline

This script sends three Python code-generation prompts to the model and saves each full response as a JSON file. It also prints useful timing information, including completion tokens, generation speed, generation time, and draft token fields.

The full output is quite long, so below is a short summary of the baseline results. This gives us a quick overview of how the model performs before enabling DFlash.

Prompt	Completion Tokens	Generation Speed	Generation Time
Prompt 1: Task store module	1124	40.66 tok/s	27.64 sec
Prompt 2: Key-value report module	1200	40.67 tok/s	29.51 sec
Prompt 3: Doubly linked list module	1200	40.72 tok/s	29.47 sec

Across all three prompts, the baseline model stayed very consistent at around 40.68 tokens per second. This gives us a clear reference point before testing the same prompts with DFlash enabled.
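
The average above is just the arithmetic mean of the three generation speeds. As a quick sketch, it can be reproduced like this; mean is a helper name I made up.

```shell
# Arithmetic mean of the numbers passed as arguments, to two decimals.
mean() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR > 0) printf "%.2f\n", sum / NR }'
}

mean 40.66 40.67 40.72   # prints 40.68

# To average straight from the saved result files instead (requires jq):
# mean $(jq -r '.timings.predicted_per_second' baseline_prompt_*.json)
```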

7. Run Gemma 4 31B with DFlash

Now that we have the baseline results, we can run the same model again with DFlash enabled.

Go back to the terminal where the baseline server is running and stop it with Ctrl + C.

Then start the optimized DFlash server:

./llama-server \
  -m "models/gemma-4-31B-it-Q4_K_S.gguf" \
  --spec-draft-model "models/gemma4-31b-it-dflash-Q5_K_M.gguf" \
  --spec-type dflash \
  --spec-dflash-cross-ctx 1024 \
  --host 0.0.0.0 \
  --port 8910 \
  -np 1 \
  --kv-unified \
  -ngl all \
  --spec-draft-ngl all \
  -b 2048 -ub 512 \
  --ctx-size 32768 \
  --flash-attn on \
  --cache-ram 0 \
  --jinja \
  --no-mmap \
  --mlock \
  --no-host \
  --metrics \
  --log-timestamps \
  --log-prefix \
  --reasoning off \
  --temp 0.7 \
  --top-k 64 \
  --top-p 0.95 \
  --min-p 0.0

This command loads the same main Gemma 4 31B model, but now it also loads the DFlash draft model using --spec-draft-model.

The important DFlash-related flags are:

Flag	Purpose
`--spec-draft-model`	Loads the DFlash draft model
`--spec-type dflash`	Enables DFlash speculative decoding
`--spec-dflash-cross-ctx 1024`	Sets the cross-context window used by DFlash
`--spec-draft-ngl all`	Offloads the draft model layers to the GPU
`--kv-unified`	Uses unified KV handling for the main and draft model setup

It may take a little longer to start this time because both the main model and the DFlash draft model need to be loaded into memory.
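
With two models on a 24 GB card, it is worth checking how much VRAM is left once everything has loaded. Here is a minimal sketch; vram_free_mib is a hypothetical helper name, and the numbers in the comment are an illustrative sample line, not a measurement from my run.

```shell
# Free VRAM in MiB from one "used, total" CSV line (values in MiB), as printed by:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
# Example input: "22500, 24564" gives 2064
vram_free_mib() {
  printf '%s\n' "$1" | awk -F', *' '{ print $2 - $1 }'
}

# Query the first GPU only when nvidia-smi is available.
if command -v nvidia-smi >/dev/null 2>&1; then
  line=$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | head -n 1)
  echo "Free VRAM: $(vram_free_mib "$line") MiB"
fi
```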

Once the server is fully loaded, you should again see the inference server running on: 0.0.0.0:8910.

8. Evaluating the DFlash Model

Now go back to the Jupyter terminal where we created the benchmark script. We can run the same script again, but this time against the DFlash-enabled server.

./test_llm_prompts.sh 8910 local-gemma-dflash dflash

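This uses the same three coding prompts from the baseline test, so the two runs are directly comparable. The main difference is that the server is now running with the DFlash draft model enabled. Note that the DFlash command also omits the --cache-type-k and --cache-type-v flags used in the baseline and adds memory flags such as --no-mmap and --mlock, so this is not a strict single-variable comparison.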

Comparing inference speed

The full output is long, so here is a short summary of the baseline and DFlash results:

Prompt	Baseline speed	DFlash speed	Speedup	Baseline time	DFlash time	Time saved
Task store module	40.66 tok/s	130.96 tok/s	3.22x	27.64 sec	8.23 sec	19.41 sec
Key-value report module	40.67 tok/s	145.68 tok/s	3.58x	29.51 sec	8.24 sec	21.27 sec
Doubly linked list module	40.72 tok/s	153.04 tok/s	3.76x	29.47 sec	7.84 sec	21.63 sec

Across these three coding tasks, DFlash increased generation speed from around 41 tok/s to 131–153 tok/s. That is roughly a 3.2x to 3.8x speedup, and it cuts generation time from 28–30 seconds to around 8 seconds per prompt.
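
The speedup and time-saved columns are simple ratios and differences of the measured values. As a sketch, here is how to recompute them for your own runs; the helper names speedup and time_saved are made up for this example.

```shell
# speedup <baseline tok/s> <dflash tok/s>: ratio of the two generation speeds.
speedup() {
  awk -v b="$1" -v d="$2" 'BEGIN { printf "%.2f\n", d / b }'
}

# time_saved <baseline seconds> <dflash seconds>: difference in generation time.
time_saved() {
  awk -v b="$1" -v d="$2" 'BEGIN { printf "%.2f\n", b - d }'
}

speedup 40.66 130.96     # prints 3.22 (task store module)
time_saved 27.64 8.23    # prints 19.41
```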

You can also open the same 8910 port link from the RunPod dashboard and test the model through the web UI.

Comparing output quality

Since we are getting up to about a 3.8x speedup on coding prompts, the next thing to check is output quality. For that, I tested the model on a few different tasks.

First, I asked it to generate a simple portfolio website for “Abid.” For a local 31B model running on a single RTX 4090, the result was impressive. It produced a clean structure with usable HTML and styling.

Next, I asked it to generate a diagram for a complete MLOps pipeline. The model returned Mermaid code with labels, colors, and a complete workflow. I tested the code, and it worked right out of the box.

Then I asked it to write a blog on Mixture of Experts in LLMs. The quality was still strong, but the speed dropped to around 95 tok/s. This is still about 2.3x the baseline of roughly 41 tok/s, but slower than the coding prompts.

This makes sense because DFlash works best when the output is more predictable. Coding tasks often follow clear patterns, so the draft model can guess more tokens correctly. Creative writing or research-style prompts are less predictable, so the model may accept fewer draft tokens and the speedup can be lower.
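
You can check this on your own prompts by looking at the draft acceptance rate. The benchmark script already saves the draft_n and draft_n_accepted fields from the server's timings output, so the rate is just their ratio. The sketch below uses a made-up helper name, accept_rate, and the numbers in the example are illustrative, not results from my runs.

```shell
# accept_rate <accepted draft tokens> <total draft tokens>
# Prints the acceptance rate as a percentage, or "n/a" when nothing was drafted
# (for example, for results from the baseline server).
accept_rate() {
  awk -v a="$1" -v n="$2" 'BEGIN {
    if (n + 0 > 0) printf "%.1f\n", 100 * a / n
    else print "n/a"
  }'
}

accept_rate 750 1000   # prints 75.0 (illustrative numbers)

# With a saved result file from the DFlash run (requires jq):
# accept_rate "$(jq -r '.timings.draft_n_accepted' dflash_prompt_1.json)" \
#             "$(jq -r '.timings.draft_n' dflash_prompt_1.json)"
```

A higher rate on coding prompts than on prose prompts would be consistent with the speed difference described above.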

Final Thoughts

After testing this setup, I think speculative decoding combined with better KV-cache handling is the real winner for local LLM inference.

The biggest benefit is not just the speedup on paper. It is what that speed unlocks. When a 31B model can generate code at 130–150 tokens per second on a single RTX 4090, it starts to feel practical as a local coding agent. You can use it to build projects from scratch, connect it with MCP servers, run bash tools, use custom skills, and create a workflow that feels much closer to premium coding agents.

For people who already have an RTX 3090 or 4090, this is even more exciting. Instead of paying for every coding assistant or relying completely on cloud tools, you can run a powerful local setup that is fast, private, and flexible. It may not replace every hosted tool for everyone, but for local AI enthusiasts, developers, and builders, it is getting very close.

I also think this is just the start. Many people are already testing similar setups with newer models like Qwen3.6-27B and reporting even better quality. As the models improve, draft models get better, and inference engines like BeeLlama.cpp become more optimized, local AI will only become more useful.

The best part is the community around it. A lot of these improvements are coming from local AI enthusiasts who are experimenting, benchmarking, improving the tools, and sharing their results openly. That makes it easier for the rest of us to replicate the setup and experience the same performance gains.

Author

Abid Ali Awan

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.
