Skip to main content

How to Speed Up Local LLMs with DFlash Speculative Decoding

Learn how to accelerate local Gemma 4 31B inference on a single RTX 4090 using DFlash speculative decoding and Flash Attention against a baseline setup.
Jun 17, 2026  · 11 min read

Over the past few weeks, I have been seeing a lot of excitement in the r/LocalLLaMA community around speculative decoding, DFlash, better KV cache handling, and optimized llama.cpp builds. The interesting part is that people are getting major speedups without upgrading their hardware.

In this guide, we will run Gemma 4 31B IT locally on an RTX 4090 24GB using BeeLlama.cpp, a fork of llama.cpp that supports DFlash speculative decoding.

We will test the model in two ways. First, we will run it without DFlash to create a baseline. Then, we will run it with DFlash to compare the speed improvement.

Earn a Top AI Certification

Demonstrate you can effectively and responsibly use AI.
Get Certified, Get Hired

What Is DFlash?

In simple terms, DFlash uses a draft model to predict several tokens ahead, while the main model verifies those tokens instead of generating everything one token at a time. When many draft tokens are accepted, generation becomes much faster while keeping the output close to the original model.

In my experiment, DFlash delivered almost a 3.7x speedup on certain tasks, with outputs that were very similar to the baseline. The goal of this guide is to show the setup, run both versions, and compare the results clearly.

How DFlash Works

Standard LLM generation is slow because most models generate text one token at a time. Each token depends on the previous one, so the model has to move step by step through the response.

DFlash speeds this up using speculative decoding.

Instead of asking the main model to generate every token directly, DFlash uses a separate draft model to guess several upcoming tokens first. The main model then verifies those draft tokens in a larger step. If the draft tokens are good, the main model accepts them. If one of them is wrong, the main model corrects it and continues.

A simple way to think about it:

  • Without DFlash: the main model writes one token at a time.
  • With DFlash: the draft model suggests a block of tokens, and the main model quickly checks which ones it can accept.

Diagram of the DFlash speculative decoding workflow.

Diagram of the DFlash speculative decoding workflow.

This is especially useful for structured tasks like programming. Code often follows predictable patterns such as imports, function definitions, indentation, loops, and common syntax. Because of this, the draft model can often guess the next tokens correctly, allowing the main model to accept more tokens in each step.

DFlash vs MTP: What Is the Difference?

DFlash and Multi-Token Prediction (MTP) both aim to solve the same problem: they help the model generate more than one token per expensive decoding step.

The difference is how they create the draft tokens.

Method

How It Works

Extra Model Needed?

Main Strength

MTP

Uses built-in multi-token prediction heads to predict future tokens

Usually no separate draft model

Simpler setup when the model already supports MTP

DFlash

Uses a separate DFlash draft model to propose larger blocks of tokens

Yes

Can achieve strong speedups on structured outputs like code

In simple terms, MTP is usually built into the model itself. It predicts multiple future tokens using internal prediction heads, so it can be easier to configure and more memory-efficient when supported.

DFlash, on the other hand, uses a separate draft model. This can make the setup slightly heavier, but it also allows more aggressive drafting. That is why DFlash can deliver large speedups on structured tasks where the next tokens are easier to predict.

1. Setting Up the Environment

I highly recommend running this setup locally if you have an RTX 3090 or RTX 4090 GPU. Otherwise, you can rent a GPU from RunPod, Vast.ai, or any other GPU provider.

For this guide, we will use a RunPod RTX 4090 pod. I started with the latest RunPod PyTorch template and made a few small changes:

  • Exposed port 8910 for the llama.cpp server
  • Increased persistent storage to 100 GB
  • Added my Hugging Face token to improve model download speed

Editing the latest Pytorch template in the Runpod

With this setup, the pod costs around $0.70 per hour, depending on current RunPod pricing and availability.

Runpod Pytorch docker running on the RTC 4090 machine.

Once the pod is deployed, open JupyterLab from the RunPod dashboard. Then launch a new terminal and install the basic dependencies:

apt update
apt install -y git cmake build-essential curl wget python3-pip

Installing Linux pages within the Jupyter Terminal.

2. Clone BeeLlama.cpp

Next, we need to clone BeeLlama.cpp, the llama.cpp fork we will use for this setup.

BeeLlama.cpp is designed for faster local GGUF inference while keeping the familiar llama.cpp workflow. You still get the same style of tools, including llama-server, but with extra performance-focused features such as DFlash speculative decoding, adaptive draft control, and TurboQuant/TCQ KV-cache compression

Run the following commands inside your JupyterLab terminal:

git clone https://github.com/Anbeeld/beellama.cpp.git
cd beellama.cpp

This will download the BeeLlama.cpp repository and move you into the project folder. All the build commands in the next step should be run from inside this directory.

3. Build BeeLlama.cpp with CUDA

Now we will build BeeLlama.cpp with CUDA support so it can use the RTX 4090 properly.

For this setup, we will enable CUDA, Flash Attention, native CPU optimizations, and quantized Flash Attention kernels. Since we are using an RTX 4090, we also set the CUDA architecture to 89.

cmake -B build -DGGML_CUDA=ON -DGGML_NATIVE=ON \
 -DGGML_CUDA_FA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
 -DCMAKE_CUDA_ARCHITECTURES=89 \
 -DCMAKE_BUILD_TYPE=Release

cmake --build build -j

The build may take 20 minutes. During compilation, you might see warnings related to TurboQuant, TCQ, or DFlash CUDA declarations. In my case, these were just warnings and did not stop the build.

BeeLlama.cpp with CUDA build is complete

Finally, copy the server binary into the main project folder so it is easier to run later: 

cp ./build/bin/llama-server ./llama-server

4. Install Hugging Face CLI and Download the Models

Now we need to download two GGUF files: the main model and the DFlash draft model.

The main model is the one that produces the final output. The DFlash draft model is much smaller and is used only to predict tokens ahead of the main model. The main model still verifies the generated tokens, so the draft model is there to speed up decoding rather than replace the main model.

First, install the Hugging Face CLI:

pip install -U huggingface_hub

Then create a folder to keep the model files organized:

mkdir -p models

Download the main Gemma 4 31B IT GGUF model:

hf download unsloth/gemma-4-31B-it-GGUF \
gemma-4-31B-it-Q4_K_S.gguf \
--local-dir models

Next, download the DFlash draft model:

hf download Anbeeld/gemma-4-31B-it-DFlash-GGUF \
gemma4-31b-it-dflash-Q5_K_M.gguf \
--local-dir models

The DFlash draft model is listed on Hugging Face as a dflash-draft architecture model, with the Q5_K_M file around 1.09GB, so it is much smaller than the main 31B model. This is what makes it practical to load alongside the main model for speculative decoding. 

5. Run the Gemma 4 31B Without DFlash

Before enabling DFlash, we first need to run Gemma 4 31B normally. This gives us a baseline for generation speed, VRAM usage, and output quality. Later, we will compare this baseline with the DFlash run to see the actual speedup.

Run the following command from inside the beellama.cpp folder:

./llama-server \
 -m "models/gemma-4-31B-it-Q4_K_S.gguf"  \
--host 0.0.0.0  \
 --port 8910 \
-np 1 \
-ngl all \
-b 2048 -ub 512 \
--ctx-size 32768  \
--cache-type-k q5_0 \
--cache-type-v q4_1 \
--flash-attn on \
--jinja \
--metrics \
--log-timestamps \
--log-prefix \
--reasoning off \
--temp 0.7 \
--top-k 64 \
--top-p 0.95 \
--min-p 0.0

This command starts the model server on port 8910. Since we exposed port 8910 when creating the RunPod pod, we can access the model directly from the browser.

Once the model is loaded into GPU memory, you should see a message showing that the server is running on: 0.0.0.0:8910.

The Gemma 4 31B Without DFlash inference serve is running locally.

Now go back to your RunPod dashboard and click the port link associated with 8910. 

Access the 8910 port from the Runpod Dashboard

This will open the llama.cpp web interface, where you can test the model in a simple chat-style UI. 

Testing the Gemma 4 31B Without DFlash

At this point, try asking a few longer or more complex questions so you can observe the average token speed. In my baseline run without DFlash, I was getting around 41 tokens per second on average. 

Testing the Gemma 4 31B Without DFlash

6. Evaluating the Baseline Model

Now that the baseline model is running, we need a simple way to measure its generation speed. For this, we will use three coding prompts and send them to the local llama.cpp server through the OpenAI-compatible chat completions endpoint.

The goal is not to create a perfect benchmark suite. We just want a consistent baseline so we can compare the same prompts later with DFlash enabled.

Launch a new Jupyter Terminal tab and create a test script:

cat > test_llm_prompts.sh <<'EOF'
#!/usr/bin/env bash

PORT="${1:-8910}"
MODEL="${2:-local-gemma}"
PREFIX="${3:-run}"

URL="http://localhost:${PORT}/v1/chat/completions"

PROMPTS=(
"Write a complete Python task store module. Include a Task dataclass, TaskStatus enum, TaskStore class, add_task, update_task, delete_task, search_tasks, filter_by_status, export_to_json, get_all_tasks, and 5 tests. Return only one complete Python file."

"Write a complete Python key-value report module. Include a KeyValueStore class, set, get, delete, exists, list_keys, filter_by_prefix, export_to_json, load_from_json, and a generate_report function that returns total keys, empty values, prefix counts, and largest value length. Include 5 tests. Return only one complete Python file."

"Write a complete Python doubly linked list module. Include a Node dataclass, DoublyLinkedList class, append, prepend, delete, find, reverse, to_list, from_list, clear, and 5 tests. Return only one complete Python file."
)

echo "Testing server: $URL"
echo "Model: $MODEL"
echo "Output prefix: $PREFIX"

for i in "${!PROMPTS[@]}"; do
  NUM=$((i+1))
  OUT="${PREFIX}_prompt_${NUM}.json"

  echo ""
  echo "Running prompt ${NUM}..."
  echo "Saving to ${OUT}"
  echo "--------------------------------"

  jq -n \
    --arg model "$MODEL" \
    --arg prompt "${PROMPTS[$i]}" \
    '{
      model: $model,
      messages: [
        {
          role: "user",
          content: $prompt
        }
      ],
      max_tokens: 1200,
      temperature: 0.7
    }' | curl -s "$URL" \
      -H "Content-Type: application/json" \
      -d @- | tee "$OUT" | jq '.timings'

  echo "Saved full result to ${OUT}"
done

echo ""
echo "Summary"
echo "--------------------------------"

for f in ${PREFIX}_prompt_*.json; do
  echo "$f"
  jq '{
    model: .model,
    prompt_tokens: .usage.prompt_tokens,
    completion_tokens: .usage.completion_tokens,
    total_tokens: .usage.total_tokens,
    generation_speed_tok_s: .timings.predicted_per_second,
    generation_time_sec: (.timings.predicted_ms / 1000),
    draft_tokens: .timings.draft_n,
    accepted_draft_tokens: .timings.draft_n_accepted
  }' "$f"
done
EOF

On macOS or Linux, remember to make the script executable:

chmod +x test_llm_prompts.sh

Then run it against the baseline model: 

./test_llm_prompts.sh 8910 local-gemma-baseline baseline

This script sends three Python code-generation prompts to the model and saves each full response as a JSON file. It also prints useful timing information, including completion tokens, generation speed, generation time, and draft token fields. 

The full output is quite long, so below is a short summary of the baseline results. This gives us a quick overview of how the model performs before enabling DFlash.

Prompt

Completion Tokens

Generation Speed

Generation Time

Prompt 1: Task store module

1124

40.66 tok/s

27.64 sec

Prompt 2: Key-value report module

1200

40.67 tok/s

29.51 sec

Prompt 3: Doubly linked list module

1200

40.72 tok/s

29.47 sec

Across all three prompts, the baseline model stayed very consistent at around 40.68 tokens per second. This gives us a clear reference point before testing the same prompts with DFlash enabled.

7. Run Gemma 4 31B with DFlash

Now that we have the baseline results, we can run the same model again with DFlash enabled.

Go back to the terminal where the baseline server is running and stop it with Ctrl + C

Then start the optimized DFlash server:

./llama-server \
-m "models/gemma-4-31B-it-Q4_K_S.gguf" \
--spec-draft-model "models/gemma4-31b-it-dflash-Q5_K_M.gguf" \
--spec-type dflash \
--spec-dflash-cross-ctx 1024 \
--host 0.0.0.0  \
 --port 8910 \
-np 1 \
--kv-unified \
-ngl all \
--spec-draft-ngl all \
-b 2048 -ub 512 \
--ctx-size 32768 \
--flash-attn on \
--cache-ram 0 \
--jinja \
--no-mmap \
--mlock \
--no-host \
--metrics \
--log-timestamps \
--log-prefix \
--reasoning off \
--temp 0.7 \
--top-k 64 \
--top-p 0.95 \
--min-p 0.0

This command loads the same main Gemma 4 31B model, but now it also loads the DFlash draft model using --spec-draft-model.

The important DFlash-related flags are:

Flag

Purpose

--spec-draft-model

Loads the DFlash draft model

--spec-type dflash

Enables DFlash speculative decoding

--spec-dflash-cross-ctx 1024

Sets the cross-context window used by DFlash

--spec-draft-ngl all

Offloads the draft model layers to the GPU

--kv-unified

Uses unified KV handling for the main and draft model setup

It may take a little longer to start this time because both the main model and the DFlash draft model need to be loaded into memory.

Loading the Gemma 4 31B With DFlash in the GPU memory

Once the server is fully loaded, you should again see the inference server running on: 0.0.0.0:8910.

Running the Gemma 4 31B With DFlash locally

8. Evaluating the DFlash Model

Now go back to the Jupyter terminal where we have created the benchmark script. We can run the same script again, but this time against the DFlash-enabled server. 

./test_llm_prompts.sh 8910 local-gemma-dflash dflash

This uses the same three coding prompts from the baseline test, which makes the comparison fair. The only major difference is that the server is now running with the DFlash draft model enabled.

Comparing inference speed

The full output is long, so here is a short summary of the baseline and DFlash results:

Prompt

Baseline speed

DFlash speed

Speedup

Baseline time

DFlash time

Time saved

Task store module

40.66 tok/s

130.96 tok/s

3.22x

27.64 sec

8.23 sec

19.41 sec

Key-value report module

40.67 tok/s

145.68 tok/s

3.58x

29.51 sec

8.24 sec

21.27 sec

Doubly linked list module

40.72 tok/s

153.04 tok/s

3.76x

29.47 sec

7.84 sec

21.63 sec

Across these three coding tasks, DFlash increased generation speed from around 40 tok/s to 130–153 tok/s. That gives us roughly a 3.2x to 3.8x speedup, while reducing generation time from almost 30 seconds to around 8 seconds per prompt. 

You can also open the same 8910 port link from the RunPod dashboard and test the model through the web UI.

Comparing output quality

Since we are getting close to a 4x speedup on coding prompts, the next thing to check is output quality. For that, I tested the model on a few different tasks.

First, I asked it to generate a simple portfolio website for “Abid.” For a local 31B model running on a single RTX 4090, the result was impressive. It produced a clean structure with usable HTML and styling. 

Website generated by Gemma 4

Next, I asked it to generate a diagram for a complete MLOps pipeline. The model returned Mermaid code with labels, colors, and a complete workflow. I tested the code, and it worked right out of the box. 

Testing the Gemma 4 31B With DFlash for the diagram generation.

Then I asked it to write a blog on Mixture of Experts in LLMs. The quality was still strong, but the speed dropped to around 95 tok/s. This is still much faster than the baseline, but slower than the coding prompts. 

Testing the Gemma 4 31B With DFlash for blog writing.

This makes sense because DFlash works best when the output is more predictable. Coding tasks often follow clear patterns, so the draft model can guess more tokens correctly. Creative writing or research-style prompts are less predictable, so the model may accept fewer draft tokens and the speedup can be lower. 

Final Thoughts

After testing this setup, I think speculative decoding combined with better KV-cache handling is the real winner for local LLM inference.

The biggest benefit is not just the speedup on paper. It is what that speed unlocks. When a 31B model can generate code at 130–150 tokens per second on a single RTX 4090, it starts to feel practical as a local coding agent. You can use it to build projects from scratch, connect it with MCP servers, run bash tools, use custom skills, and create a workflow that feels much closer to premium coding agents.

For people who already have an RTX 3090 or 4090, this is even more exciting. Instead of paying for every coding assistant or relying completely on cloud tools, you can run a powerful local setup that is fast, private, and flexible. It may not replace every hosted tool for everyone, but for local AI enthusiasts, developers, and builders, it is getting very close.

I also think this is just the start. Many people are already testing similar setups with newer models like Qwen3.6-27B and reporting even better quality. As the models improve, draft models get better, and inference engines like BeeLlama.cpp become more optimized, local AI will only become more useful.

The best part is the community around it. A lot of these improvements are coming from local AI enthusiasts who are experimenting, benchmarking, improving the tools, and sharing their results openly. That makes it easier for the rest of us to replicate the setup and experience the same performance gains.


Abid Ali Awan's photo
Author
Abid Ali Awan
LinkedIn
Twitter

As a certified data scientist, I am passionate about leveraging cutting-edge technology to create innovative machine learning applications. With a strong background in speech recognition, data analysis and reporting, MLOps, conversational AI, and NLP, I have honed my skills in developing intelligent systems that can make a real impact. In addition to my technical expertise, I am also a skilled communicator with a talent for distilling complex concepts into clear and concise language. As a result, I have become a sought-after blogger on data science, sharing my insights and experiences with a growing community of fellow data professionals. Currently, I am focusing on content creation and editing, working with large language models to develop powerful and engaging content that can help businesses and individuals alike make the most of their data.

Topics

Top AI Courses

Track

Associate AI Engineer for Developers

26 hr
Learn how to integrate AI into software applications using APIs and open-source libraries. Start your journey to becoming an AI Engineer today!
See DetailsRight Arrow
Start Course
See MoreRight Arrow
Related

Tutorial

How to Run GLM 4.7 Flash Locally

Learn how to run GLM-4.7-Flash on an RTX 3090 for fast local inference and integrating with OpenCode to build a fully local automated AI coding agent.
Abid Ali Awan's photo

Abid Ali Awan

Tutorial

How to Run DeepSeek V4 Flash Locally

Learn how to run the full DeepSeek V4 Flash model on a single GPU using a modified llama.cpp build and a compatible GGUF file in this hands-on tutorial.
Abid Ali Awan's photo

Abid Ali Awan

Tutorial

Multi-Token Prediction Tutorial: How To Speed Up LLMs

Run Qwen3.6 27B on an RTX 3090 and learn how Multi-Token Prediction (MTP) with llama.cpp can boost local LLM inference by almost 2x without upgrading your GPU.
Abid Ali Awan's photo

Abid Ali Awan

Tutorial

Speculative Decoding: A Guide With Implementation Examples

Learn what speculative decoding is, how it works, when to use it, and how to implement it using Gemma2 models.
Aashi Dutt's photo

Aashi Dutt

Tutorial

Fine Tuning Google Gemma: Enhancing LLMs with Customized Instructions

Learn how to run inference on GPUs/TPUs and fine-tune the latest Gemma 7b-it model on a role-play dataset.
Abid Ali Awan's photo

Abid Ali Awan

Tutorial

Fine-Tune and Run Inference on Google's Gemma Model Using TPUs for Enhanced Speed and Performance

Learn to infer and fine-tune LLMs with TPUs and implement model parallelism for distributed training on 8 TPU devices.
Abid Ali Awan's photo

Abid Ali Awan

See MoreSee More