DeepSeek V4 Flash is the smaller, faster, and more cost-efficient model in the DeepSeek V4 preview series. It is designed for practical inference workloads, with lower active parameters than DeepSeek V4 Pro and support for long-context tasks. The GGUF version used in this guide stores dense weights in FP8 and MoE expert weights in FP4, making it suitable for local inference through a custom llama.cpp build.
In this guide, we will run DeepSeek V4 Flash locally on RunPod using an RTX PRO 6000 GPU and a modified llama.cpp build. You will learn how to set up the GPU pod, install the required dependencies, compile llama.cpp with DeepSeek V4 support, download the FP4/FP8 GGUF model from Hugging Face, and serve it through the browser-based llama.cpp Web UI.
Before you begin, make sure you have:
- A RunPod account
- At least $5 in RunPod credit
- Basic familiarity with Linux terminal commands
- A Hugging Face account
- A Hugging Face access token saved as HF_TOKEN
You will use the Hugging Face token to download the model faster and more reliably.
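If you prefer to set the token from a terminal instead of through the pod's environment variables, a minimal sketch looks like this (the token value below is a placeholder — use your own token from huggingface.co/settings/tokens):

```shell
# Replace the placeholder with your real Hugging Face token.
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"

# Confirm the variable is set without printing the full secret.
echo "HF_TOKEN starts with: $(printf '%s' "$HF_TOKEN" | cut -c1-3)"
```

Tools like the Hugging Face CLI pick up HF_TOKEN automatically from the environment, so setting it once per session is enough.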
If you want to see how the model stacks up against its proprietary competitors from OpenAI, I recommend reading our DeepSeek V4 Flash vs GPT-5.4 Mini and Nano comparison guide.
Step 1: Set Up the RunPod Environment
First, create a new GPU pod on RunPod.
For this guide, we are using the RTX PRO 6000 GPU because it offers 96GB of VRAM at a much lower cost than an H100. This makes it a practical option for running the full DeepSeek V4 Flash model on a single GPU without paying premium H100 pricing.
In the RunPod dashboard, select an RTX PRO 6000 GPU pod and use the latest PyTorch template as the base image.
Before deploying the pod, edit the template settings and configure the storage, exposed port, and environment variables.
Use the following recommended setup:
| Setting | Recommended Value |
| --- | --- |
| GPU | RTX PRO 6000 |
| Container Disk | 50 GB |
| Volume Disk | 300 GB |
| Exposed Port | 8910 |
| Template | Latest PyTorch template |
| Environment Variable | HF_TOKEN |

The exposed port 8910 is important because this is the port you will use to access the llama.cpp Web UI from your browser.

Once the pod is deployed, wait a few seconds for the RunPod dashboard to show the JupyterLab link.
Open JupyterLab, then launch a terminal. To confirm that the GPU is available, run:
nvidia-smi

This should display information about the GPU, memory, CUDA version, and driver version.
Next, install the system dependencies required to build and run llama.cpp.
apt-get update
apt-get install -y \
pciutils \
build-essential \
cmake \
git \
curl \
wget \
libcurl4-openssl-dev \
tmux \
python3 \
python3-pip \
python3-venv
These packages include build tools, CMake, Git, Python, and other utilities needed to compile llama.cpp from source.
Step 2: Install the Modified llama.cpp Build
DeepSeek V4 Flash is still very new, so local support is not as straightforward as older models. At the time of writing, there is no widely adopted official GGUF release from major community providers such as Unsloth for running the full model through standard upstream llama.cpp.
The official DeepSeek V4 Flash model is available on Hugging Face, but the local GGUF route still depends on community conversions and experimental runtime support. The GGUF used in this guide specifically states that the stock upstream llama.cpp cannot load it and requires a work-in-progress build with DeepSeek V4 Flash architecture support, native FP8, and MXFP4 support.
Because of that, this setup uses an open-source contributor’s modified llama.cpp branch rather than the standard upstream version. This is currently the practical path for testing the full DeepSeek V4 Flash GGUF locally.
The upstream llama.cpp project also has an open model request for DeepSeek V4 support, which shows that official support is still being worked through rather than fully merged into the main project.
Move into the workspace directory:
cd /workspace
Clone the modified repository:
git clone -b wip/deepseek-v4-support https://github.com/nisparks/llama.cpp.git llama.cpp-deepseek-v4
Now configure the build using CMake:
cmake llama.cpp-deepseek-v4 \
-B llama.cpp-deepseek-v4/build \
-DBUILD_SHARED_LIBS=OFF \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release
This enables CUDA support, so the model can use GPU acceleration.
Build the required binaries:
cmake --build llama.cpp-deepseek-v4/build \
--config Release \
-j \
--clean-first \
--target llama-cli llama-server llama-gguf-split
After the build finishes, copy the binaries into the main project folder:
cp llama.cpp-deepseek-v4/build/bin/llama-* llama.cpp-deepseek-v4/
Finally, check that the server binary works:
llama.cpp-deepseek-v4/llama-server --help
If the help menu appears, the build was successful.

Step 3: Download the DeepSeek V4 Flash Model
Next, install the Hugging Face download tools. This is where the HF_TOKEN you added earlier becomes important. Since this is a large model file, logging in with your Hugging Face token improves download reliability and gives you access to faster download methods.
Install the required packages:
pip install -U "huggingface_hub[hf_xet]" hf-xet hf_transfer
Enable faster Hugging Face downloads:
export HF_HUB_ENABLE_HF_TRANSFER=1
Create a folder for the model:
mkdir -p /workspace/models/deepseek-v4-flash-fp4-fp8
Download the GGUF model file:
hf download nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF \
DeepSeek-V4-Flash-FP4-FP8-native.gguf \
--local-dir /workspace/models/deepseek-v4-flash-fp4-fp8
With hf_transfer enabled and your HF_TOKEN already set in the RunPod environment, the model download can reach very high speeds.
In this setup, the download reached almost 2 GB per second, which makes downloading a large GGUF file much more practical. This speed is only possible when your Hugging Face token is configured properly, and the pod can authenticate with Hugging Face.

Once the download is complete, verify the file:
ls -lh /workspace/models/deepseek-v4-flash-fp4-fp8
You should see a file similar to this:
total 146G
-rw-rw-rw- 1 root root 146G May 3 18:27 DeepSeek-V4-Flash-FP4-FP8-native.gguf
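Beyond checking the file size, you can run a quick integrity check: per the GGUF specification, every valid GGUF file begins with the 4-byte ASCII magic "GGUF". A small sketch — the /tmp stand-in file is only there so the snippet runs anywhere; on the pod, the real model file is used:

```shell
# Path to the downloaded model on the pod.
MODEL=/workspace/models/deepseek-v4-flash-fp4-fp8/DeepSeek-V4-Flash-FP4-FP8-native.gguf

# Stand-in so the snippet also runs off the pod (demo only).
[ -f "$MODEL" ] || { MODEL=/tmp/demo.gguf; printf 'GGUF' > "$MODEL"; }

# A valid GGUF file prints the magic "GGUF" here.
head -c 4 "$MODEL"; echo
```

If the first four bytes are anything other than "GGUF", the download was likely corrupted and should be repeated.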
Step 4: Serve DeepSeek V4 Flash with llama.cpp
Now that the model is downloaded and the modified llama.cpp build is ready, the next step is to start the local inference server so you can access DeepSeek V4 Flash through the browser-based Web UI and API endpoint.
Move into the llama.cpp directory:
cd /workspace/llama.cpp-deepseek-v4
Start the model server:
./llama-server \
--model /workspace/models/deepseek-v4-flash-fp4-fp8/DeepSeek-V4-Flash-FP4-FP8-native.gguf \
--alias "DeepSeek-V4-Flash" \
--host 0.0.0.0 \
--port 8910 \
--jinja \
--fit on \
--threads 16 \
--threads-batch 16 \
--ctx-size 32768 \
--batch-size 2048 \
--ubatch-size 512 \
--flash-attn on \
--temp 0.7 \
--top-p 0.95 \
--cont-batching \
--metrics \
--perf
This command loads the GGUF model, exposes the server on 0.0.0.0:8910, applies the Jinja chat template, uses --fit on to fit the model into the available GPU and system memory, sets a 32K context window, enables CUDA-friendly batching and Flash Attention for faster inference, and turns on metrics and performance logging so you can monitor the run.
The model may take at least a minute to load into the GPU and CPU memory.

When the server is ready, you should see a message showing that it is “listening on http://0.0.0.0:8910”.

This means the model server is running and ready to receive requests.
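Besides the Web UI, llama-server exposes an OpenAI-compatible HTTP API on the same port, so you can also script requests. A minimal sketch, assuming the port and model alias from the serve command above; the payload is validated locally before sending:

```shell
# Minimal chat request for llama-server's OpenAI-compatible endpoint.
PAYLOAD='{"model":"DeepSeek-V4-Flash","messages":[{"role":"user","content":"Say hello in one sentence."}]}'

# Sanity-check the JSON before sending it.
echo "$PAYLOAD" | python3 -m json.tool > /dev/null && echo "payload ok"

# Send the request once the server reports it is listening.
curl -s http://localhost:8910/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable yet"
```

On the pod, the response comes back as JSON with the model's reply in the choices array, which makes it easy to wire the server into scripts or other tools.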
Go back to your RunPod dashboard. Look for the exposed port 8910, then click the port link.

This will open the llama.cpp Web UI in your browser. The interface looks similar to a basic ChatGPT-style chat interface.

Once the page opens, the model should already be loaded. You can start chatting with it directly from the browser.
Step 5: Testing DeepSeek V4 Flash Locally
After the server is running, you can test the model using different types of prompts.
The goal is to check how well it performs across:
- UI generation
- Writing and explanation
- Math reasoning
- Full project generation
Test 1: UI and web page generation
Use the following prompt:
Build a simple, single-screen HTML landing page for a fictional company called NovaGrid AI, with a centered headline, one short paragraph, three feature cards, and a "Get Started" button, using clean modern styling with no scrolling.

In this test, the model generated the HTML page in about 2 minutes, which is a reasonable time.
To preview the generated page, look for the eye icon near the code output in the Web UI. Click it to open the rendered web page.

The page worked, but the visual quality was not very impressive. The layout was functional, but the design felt basic. Smaller models can sometimes produce more polished frontend outputs, so this result was underwhelming for UI generation.

Test 2: Writing and Explanation
Next, test the model’s writing ability.
Use this prompt:
Write an 800-word report on Agentic Skills, explaining what they are, why they matter for AI agents, key examples such as tool use, planning, memory, reflection, and task execution, and how they can help businesses automate complex workflows.

The model produced a clear and well-structured report. It explained the main ideas in a simple way and included useful examples of tool use, planning, memory, reflection, and business automation.
However, the output felt slightly generic and promotional in some places, especially near the conclusion. It also included several formatting and spelling issues, such as inconsistent bolding and wording errors like “Mainate Context.”
Test 3: Math and reasoning
Now test the model’s reasoning ability with a simple algebra problem.
Use this prompt:
Solve the following math problem step by step. Show your reasoning clearly, check your work, and provide the final answer in a boxed format.
Problem:
A small online store sells notebooks and pens. A notebook costs $4 more than a pen. On Monday, the store sold 12 notebooks and 30 pens for a total of $156. What is the price of one notebook and one pen?

The model solved the problem correctly.
It defined the variables properly, created the correct equations, substituted values correctly, and checked the final answer.
The exact answer was:
- Pen = 18/7 dollars
- Notebook = 46/7 dollars
As decimals, this is approximately:
- Pen ≈ $2.57
- Notebook ≈ $6.57
The values correctly add up to the total of $156.
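You can confirm that arithmetic quickly in the terminal by working in sevenths of a dollar, which keeps everything in integers:

```shell
# pen = 18/7 dollars, notebook = 46/7 dollars; multiply the totals through by 7.
echo $(( 12 * 46 + 30 * 18 ))   # Monday's revenue in sevenths of a dollar: 1092
echo $(( 156 * 7 ))             # expected total in the same units: 1092
```

Both expressions evaluate to 1092, so 12 notebooks plus 30 pens does add up to exactly $156.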
Test 4: Full Python project generation
Finally, test whether the model can generate a complete beginner-friendly coding project.
Use this prompt:
Create a complete beginner-friendly Python project called Expense Tracker CLI.
Requirements:
- Use only Python standard libraries.
- Create a command-line app where users can add expenses, view all expenses, filter expenses by category, and see the total spending.
- Store expenses in a local JSON file called expenses.json.
- Include a clear file structure.
- Provide the full code for each file.
- Add comments where helpful.
- Include setup instructions and example commands to run the app.
- Keep the code clean, simple, and easy to understand.

The response looked complete at first, and the project structure made sense. However, the generated code had several serious issues.
The output included:
- Broken function names
- Spelling errors in variables
- Invalid Python syntax
- Broken f-strings
- Inconsistent file names
- Code that would not run without manual debugging
For a beginner-friendly project, this is a major problem. A beginner should be able to copy, run, and understand the code with minimal fixes. In this case, the generated project would need significant debugging before it could be used.
Overall evaluation of the local DeepSeek V4 Flash
After testing DeepSeek V4 Flash on UI generation, writing, reasoning, and project generation, the model showed mixed results.
It performed well on structured reasoning and basic explanatory writing. It was also able to generate outputs quickly through the llama.cpp Web UI.
However, it struggled with polished frontend design and reliable full-project code generation. The Python project output looked complete but contained too many syntax and naming errors to be useful without manual debugging.
| Task | Performance |
| --- | --- |
| UI generation | Average |
| Writing and explanation | Good |
| Math reasoning | Strong |
| Full project generation | Weak |
| Speed | Good |
| Overall reliability | Mixed |
Final Thoughts
Running DeepSeek V4 Flash locally was honestly a nightmare.
I first tried running it on a 4x H100 setup using an SGLang Docker Compose configuration, but it failed. I then tried vLLM on a 4x H100 RunPod pod through Python, and that failed as well. The errors kept pointing to missing DeepSeek V4 support in transformers, even though I was already using the latest version. This made it clear that proper framework support is still not there.
Even the official Hugging Face model page does not provide a simple, standard inference example. Instead, it points users toward a custom torchrun approach, which is much heavier and takes more work to set up.
I also tested community-provided GGUF files, but ran into llama.cpp compatibility issues. Usually, I prefer using Unsloth GGUF files because they are fast, reliable, and easy to run, but for DeepSeek V4 Flash, there was no simple plug-and-play path.
After all that testing, the method shown in this guide was the easiest and most reliable way I found to run the full model locally. It still depends on a community GGUF file and a modified llama.cpp build, but compared with the other options, this setup actually worked.
That said, I do not think DeepSeek V4 Flash is worth running locally right now. The setup is too painful, the framework support is still immature, and the output quality does not justify the effort.
If you want a smoother local model experience, I would recommend trying models like MiniMax M2.7 or strong quantized models such as Qwen3.6-27B instead. They are easier to run, better supported across major frameworks, faster in practice, and often produce higher-quality results with far less setup frustration.
Running DeepSeek V4 Flash Locally FAQs
Do I need a Hugging Face token to download the model?
It is not strictly required, but having your HF_TOKEN set enables authenticated downloads via hf_transfer, which can reach speeds around 2 GB/s. This makes downloading a 146GB GGUF file far more practical.
Is DeepSeek V4 Flash worth running locally right now?
Not yet for most users. Framework support is still immature, the setup requires a community fork and a custom GGUF, and output quality is mixed. Models like MiniMax M2.7 or Qwen3.6-27B offer a smoother local experience at this stage.
What does the --fit on flag do in the llama-server command?
It automatically distributes the model layers across available GPU and CPU memory so the model fits even if it exceeds GPU VRAM alone, avoiding out-of-memory errors during load.


