Tutorial: How to Run DeepSeek-V3-0324 Locally

March 28, 2025

How to run DeepSeek-V3-0324 locally using our dynamic quants, which recover accuracy

DeepSeek is at it again! After releasing V3, R1 Zero and R1 back in December 2024 and January 2025, DeepSeek updated their checkpoints / models for V3, and released a March update!

According to DeepSeek, MMLU-Pro jumped +5.3 points to 81.2%, GPQA gained +9.3 points, AIME +19.8 points and LiveCodeBench +10.0 points! They provided a plot showing how the new checkpoint compares to the previous V3 checkpoint and to other models like GPT 4.5 and Claude Sonnet 3.7. But how do we run a 671 billion parameter model locally?

MoE Bits	Type	Disk Size	Accuracy	Link	Details
1.78bit	IQ1_S	173GB	Ok	Link	2.06/1.56bit
1.93bit	IQ1_M	183GB	Fair	Link	2.5/2.06/1.56
2.42bit	IQ2_XXS	203GB	Suggested	Link	2.5/2.06bit
2.71bit	Q2_K_XL	231GB	Suggested	Link	3.5/2.5bit
3.5bit	Q3_K_XL	320GB	Great	Link	4.5/3.5bit
4.5bit	Q4_K_XL	406GB	Best	Link	5.5/4.5bit

DeepSeek V3's original upload is in float8, which takes 715GB. Using Q4_K_M halves the file size to 404GB or so, and our dynamic 1.78bit quant fits in around 151GB. I suggest using our 2.7bit quant to balance size and accuracy! The 2.4bit one also works well!
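As a rough sanity check on the sizes above, file size can be approximated as parameters times bits per weight, divided by 8. The sketch below is only a back-of-the-envelope estimate: it assumes every weight uses the same bit width, whereas the dynamic quants mix several widths (see the Details column) and the files also carry metadata, so the real sizes differ. The helper name `estimate_size_gb` is made up for this illustration.

```python
def estimate_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough size in GB (10^9 bytes), assuming one uniform bit width for all weights."""
    return num_params * bits_per_weight / 8 / 1e9


DEEPSEEK_V3_PARAMS = 671e9  # 671 billion parameters, as stated above

# Bit widths taken from the table above; 8 corresponds to the float8 upload.
for bits in (8, 4.5, 2.71, 1.78):
    size = estimate_size_gb(DEEPSEEK_V3_PARAMS, bits)
    print(f"{bits:>5} bits/weight -> ~{size:.0f} GB")
```

The estimates land in the same ballpark as the listed disk sizes, which is all this arithmetic is meant to show.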

⚙️ Official Recommended Settings

According to DeepSeek, these are the recommended settings for inference:

Temperature of 0.3 (Maybe 0.0 for coding as seen here)
Min_P of 0.00 (optional, but 0.01 works well, llama.cpp default is 0.1)
Chat template: <｜User｜>Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.<｜Assistant｜>
A BOS token of <｜begin▁of▁sentence｜> is auto added during tokenization (do NOT add it manually!)
DeepSeek mentioned using a system prompt as well (optional) - it's in Chinese: 该助手为DeepSeek Chat，由深度求索公司创造。\n今天是3月24日，星期一。 which translates to: The assistant is DeepSeek Chat, created by DeepSeek.\nToday is Monday, March 24th.
For KV cache quantization, use 8bit, NOT 4bit - we found 4bit to do noticeably worse.
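The prompt format above can be assembled with a small helper. This is a minimal sketch, not an official implementation: the function names `system_prompt` and `build_prompt` and the `WEEKDAYS_ZH` table are hypothetical, and placing the optional system prompt directly before <｜User｜> is an assumption about the template. The BOS token is deliberately left out because it is added automatically during tokenization.

```python
from datetime import date
from typing import Optional

# Chinese weekday names, indexed by date.weekday() (Monday == 0).
WEEKDAYS_ZH = ["星期一", "星期二", "星期三", "星期四", "星期五", "星期六", "星期日"]


def system_prompt(today: date) -> str:
    """Fill today's date into the optional Chinese system prompt quoted above."""
    weekday = WEEKDAYS_ZH[today.weekday()]
    return (
        "该助手为DeepSeek Chat，由深度求索公司创造。\n"
        f"今天是{today.month}月{today.day}日，{weekday}。"
    )


def build_prompt(user_message: str, system: Optional[str] = None) -> str:
    """Wrap a user message in the chat template. No BOS token is added here."""
    # Assumption: the system prompt, if any, goes directly before <｜User｜>.
    prefix = system or ""
    return f"{prefix}<｜User｜>{user_message}<｜Assistant｜>"


print(build_prompt("Create a simple playable Flappy Bird Game in Python."))
```

Calling `build_prompt(message, system_prompt(date.today()))` would prepend the dated system prompt; omit the second argument to skip it.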

📖 Tutorial: How to Run DeepSeek-V3 in llama.cpp

Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.

NOTE: building with -DGGML_CUDA=ON for GPUs might take 5 minutes to compile. A CPU-only build takes about 1 minute. You might also be interested in llama.cpp's precompiled binaries.

apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp

Download the model with the script below (after installing the dependencies via pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (dynamic 1.78bit quant) or other quantized versions like Q4_K_M. I recommend using our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF

# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/DeepSeek-V3-0324-GGUF",
    local_dir = "unsloth/DeepSeek-V3-0324-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"], # Dynamic 2.7bit (230GB). Use "*UD-IQ1_S*" for Dynamic 1.78bit (151GB)
)
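Once the download finishes, llama-cli needs the path to the first shard of the split GGUF. The sketch below shows one way to locate it. It assumes the shards follow the `-00001-of-0000N.gguf` naming seen in the run command further down; the helper name `first_shard` is hypothetical.

```python
from pathlib import Path


def first_shard(local_dir: str, quant: str) -> Path:
    """Return the first GGUF shard for a quant (e.g. "UD-Q2_K_XL") under local_dir."""
    # "**/" searches local_dir itself and every subdirectory.
    matches = sorted(Path(local_dir).glob(f"**/*{quant}*-00001-of-*.gguf"))
    if not matches:
        raise FileNotFoundError(f"No first shard for {quant!r} found in {local_dir!r}")
    return matches[0]


if __name__ == "__main__":
    try:
        print(first_shard("unsloth/DeepSeek-V3-0324-GGUF", "UD-Q2_K_XL"))
    except FileNotFoundError as err:
        print(err)  # nothing downloaded yet
```

Raising an error when nothing matches makes a typo in the quant name, or an incomplete download, show up before llama-cli is launched.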

Run Unsloth's Flappy Bird test as described in our 1.58bit Dynamic Quant for DeepSeek R1.
Edit --threads 20 to match your number of CPU threads, --ctx-size 4096 to set the context length, and --n-gpu-layers 2 to set how many layers are offloaded to the GPU. Try lowering --n-gpu-layers if your GPU runs out of memory, and remove it entirely for CPU-only inference.

./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
    --cache-type-k q8_0 \
    --threads 20 \
    --n-gpu-layers 2 \
    -no-cnv \
    --prio 3 \
    --temp 0.3 \
    --min_p 0.01 \
    --ctx-size 4096 \
    --seed 3407 \
    --prompt "<｜User｜>Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<｜Assistant｜>"
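If you launch llama-cli from a script, the same flags can be assembled programmatically, which makes it easier to drop --n-gpu-layers on CPU-only machines. This is a sketch under assumptions: `build_llama_cli_args` is a hypothetical helper, it only uses the flags shown in the command above, and its defaults mirror that command.

```python
from typing import List, Optional


def build_llama_cli_args(
    model_path: str,
    prompt: str,
    threads: int = 20,
    ctx_size: int = 4096,
    n_gpu_layers: Optional[int] = 2,
) -> List[str]:
    """Assemble the llama-cli argument list used in the command above."""
    args = [
        "./llama.cpp/llama-cli",
        "--model", model_path,
        "--cache-type-k", "q8_0",  # 8bit K cache, as recommended above
        "--threads", str(threads),
        "-no-cnv",
        "--prio", "3",
        "--temp", "0.3",
        "--min_p", "0.01",
        "--ctx-size", str(ctx_size),
        "--seed", "3407",
        "--prompt", prompt,
    ]
    if n_gpu_layers is not None:
        # Leave this flag out entirely for CPU-only inference.
        args += ["--n-gpu-layers", str(n_gpu_layers)]
    return args


cpu_only = build_llama_cli_args("model.gguf", "<｜User｜>Hello<｜Assistant｜>", n_gpu_layers=None)
print(" ".join(cpu_only[:6]))
```

The resulting list can be passed to `subprocess.run(args, check=True)`; passing a list rather than a shell string avoids having to escape the quotes in the prompt.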

If we run the above, the old (non-dynamic) 2bit quant produces the result on the left (seizure warning, sorry!)

🎱 Heptagon Test

We also test our dynamic quants with the heptagon test from r/LocalLLaMA, which asks the model to create a basic physics engine that simulates balls bouncing inside a spinning, enclosed heptagon.

The goal is to make the heptagon spin, and the balls in the heptagon should move.

./llama.cpp/llama-cli \
    --model unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
    --cache-type-k q8_0 \
    --threads 20 \
    --n-gpu-layers 2 \
    -no-cnv \
    --prio 3 \
    --temp 0.3 \
    --min_p 0.01 \
    --ctx-size 4096 \
    --seed 3407 \
    --prompt "<｜User｜>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<｜Assistant｜>"

Non Dynamic 2bit. Fails - SEIZURE WARNING again!

unsloth-q2_k_rotate.txt

Dynamic 2bit. Actually solves the heptagon puzzle correctly!!

unsloth-q2_k_xl_rotate.txt

Original float8

fp8-heptagon.txt

The dynamic 2.7bit quant, which is only 230GB in size, actually manages to solve the heptagon puzzle! The full output for all 3 versions (including full fp8) is below:

Dynamic 2bit Heptagon code

🕵️ Extra Findings & Tips

In our empirical tests, lower-precision KV cache quantization (4bit) seems to degrade generation quality. More tests need to be done, but we suggest using q8_0 cache quantization. The goal of KV cache quantization is to support longer context lengths, since the KV cache uses quite a bit of memory.
We found the down_proj in this model to be extremely sensitive to quantization. We had to redo some of our dynamic quants which used 2bits for down_proj, and we now use 3bits as the minimum for all these matrices.
Using llama.cpp's Flash Attention backend results in somewhat faster decoding speeds. Use -DGGML_CUDA_FA_ALL_QUANTS=ON when compiling. To reduce compilation times, look up your GPU's CUDA architecture at https://developer.nvidia.com/cuda-gpus and set it via, for example, -DCMAKE_CUDA_ARCHITECTURES="80"
Using min_p=0.01 is probably enough. llama.cpp defaults to 0.1, which is probably not necessary. Since a temperature of 0.3 is used anyway, we are already unlikely to sample low-probability tokens, so removing only the very unlikely ones is a good idea. DeepSeek recommends a temperature of 0.0 for coding tasks.
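To make the min_p setting concrete, here is a simplified illustration of what min_p filtering does: tokens whose probability falls below min_p times the top token's probability are dropped, and the remainder is renormalized. This is a sketch of the general technique on a plain probability list, not llama.cpp's actual sampler code; `min_p_filter` is a hypothetical name.

```python
from typing import List


def min_p_filter(probs: List[float], min_p: float) -> List[float]:
    """Zero out tokens below min_p * max(probs), then renormalize the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # always > 0 because the top token is kept
    return [p / total for p in kept]


# Toy distribution over four tokens.
probs = [0.5, 0.3, 0.15, 0.05]
print(min_p_filter(probs, 0.01))  # threshold 0.005: every token survives
print(min_p_filter(probs, 0.2))   # threshold 0.1: the last token is removed
```

With a small value such as 0.01, only tokens far less likely than the top choice are removed, which is why it pairs well with the low temperature recommended above.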