How to run DeepSeek-V3-0324 locally using our dynamic quants, which recover accuracy
DeepSeek is at it again! After releasing V3, R1 Zero and R1 back in December 2024 and January 2025, DeepSeek updated their checkpoints / models for V3, and released a March update!
According to DeepSeek, MMLU-Pro jumped +5.3% to 81.2%, GPQA +9.3%, AIME +19.8% and LiveCodeBench +10.0% (all in percentage points)! They provided a plot showing how the new checkpoint compares to the previous V3 checkpoint and to other models like GPT 4.5 and Claude Sonnet 3.7. But how do we run a 671 billion parameter model locally?
MoE Bits | Type | Disk Size | Accuracy | Link | Details |
---|---|---|---|---|---|
1.78bit | IQ1_S | 173GB | Ok | Link | 2.06/1.56bit |
1.93bit | IQ1_M | 183GB | Fair | Link | 2.5/2.06/1.56bit |
2.42bit | IQ2_XXS | 203GB | Suggested | Link | 2.5/2.06bit |
2.71bit | Q2_K_XL | 231GB | Suggested | Link | 3.5/2.5bit |
3.5bit | Q3_K_XL | 320GB | Great | Link | 4.5/3.5bit |
4.5bit | Q4_K_XL | 406GB | Best | Link | 5.5/4.5bit |
DeepSeek V3's original upload is in float8, which takes 715GB. Using Q4_K_M halves the file size to 404GB or so, and our dynamic 1.78bit quant fits in around 151GB. I suggest using our 2.7bit quant to balance size and accuracy! The 2.4bit one also works well!
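As a rough sanity check on these sizes, file size scales with the average number of bits stored per weight. The sketch below assumes a flat 671 billion parameters (the figure quoted above), all stored at the listed MoE bit-width. Real dynamic quants keep some layers at higher precision, so actual files run larger. `estimate_size_gb` is a hypothetical helper written for this illustration, not part of any library.

```python
# Back-of-envelope size estimate.
# Assumption: every one of the 671B parameters is stored at the same
# average bit-width, which real dynamic quants do not do.
N_PARAMS = 671e9  # parameter count quoted in this article


def estimate_size_gb(bits_per_weight: float, n_params: float = N_PARAMS) -> float:
    """Return the approximate file size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in [("float8", 8.0), ("UD-Q2_K_XL", 2.71), ("UD-IQ1_S", 1.78)]:
        print(f"{name}: ~{estimate_size_gb(bits):.0f} GB")
```

Under these assumptions the 1.78bit estimate lands a little under 150GB, in the same range as the figure quoted above.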
According to DeepSeek, these are the recommended settings for inference:
Chat template: <|User|>Create a simple playable Flappy Bird Game in Python. Place the final game inside of a markdown section.<|Assistant|>
BOS token: <|begin▁of▁sentence|> is auto added during tokenization (do NOT add it manually!)
System prompt (in Chinese): 该助手为DeepSeek Chat,由深度求索公司创造。\n今天是3月24日,星期一。
which translates to: The assistant is DeepSeek Chat, created by DeepSeek.\nToday is Monday, March 24th.
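The template above can be assembled programmatically. This is a minimal sketch: the token strings are copied as they appear in this article, `build_prompt` is a hypothetical helper, and placing the system prompt directly before the first user tag is an assumption rather than something stated here.

```python
# Sketch of assembling a single-turn prompt in the format shown above.
# The token strings are copied as they appear in this article.
BOS = "<|begin▁of▁sentence|>"  # added during tokenization; never prepend it yourself
USER = "<|User|>"
ASSISTANT = "<|Assistant|>"


def build_prompt(user_message: str, system_prompt: str = "") -> str:
    """Return system prompt (if any) + user turn + assistant tag, without BOS."""
    # Assumption: the system prompt goes directly before the first user tag.
    prompt = f"{system_prompt}{USER}{user_message}{ASSISTANT}"
    if prompt.startswith(BOS):
        raise ValueError("Do not add the BOS token manually; tokenization adds it.")
    return prompt


if __name__ == "__main__":
    print(build_prompt("Create a simple playable Flappy Bird Game in Python."))
```

The guard against a leading BOS token is there because a doubled BOS is an easy mistake to make when prompts are built by string concatenation.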
Obtain llama.cpp from GitHub (https://github.com/ggml-org/llama.cpp). You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
NOTE: using -DGGML_CUDA=ON for GPUs might take 5 minutes to compile. CPU only takes 1 minute to compile. You might be interested in llama.cpp's precompiled binaries.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-quantize llama-cli llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
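If you script the build, the configure step can be parameterized on whether CUDA is wanted. The sketch below only uses flags that appear in this article (the CUDA architecture and Flash Attention flags come from the notes near the end). `cmake_configure_args` is a hypothetical helper, and the command is printed rather than executed.

```python
# Sketch: assemble the cmake configure command from the build steps above.
# Only flags that appear in this article are used.
import shlex


def cmake_configure_args(use_cuda=True, cuda_arch=None, fa_all_quants=False):
    """Return the cmake configure command as an argument list."""
    args = [
        "cmake", "llama.cpp", "-B", "llama.cpp/build",
        "-DBUILD_SHARED_LIBS=OFF",
        f"-DGGML_CUDA={'ON' if use_cuda else 'OFF'}",
        "-DLLAMA_CURL=ON",
    ]
    # The CUDA-specific options only make sense for a CUDA build.
    if use_cuda and cuda_arch:
        args.append(f"-DCMAKE_CUDA_ARCHITECTURES={cuda_arch}")
    if use_cuda and fa_all_quants:
        args.append("-DGGML_CUDA_FA_ALL_QUANTS=ON")
    return args


if __name__ == "__main__":
    # Print the CPU-only variant; pass the list to subprocess.run to execute it.
    print(shlex.join(cmake_configure_args(use_cuda=False)))
```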
Download the model (after installing the dependencies with pip install huggingface_hub hf_transfer). You can choose UD-IQ1_S (dynamic 1.78bit quant) or other quantized versions like Q4_K_M. I recommend using our 2.7bit dynamic quant UD-Q2_K_XL to balance size and accuracy. More versions at: https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF
# !pip install huggingface_hub hf_transfer
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/DeepSeek-V3-0324-GGUF",
    local_dir = "unsloth/DeepSeek-V3-0324-GGUF",
    allow_patterns = ["*UD-Q2_K_XL*"], # Dynamic 2.7bit (230GB). Use "*UD-IQ1_S*" for Dynamic 1.78bit (151GB)
)
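The download is split into several GGUF shards, and the llama-cli commands in this article pass the first one (the file ending in -00001-of-00006.gguf) as --model. A small sketch for locating that file after the download follows. The shard naming pattern is inferred from that path, and `first_shard` and `find_first_shard` are hypothetical helpers.

```python
# Sketch: locate the first shard of a split GGUF download.
# Assumption: shards are named "...-NNNNN-of-MMMMM.gguf", as in the
# model path used by the llama-cli commands in this article.
import re
from pathlib import Path

SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def first_shard(paths):
    """Return the path of shard 00001 among split GGUF files, or None."""
    for p in sorted(paths):
        match = SHARD_RE.search(str(p))
        if match and int(match.group(1)) == 1:
            return p
    return None


def find_first_shard(local_dir, quant="UD-Q2_K_XL"):
    """Search local_dir recursively for the first shard of the given quant."""
    return first_shard(Path(local_dir).rglob(f"*{quant}*.gguf"))


if __name__ == "__main__":
    print(find_first_shard("unsloth/DeepSeek-V3-0324-GGUF"))
```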
Edit --threads 32 for the number of CPU threads, --ctx-size 16384 for the context length, and --n-gpu-layers 2 for the number of layers to offload to the GPU. Try adjusting it if your GPU goes out of memory, and remove it for CPU-only inference.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
--cache-type-k q8_0 \
--threads 20 \
--n-gpu-layers 2 \
-no-cnv \
--prio 3 \
--temp 0.3 \
--min_p 0.01 \
--ctx-size 4096 \
--seed 3407 \
--prompt "<|User|>Create a Flappy Bird game in Python. You must include these things:\n1. You must use pygame.\n2. The background color should be randomly chosen and is a light shade. Start with a light blue color.\n3. Pressing SPACE multiple times will accelerate the bird.\n4. The bird's shape should be randomly chosen as a square, circle or triangle. The color should be randomly chosen as a dark color.\n5. Place on the bottom some land colored as dark brown or yellow chosen randomly.\n6. Make a score shown on the top right side. Increment if you pass pipes and don't hit them.\n7. Make randomly spaced pipes with enough space. Color them randomly as dark green or light brown or a dark gray shade.\n8. When you lose, show the best score. Make the text inside the screen. Pressing q or Esc will quit the game. Restarting is pressing SPACE again.\nThe final game should be inside a markdown section in Python. Check your code for errors and fix them before the final markdown section.<|Assistant|>"
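When the same command is reused with different thread counts, context lengths or offloading settings, it can help to build the argument list in code. This is a sketch that mirrors the flags of the command above. `llama_cli_args` is a hypothetical helper, and the command is printed rather than run.

```python
# Sketch: build the llama-cli argument list used above.
# Flags and default values are copied from the command in this article.
import shlex


def llama_cli_args(model, prompt, threads=20, ctx_size=4096, n_gpu_layers=2,
                   temp=0.3, min_p=0.01, seed=3407):
    """Return the llama-cli command as an argument list."""
    args = [
        "./llama.cpp/llama-cli",
        "--model", model,
        "--cache-type-k", "q8_0",
        "--threads", str(threads),
        "-no-cnv",
        "--prio", "3",
        "--temp", str(temp),
        "--min_p", str(min_p),
        "--ctx-size", str(ctx_size),
        "--seed", str(seed),
        "--prompt", prompt,
    ]
    # None or 0 means CPU-only inference: leave the flag out entirely.
    if n_gpu_layers:
        args += ["--n-gpu-layers", str(n_gpu_layers)]
    return args


if __name__ == "__main__":
    model = "unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf"
    cmd = llama_cli_args(model, "<|User|>Hello<|Assistant|>", n_gpu_layers=None)
    print(shlex.join(cmd))  # pass cmd to subprocess.run to execute it
```

Passing the list to subprocess.run avoids shell quoting problems with long prompts.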
If we run the above, we get the old 2 bit on the left (seizure warning sorry!)
We also test our dynamic quants with a test from r/LocalLLaMA, which asks the model to create a basic physics engine that simulates balls bouncing inside a moving, enclosed heptagon.
The goal is to make the heptagon spin, and the balls in the heptagon should move.
./llama.cpp/llama-cli \
--model unsloth/DeepSeek-V3-0324-GGUF/UD-Q2_K_XL/DeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf \
--cache-type-k q8_0 \
--threads 20 \
--n-gpu-layers 2 \
-no-cnv \
--prio 3 \
--temp 0.3 \
--min_p 0.01 \
--ctx-size 4096 \
--seed 3407 \
--prompt "<|User|>Write a Python program that shows 20 balls bouncing inside a spinning heptagon:\n- All balls have the same radius.\n- All balls have a number on it from 1 to 20.\n- All balls drop from the heptagon center when starting.\n- Colors are: #f8b862, #f6ad49, #f39800, #f08300, #ec6d51, #ee7948, #ed6d3d, #ec6800, #ec6800, #ee7800, #eb6238, #ea5506, #ea5506, #eb6101, #e49e61, #e45e32, #e17b34, #dd7a56, #db8449, #d66a35\n- The balls should be affected by gravity and friction, and they must bounce off the rotating walls realistically. There should also be collisions between balls.\n- The material of all the balls determines that their impact bounce height will not exceed the radius of the heptagon, but higher than ball radius.\n- All balls rotate with friction, the numbers on the ball can be used to indicate the spin of the ball.\n- The heptagon is spinning around its center, and the speed of spinning is 360 degrees per 5 seconds.\n- The heptagon size should be large enough to contain all the balls.\n- Do not use the pygame library; implement collision detection algorithms and collision response etc. by yourself. The following Python libraries are allowed: tkinter, math, numpy, dataclasses, typing, sys.\n- All codes should be put in a single Python file.<|Assistant|>"
Non Dynamic 2bit. Fails - SEIZURE WARNING again!
Dynamic 2bit. Actually solves the heptagon puzzle correctly!!
Original float8
The dynamic 2.7bit quant, which is only 230GB in size, actually manages to solve the heptagon puzzle! The full output for all 3 versions (including full fp8) is below:
Dynamic 2bit Heptagon code
The commands above use q8_0 cache quantization (--cache-type-k q8_0). The goal of quantization is to support longer context lengths since the KV cache uses quite a bit of memory.
We found the down_proj in this model to be extremely sensitive to quantization. We had to redo some of our dynamic quants which used 2 bits for down_proj, and now we use 3 bits as the minimum for all these matrices.
Using llama.cpp's Flash Attention backend does result in somewhat faster decoding speeds. Use -DGGML_CUDA_FA_ALL_QUANTS=ON when compiling. It's also best to set your CUDA architecture, as found in https://developer.nvidia.com/cuda-gpus, to reduce compilation times; set it via -DCMAKE_CUDA_ARCHITECTURES="80"
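To put a number on the cache quantization note: q8_0 stores blocks of 32 values as 32 one-byte integers plus one 2-byte scale, so a q8_0 K cache takes roughly 53% of the space of an f16 one. The sketch below is generic arithmetic only. The layer count and per-token K width in the example are made-up placeholders, not DeepSeek-V3's real dimensions, and `k_cache_bytes` is a hypothetical helper.

```python
# Sketch: relative K-cache size for different cache types.
# q8_0 and q4_0 store blocks of 32 values plus one 2-byte scale per block.
BYTES_PER_ELEMENT = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 x int8 + 2-byte scale
    "q4_0": 18 / 32,  # 32 x 4-bit + 2-byte scale
}


def k_cache_bytes(n_layers, n_ctx, k_dim, cache_type="f16"):
    """Bytes needed to cache k_dim values per token per layer for n_ctx tokens."""
    return n_layers * n_ctx * k_dim * BYTES_PER_ELEMENT[cache_type]


if __name__ == "__main__":
    # Placeholder dimensions for illustration only (NOT the real model's).
    n_layers, n_ctx, k_dim = 60, 4096, 4096
    f16 = k_cache_bytes(n_layers, n_ctx, k_dim, "f16")
    q8 = k_cache_bytes(n_layers, n_ctx, k_dim, "q8_0")
    print(f"q8_0 / f16 = {q8 / f16:.3f}")
```

For a fixed memory budget, the ratio is independent of the model dimensions, so the same saving applies at any context length.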
Using min_p=0.01 is probably enough. llama.cpp defaults to 0.1, which is probably not necessary. Since a temperature of 0.3 is used anyway, low-probability tokens are already very unlikely to be sampled, so removing very unlikely tokens is a good idea. DeepSeek recommends 0.0 temperature for coding tasks.
References: https://docs.unsloth.ai/basics/tutorial-how-to-run-deepseek-v3-0324-locally
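To make the min_p note concrete: min-p sampling drops every token whose probability is below min_p times the probability of the most likely token, then samples from what is left. The sketch below is a plain-Python illustration, not llama.cpp's code. Applying min-p before temperature is an assumption here, since sampler order varies between implementations and configurations. `sample_probs` is a hypothetical helper.

```python
# Sketch of min-p filtering followed by temperature scaling.
# Assumption: min-p is applied first, then temperature on the survivors.
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sample_probs(logits, min_p=0.01, temperature=0.3):
    """Return final sampling probabilities; filtered tokens get 0.0."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 here; 0.0 means greedy decoding")
    probs = softmax(logits)
    threshold = min_p * max(probs)  # cutoff relative to the top token
    keep = [p >= threshold for p in probs]
    kept_logits = [x / temperature for x, k in zip(logits, keep) if k]
    kept_probs = iter(softmax(kept_logits))
    return [next(kept_probs) if k else 0.0 for k in keep]


if __name__ == "__main__":
    print(sample_probs([2.0, 1.0, 0.0, -8.0]))
```

A low temperature already concentrates probability on the top tokens, which is why a small min_p cutoff is enough to remove the remaining tail.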