Run DeepSeek V4 Flash Locally with ds4.c — antirez's Purpose-Built Inference Engine
What is ds4.c?
ds4.c (DwarfStar 4) is a small, native inference engine purpose-built for DeepSeek V4 Flash. Created by antirez (Salvatore Sanfilippo, the creator of Redis), it is not a generic GGUF runner or a wrapper around existing runtimes — it is a completely self-contained engine that does one thing and does it well.
Unlike llama.cpp (which it acknowledges as a heavy inspiration), ds4.c is intentionally narrow: it only runs DeepSeek V4 Flash GGUF files that are specifically crafted for this engine. This laser focus allows it to optimize deeply for the model's unique architecture — compressed KV cache, routed MoE quantization, and disk-backed context persistence.
Why is it Trending?
ds4.c shot to over 9,000 GitHub stars in just over a week because it solves a real pain point: running a 284-billion parameter MoE model locally on consumer hardware. Key reasons for its popularity:
- 64% fewer active parameters than equivalent dense models — DeepSeek V4 Flash uses a Mixture-of-Experts architecture with only ~37B active parameters per token
- 1 million token context window — vast context that fits on a single Mac thanks to extreme KV compression
- 2-bit quantization that actually works — asymmetrical quantization (routed experts at IQ2_XXS / Q2_K, shared components untouched) keeps quality high while fitting in 96-128GB of RAM
- Disk-backed KV cache — unique approach treating SSD as a first-class KV citizen, allowing context persistence across sessions
- Made by antirez — Redis creator's reputation brings immediate credibility and curiosity
- Agent-ready — built-in OpenAI-compatible, Anthropic-compatible, and OpenAI Responses API endpoints for Codex, Claude Code, Pi, and OpenCode
Prerequisites
- A Mac with Apple Silicon (M3 Max / M3 Ultra recommended) and 96GB+ RAM, OR a Linux machine with NVIDIA CUDA GPU (DGX Spark/GB10 ideal)
- macOS 14+ or Linux with CUDA 12+
- At least 100GB free disk space for model weights
- Basic familiarity with the command line
- For the 2-bit quantized model: ~81GB RAM available
- For the 4-bit quantized model: 256GB+ RAM
Setup & Installation
1. Clone the repository
git clone https://github.com/antirez/ds4.git
cd ds4
2. Download the model weights
The project provides a convenient download script that fetches GGUFs from Hugging Face:
# For machines with 96-128GB RAM (recommended starting point)
./download_model.sh q2-imatrix
# For machines with 256GB+ RAM
./download_model.sh q4-imatrix
The script downloads from https://huggingface.co/antirez/deepseek-v4-gguf and stores files under ./gguf/. It supports resume with curl -C - for interrupted downloads.
3. Build the engine
# macOS with Metal
make
# Linux with CUDA (DGX Spark / GB10)
make cuda-spark
# Linux with CUDA (other GPUs)
make cuda-generic
# CPU-only (debug/diagnostics only — not recommended for production)
make cpu
4. Verify it works
Run a quick test prompt:
./ds4 -p "Explain what makes MoE architecture efficient in one paragraph."
You should see the model generate a response using the Metal or CUDA backend automatically.
Architecture Overview
The architecture of ds4.c is built around three key design decisions:
1. Model-Aware Engine, Not a General Runner ds4.c is not a generic GGUF loader. It expects DeepSeek V4 Flash GGUFs with a specific tensor layout, quantization mix, and metadata. This lets it hardcode optimizations that a general-purpose runner cannot: the exact layer structure of a 284B MoE model, the compressed KV cache format, and the MTP (Multi-Token Prediction) speculative decoding path.
2. KV Cache as a First-Class Disk Citizen Traditional inference engines keep KV cache in RAM. ds4.c flips this assumption: modern NVMe SSDs are fast enough that the compressed KV cache of DeepSeek V4 can live on disk. The server automatically writes checkpoints at strategic moments (cold start, continued generation, eviction, shutdown) and maps them back by SHA1 of the rendered prompt prefix. This enables context reuse across sessions and server restarts — a crucial feature for agent workflows where the initial prompt is ~25k tokens.
3. Asymmetrical Quantization Strategy The model is quantized asymmetrically: routed MoE expert weights (which account for ~95% of parameters) use aggressive 2-bit quantization (IQ2_XXS for up/gate projections, Q2_K for down projections), while shared experts, embeddings, and routing components are left untouched. The result is a ~81GB model that retains surprising quality — good enough for tool calling, code generation, and multi-turn conversations.
Server Mode (Agent Integration)
ds4.c really shines when used as a local inference server for coding agents:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
Supported endpoints:
- OpenAI API:
POST /v1/chat/completions— works with Pi, OpenCode, and any OpenAI-compatible client - OpenAI Responses:
POST /v1/responses— preferred for Codex CLI - Anthropic API:
POST /v1/messages— works with Claude Code through a simple wrapper
Codex CLI Configuration
Add to ~/.config/opencode/opencode.json or use the TOML provider:
[model_providers.ds4]
name = "DS4"
base_url = "http://127.0.0.1:8000/v1"
wire_api = "responses"
stream_idle_timeout_ms = 1000000
Then run: codex --model deepseek-v4-flash -c model_provider=ds4
Claude Code Wrapper
#!/bin/sh
unset ANTHROPIC_API_KEY
export ANTHROPIC_BASE_URL="http://127.0.0.1:8000"
export ANTHROPIC_AUTH_TOKEN="dsv4-local"
export ANTHROPIC_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
exec "$HOME/.local/bin/claude" "$@"
Thinking Modes
DeepSeek V4 Flash supports three thinking modes:
- Non-thinking — direct answers, faster generation
- Thinking (default) — chain-of-thought reasoning proportional to problem complexity
- Think Max — maximum reasoning effort, only available with sufficient context
The model's thinking section is notably shorter than other reasoning models (often 1/5 the length) and scales naturally with problem difficulty — making it usable in interactive settings where other thinking models would be too slow.
Performance Benchmarks
| Mac | Q | Prompt | Prefill | Generation |
|---|---|---|---|---|
| M3 Max, 128GB | q2 | short | 58.52 t/s | 26.68 t/s |
| M3 Max, 128GB | q2 | 11,709 tok | 250.11 t/s | 21.47 t/s |
| M3 Ultra, 512GB | q2 | short | 84.43 t/s | 36.86 t/s |
| M3 Ultra, 512GB | q2 | 11,709 tok | 468.03 t/s | 27.39 t/s |
| M3 Ultra, 512GB | q4 | short | 78.95 t/s | 35.50 t/s |
| M3 Ultra, 512GB | q4 | 12,018 tok | 448.82 t/s | 26.62 t/s |
| DGX Spark, 128GB | q2 | 7,047 tok | 343.81 t/s | 13.75 t/s |
Verification Checklist
Before using ds4.c in production:
- Model weights download completes without corruption
-
./ds4 -p "test"produces coherent output on Metal/CUDA backend - Server starts and responds to
curl http://127.0.0.1:8000/v1/models - OpenAI-compatible chat completions endpoint returns valid responses
- Disk KV cache directory is populated after first server request
- Thinking mode toggle works (
/think,/nothinkin CLI) - Tool calling works in the target coding agent
Resources
- GitHub: github.com/antirez/ds4
- Hugging Face Models: huggingface.co/antirez/deepseek-v4-gguf
- DeepSeek V4 Flash: deepseek.com
- llama.cpp (inspiration): github.com/ggml-org/llama.cpp
- Official DeepSeek DSML docs: huggingface.co/deepseek-ai