This tutorial demonstrates how to run DeepSeek-V4-Flash model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. The hybrid path splits MXFP4 routed experts between CPU (KT-Kernel cpuinfer) and GPU (sglang kt-num-gpu-experts), enabling deployment on consumer-grade hardware.
- Running DeepSeek-V4-Flash with SGLang and KT-Kernel
Validated Configuration (this tutorial):
- GPU: 1× NVIDIA RTX 5090 (32GB VRAM, SM_120)
- CPU: x86 CPU with AVX512 support
- RAM: ≥256GB system memory
- Storage: ~340GB for model weights
Supported GPU architectures (auto-detected at startup; non-validated configurations should work but have not been benchmarked end-to-end):
| Arch | Compute Cap | MXFP4 MoE | NSA sparse MLA | Validated |
|---|---|---|---|---|
| Hopper (H100 / H200) | SM_90 | triton_kernels | flash_mla wheel | — |
| Datacenter Blackwell (B100 / B200) | SM_100 | trtllm-fp4 | Triton fallback | — |
| Consumer Blackwell (RTX 5090) | SM_120 | triton_kernels | Triton fallback | ✓ |
| Ada Lovelace (RTX 4090 / L20 / L40) | SM_89 | triton_kernels | Triton fallback | ✓ |
| Ampere (A100 / A6000) | SM_80 / SM_86 | triton_kernels | Triton fallback | ✗ (not supported) |
-
KT-Kernel installed:
git clone https://github.com/kvcache-ai/ktransformers.git cd ktransformers git submodule update --init --recursive cd kt-kernel && ./install.sh
-
SGLang installed (kvcache-ai fork):
./install.sh # from ktransformers root -
CUDA 12.8+ and flashinfer ≥ 0.6.9 (
flashinfer-pythonandflashinfer-cubinmust be the same version):pip install --upgrade flashinfer-python flashinfer-cubin
This upgrade is required (even though
sglang-ktpinsflashinfer_python==0.6.3) because V4-Flash's MXFP4 MoE module importsmxfp8_quantize,trtllm_fp4_block_scale_routed_moe, etc., which only exist in flashinfer ≥ 0.6.9. -
transformers==4.57.1 (V4-Flash is incompatible with the 5.x series):
pip install "transformers==4.57.1"transformers5.x adds default-valued fields toPretrainedConfigthat makeDeepSeekV4Config's dataclass declaration raiseTypeError: non-default argument 'quantization_config' follows default argumentat import time.sglang-kt's pyproject does not pintransformers, so a freshpip installwill pull the latest 5.x and break server startup; pinning explicitly to4.57.1is required until the upstream fix lands. -
tilelang (manual install — required for the NSA sparse-MLA tilelang indexer path used on non-Hopper GPUs):
pip install tilelang
sglang-kt's pyproject does not declaretilelangas a dependency, sopip install ./python[all]will not pull it in. Validated withtilelang==0.1.8.
mkdir -p /path/to/models
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
--local-dir /path/to/models/DeepSeek-V4-FlashThis step is only needed if you want to run the CPU experts in AMXINT4 mode instead (e.g., on Intel Xeon with AMX where INT4 is preferred over MXFP4).
For a 4-NUMA system with 64 physical cores assigned to CPU inference:
cd /path/to/ktransformers/kt-kernel
python scripts/convert_cpu_weights_ds4.py \
--input-path /path/to/models/DeepSeek-V4-Flash \
--input-type fp4 \
--output /path/to/models/DeepSeek-V4-Flash-AMXINT4 \
--quant-method int4 \
--cpuinfer-threads 64 \
--threadpool-count 4 \
--no-merge-safetensorThe script auto-detects model_type=deepseek_v4 and expert_dtype=fp4 from config.json, dequantizes the MXFP4 routed experts (group size 32) on GPU, and re-quantizes them to AMX-INT4 layout on CPU. Both HF (model.layers.{L}.mlp.experts.{E}.{proj}.weight) and V4 inference (layers.{L}.ffn.experts.{E}.{w1,w2,w3}.weight) key formats are supported.
To use the converted weights, replace the relevant flags in Step 3's launch command:
--kt-weight-path /path/to/models/DeepSeek-V4-Flash-AMXINT4 \
--kt-method AMXINT4 \export FLASHINFER_CUDA_ARCH_LIST=12.0a
export TORCH_CUDA_ARCH_LIST="12.0+PTX"
export SGLANG_DSV4_MODE=2604
export SGLANG_DSV4_2604_SUBMODE=2604B
numactl --interleave=all python -m sglang.launch_server \
--host 0.0.0.0 --port 30000 \
--model /path/to/models/DeepSeek-V4-Flash \
--kt-weight-path /path/to/models/DeepSeek-V4-Flash \
--kt-method MXFP4 \
--kt-num-gpu-experts 10 \
--kt-cpuinfer 60 \
--kt-threadpool-count 2 \
--kt-gpu-prefill-token-threshold 4096 \
--kt-enable-dynamic-expert-update \
--tensor-parallel-size 1 \
--context-length 16384 \
--attention-backend flashinfer \
--mem-fraction-static 0.85 \
--chunked-prefill-size 2048 \
--max-prefill-tokens 2048 \
--max-running-requests 2 \
--watchdog-timeout 1200 \
--disable-shared-experts-fusion \
--trust-remote-code \
--cuda-graph-bs 1 \
--cuda-graph-max-bs 1 \
--disable-radix-cache \
--skip-server-warmupDecode throughput: 20+ tok/s on a single RTX 5090.
It takes about 4-5 minutes to start the server (weight load + CUDA Graph capture).
See KT-Kernel Parameters for detailed parameter tuning guidelines.
V4-Flash ships a NextN draft head that can be run as EAGLE-style speculative decoding for ~1.2× throughput on single-request decode (validated 26.5 → 32.74 tok/s on 8× RTX 5090, 90% accept rate at chain depth 1).
Append the following flags to the launch command above:
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
--speculative-moe-runner-backend auto \curl -s -X POST http://127.0.0.1:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "Explain quantum computing in detail:",
"sampling_params": {"temperature": 0.0, "max_new_tokens": 256}
}'The kt CLI ships with an OpenAI-compatible chat client that talks to the SGLang server's /v1/chat/completions endpoint:
kt chat --host 127.0.0.1 --port 30000 --temperature 0.7 --max-tokens 2048