Skip to content

Latest commit

 

History

History
188 lines (143 loc) · 7.4 KB

File metadata and controls

188 lines (143 loc) · 7.4 KB

Running DeepSeek-V4-Flash with SGLang and KT-Kernel

This tutorial demonstrates how to run DeepSeek-V4-Flash model inference using SGLang integrated with KT-Kernel for CPU-GPU heterogeneous inference. The hybrid path splits MXFP4 routed experts between CPU (KT-Kernel cpuinfer) and GPU (sglang kt-num-gpu-experts), enabling deployment on consumer-grade hardware.

Table of Contents

Hardware Requirements

Validated Configuration (this tutorial):

  • GPU: 1× NVIDIA RTX 5090 (32GB VRAM, SM_120)
  • CPU: x86 CPU with AVX512 support
  • RAM: ≥256GB system memory
  • Storage: ~340GB for model weights

Supported GPU architectures (auto-detected at startup; non-validated configurations should work but have not been benchmarked end-to-end):

Arch Compute Cap MXFP4 MoE NSA sparse MLA Validated
Hopper (H100 / H200) SM_90 triton_kernels flash_mla wheel
Datacenter Blackwell (B100 / B200) SM_100 trtllm-fp4 Triton fallback
Consumer Blackwell (RTX 5090) SM_120 triton_kernels Triton fallback
Ada Lovelace (RTX 4090 / L20 / L40) SM_89 triton_kernels Triton fallback
Ampere (A100 / A6000) SM_80 / SM_86 triton_kernels Triton fallback ✗ (not supported)

Prerequisites

  1. KT-Kernel installed:

    git clone https://github.com/kvcache-ai/ktransformers.git
    cd ktransformers
    git submodule update --init --recursive
    cd kt-kernel && ./install.sh
  2. SGLang installed (kvcache-ai fork):

    ./install.sh   # from ktransformers root
  3. CUDA 12.8+ and flashinfer ≥ 0.6.9 (flashinfer-python and flashinfer-cubin must be the same version):

    pip install --upgrade flashinfer-python flashinfer-cubin

    This upgrade is required (even though sglang-kt pins flashinfer_python==0.6.3) because V4-Flash's MXFP4 MoE module imports mxfp8_quantize, trtllm_fp4_block_scale_routed_moe, etc., which only exist in flashinfer ≥ 0.6.9.

  4. transformers==4.57.1 (V4-Flash is incompatible with the 5.x series):

    pip install "transformers==4.57.1"

    transformers 5.x adds default-valued fields to PretrainedConfig that make DeepSeekV4Config's dataclass declaration raise TypeError: non-default argument 'quantization_config' follows default argument at import time. sglang-kt's pyproject does not pin transformers, so a fresh pip install will pull the latest 5.x and break server startup; pinning explicitly to 4.57.1 is required until the upstream fix lands.

  5. tilelang (manual install — required for the NSA sparse-MLA tilelang indexer path used on non-Hopper GPUs):

    pip install tilelang

    sglang-kt's pyproject does not declare tilelang as a dependency, so pip install ./python[all] will not pull it in. Validated with tilelang==0.1.8.

Step 1: Download Model Weights

mkdir -p /path/to/models
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir /path/to/models/DeepSeek-V4-Flash

Step 2: Quantize CPU Weights (Optional, for AMXINT4 mode)

This step is only needed if you want to run the CPU experts in AMXINT4 mode instead (e.g., on Intel Xeon with AMX where INT4 is preferred over MXFP4).

Conversion Command

For a 4-NUMA system with 64 physical cores assigned to CPU inference:

cd /path/to/ktransformers/kt-kernel

python scripts/convert_cpu_weights_ds4.py \
  --input-path /path/to/models/DeepSeek-V4-Flash \
  --input-type fp4 \
  --output /path/to/models/DeepSeek-V4-Flash-AMXINT4 \
  --quant-method int4 \
  --cpuinfer-threads 64 \
  --threadpool-count 4 \
  --no-merge-safetensor

The script auto-detects model_type=deepseek_v4 and expert_dtype=fp4 from config.json, dequantizes the MXFP4 routed experts (group size 32) on GPU, and re-quantizes them to AMX-INT4 layout on CPU. Both HF (model.layers.{L}.mlp.experts.{E}.{proj}.weight) and V4 inference (layers.{L}.ffn.experts.{E}.{w1,w2,w3}.weight) key formats are supported.

To use the converted weights, replace the relevant flags in Step 3's launch command:

  --kt-weight-path /path/to/models/DeepSeek-V4-Flash-AMXINT4 \
  --kt-method AMXINT4 \

Step 3: Launch SGLang Server

Launch Command (Single RTX 5090 Example)

export FLASHINFER_CUDA_ARCH_LIST=12.0a
export TORCH_CUDA_ARCH_LIST="12.0+PTX"
export SGLANG_DSV4_MODE=2604
export SGLANG_DSV4_2604_SUBMODE=2604B

numactl --interleave=all python -m sglang.launch_server \
  --host 0.0.0.0 --port 30000 \
  --model /path/to/models/DeepSeek-V4-Flash \
  --kt-weight-path /path/to/models/DeepSeek-V4-Flash \
  --kt-method MXFP4 \
  --kt-num-gpu-experts 10 \
  --kt-cpuinfer 60 \
  --kt-threadpool-count 2 \
  --kt-gpu-prefill-token-threshold 4096 \
  --kt-enable-dynamic-expert-update \
  --tensor-parallel-size 1 \
  --context-length 16384 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.85 \
  --chunked-prefill-size 2048 \
  --max-prefill-tokens 2048 \
  --max-running-requests 2 \
  --watchdog-timeout 1200 \
  --disable-shared-experts-fusion \
  --trust-remote-code \
  --cuda-graph-bs 1 \
  --cuda-graph-max-bs 1 \
  --disable-radix-cache \
  --skip-server-warmup

Decode throughput: 20+ tok/s on a single RTX 5090.

It takes about 4-5 minutes to start the server (weight load + CUDA Graph capture).

See KT-Kernel Parameters for detailed parameter tuning guidelines.

Optional: Enable MTP (Multi-Token Prediction) Speculative Decoding

V4-Flash ships a NextN draft head that can be run as EAGLE-style speculative decoding for ~1.2× throughput on single-request decode (validated 26.5 → 32.74 tok/s on 8× RTX 5090, 90% accept rate at chain depth 1).

Append the following flags to the launch command above:

  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  --speculative-moe-runner-backend auto \

Step 4: Send Inference Requests

Decode

curl -s -X POST http://127.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Explain quantum computing in detail:",
    "sampling_params": {"temperature": 0.0, "max_new_tokens": 256}
  }'

Interactive Chat (kt chat)

The kt CLI ships with an OpenAI-compatible chat client that talks to the SGLang server's /v1/chat/completions endpoint:

kt chat --host 127.0.0.1 --port 30000 --temperature 0.7 --max-tokens 2048