Deploying DeepSeek 3.2 Exp on NVIDIA H200 — Lessons Learned

Technical walkthrough and lessons learned from deploying DeepSeek 3.2 Exp on high-end NVIDIA H200 GPU infrastructure.
This is a hands-on log of getting DeepSeek-V3.2-Exp (MoE) running on a single H200 box with vLLM. It covers what worked, what didn’t, how long things actually took, how to monitor it, and a repeatable runbook you can reuse.
GitHub repo: https://github.com/torontoai-hub/torontoai-llm-lab/tree/main/deepseek-3.2-Exp
Before we deploy the model, let's take a peek at the key features and why there is so much excitement about this new drop.
DeepSeek V3.2 (Exp) — Sparse Attention, Memory Efficiency
DSA replaces full O(L²) attention with a lightweight selection-then-attention pipeline:
- Lightning Indexer Head — low-precision (FP8) attention that scores relevance for each token.
- Top-k Token Selection — retains a small subset (e.g. k = 64–128).
- Sparse Core Attention — performs dense attention only on selected tokens.
Dense Attention — O(L²) Cost Growth

In dense attention, every token in the input sequence attends to every other token. For a context of length L, this requires computing and storing L × L attention weights.
- Each token produces query (Q), key (K), and value (V) vectors.
- The model calculates attention for every pair (Qᵢ, Kⱼ) in the sequence.
- This leads to quadratic complexity in both compute and memory — O(L²).
At small context sizes (e.g. 2K–8K tokens), this is manageable, but at 128K tokens, both GPU memory and compute requirements explode. For example, 128K context at FP16 precision requires ~67 GB just for KV cache, exceeding single-GPU capacity.
Dense attention is accurate but inefficient for long sequences, because it spends most of its compute attending to irrelevant tokens.
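To make the quadratic growth concrete, here is a small back-of-the-envelope calculation. It is illustrative only: it assumes a single head and 2 bytes per attention score if the full L × L matrix were materialized, which real fused kernels avoid.

```python
# Illustrative only: how the number of dense attention pairs grows with context length.
# Assumes one head and FP16 (2 bytes) per score if the L x L matrix were materialized;
# real models use many heads/layers and fused kernels that never store the full matrix.
for L in (2_048, 8_192, 32_768, 131_072):
    pairs = L * L                     # every token attends to every token
    score_bytes = pairs * 2           # hypothetical bytes if scores were materialized in FP16
    print(f"L={L:>7,}  pairs={pairs:>16,}  ~{score_bytes / 2**30:8.1f} GiB of scores")
```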
Sparse Attention (DSA) — O(L × k) Scaling

Sparse attention, as implemented in DeepSeek Sparse Attention (DSA), introduces a filtering stage that selects only the most relevant tokens before full attention is computed.
- A Lightning Indexer Head (computed in FP8) evaluates the relevance of each token to the current query.
- It then selects the Top-k tokens (e.g. 64–128) instead of all L tokens.
- Full attention is computed only on this subset — effectively reducing compute from O(L²) → O(L × k).
Because k ≪ L, the cost grows linearly with sequence length. At 128K context, DSA requires only ~0.125 GB of KV memory (vs ~67 GB for dense attention), a ~99.8% memory reduction.
This allows long-context reasoning, multi-document summarization, or retrieval-augmented pipelines to run on a single 80 GB GPU with stable latency and flat cost growth.
The result is a drastic reduction in GPU inference cost.
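As a mental model (a toy sketch, not the actual DSA/Lightning Indexer kernels, and with causal masking omitted), the select-then-attend idea looks roughly like this in PyTorch: a cheap scoring pass picks the top-k keys per query, and dense attention is computed only over that subset.

```python
import torch
import torch.nn.functional as F

def topk_sparse_attention(q, k, v, idx_q, idx_k, top_k=64):
    """Toy sketch of indexer-then-top-k attention (illustrative, not the real DSA kernels).

    q, k, v:      [L, d]      query/key/value vectors for one head
    idx_q, idx_k: [L, d_idx]  cheap 'indexer' projections used only for scoring
    """
    L = q.shape[0]
    top_k = min(top_k, L)

    # 1) Lightweight relevance scores (stand-in for the FP8 Lightning Indexer).
    index_scores = idx_q @ idx_k.T                       # [L, L], cheap/low-precision in DSA

    # 2) Keep only the top-k keys per query (causal masking omitted for brevity).
    topk_idx = index_scores.topk(top_k, dim=-1).indices  # [L, top_k]

    # 3) Dense attention restricted to the selected tokens: O(L * k) instead of O(L^2).
    k_sel = k[topk_idx]                                   # [L, top_k, d]
    v_sel = v[topk_idx]                                   # [L, top_k, d]
    attn = torch.einsum("ld,lkd->lk", q, k_sel) / q.shape[-1] ** 0.5
    weights = F.softmax(attn, dim=-1)
    return torch.einsum("lk,lkd->ld", weights, v_sel)     # [L, d]

# Tiny usage example with random tensors.
L, d, d_idx = 1024, 128, 32
out = topk_sparse_attention(torch.randn(L, d), torch.randn(L, d), torch.randn(L, d),
                            torch.randn(L, d_idx), torch.randn(L, d_idx), top_k=64)
print(out.shape)  # torch.Size([1024, 128])
```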
Deployment TL;DR (what finally worked)
- Model: deepseek-ai/DeepSeek-V3.2-Exp (MoE; 163 shards; several ~4.30 GB plus a few ~1.86 GB)
- Runtime: vLLM (OpenAI-compatible)
- Parallelism:
  - Tried -dp 8 --enable-expert-parallel → hit NCCL/TCPStore "broken pipe" issues
  - Stable bring-up: -tp 8 (Tensor Parallel across 8 H200s)
- Warmup: long FP8 GEMM warmups + CUDA graph capture on first run (subsequent restarts are much faster due to the cache)
- Metrics: vLLM /metrics + Prometheus + Grafana (node_exporter + dcgm-exporter recommended)
- Client validation: one-file OpenAI-compatible Python script, plus lm-eval for GSM8K
- Grafana: dashboard parameterized with $model_name = deepseek-ai/DeepSeek-V3.2-Exp
- Cloud provider: Shadeform / DataCrunch (Iceland)
- Total cost: ~$54 for ~2 hours
Details for Developers
Minimum Requirements
As per the vLLM recipe book for DeepSeek, the recommended GPUs are B200 or H200.
Also required: Python 3.12 with CUDA 13.
GPU Hunting Strategy
For quick and affordable GPU experiments, I usually rely on shadeform.ai or runpod.ai. Luckily, I had some shadeform.ai credits left, so I used them for this run, and the setup was surprisingly smooth.
B200 Trial
First I tried to get a B200 node, but I either couldn't get a bare-metal node allocated or, in some cases, couldn't get the NVIDIA driver working.
shadeform@dawvygtc:~$ sudo apt install cuda-drivers
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cuda-drivers is already the newest version (580.95.05-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 165 not upgraded.
shadeform@dawvygtc:~$ lspci | grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
3d:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
60:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
70:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
98:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
bb:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
dd:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
ed:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
shadeform@dawvygtc:~$ nvidia-smi
No devices were found
shadeform@dawvygtc:~$
I could have troubleshot it, but I didn't want to pay $35/hour while struggling with environment issues, so I killed the node and looked for another one.
H200 + Ubuntu 24 + Nvidia Driver 580 — Worked
Because a full H200 node costs at least $25/hour, I didn’t want to spend time provisioning Ubuntu 22 and then upgrading to Python 3.12. Instead, I looked for an H200 image with Ubuntu 24 to reduce setup time. I ended up renting a DataCrunch H200 server in Iceland; on the first try, the Python and CUDA requirements lined up with minimal hassle, so I moved forward (it still wasn’t entirely smooth). I chose Ubuntu 24 with CUDA 12.8.
To get PyTorch working, you need to match the exact version numbers: for NVIDIA driver 580, you should use CUDA 13.
# --- Install Python & pip ---
sudo apt install -y python3 python3-pip
pip install --upgrade pip
# --- Install uv package manager (optional, faster) ---
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
# --- Create and activate virtual environment ---
uv venv
source .venv/bin/activate
# --- Install PyTorch nightly build with CUDA 13.0 support ---
uv pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu130
# Ensure the command below returns "True" in your Python terminal
import torch
torch.cuda.is_available()
Once the commands above are working, proceed with the vLLM installation.
# --- Install vLLM and dependencies ---
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
uv pip install https://wheels.vllm.ai/dsv32/deep_gemm-2.1.0%2B594953a-cp312-cp312-linux_x86_64.whl
# --- Install supporting Python libraries ---
uv pip install openai transformers accelerate numpy --quiet
# --- Verify vLLM environment ---
python -c "import torch, vllm, transformers, numpy; print('✅ Environment ready')"
System Validation script
python3 system_validation.py
======================================================================
SYSTEM INFORMATION
======================================================================
OS: Linux 6.8.0-79-generic
Python: 3.12.3
PyTorch: 2.8.0+cu128
CUDA available: True
CUDA version: 12.8
cuDNN version: 91002
Number of GPUs: 8
======================================================================
GPU DETAILS
======================================================================
GPU[0]: Name: NVIDIA H200 Compute Capability: 9.0 Memory: 150.11 GB Multi-Processors: 132 Status: ✅ Hopper architecture - Supported
GPU[1]: Name: NVIDIA H200 Compute Capability: 9.0 Memory: 150.11 GB Multi-Processors: 132 Status: ✅ Hopper architecture - Supported
GPU[2]: Name: NVIDIA H200 Compute Capability: 9.0 Memory: 150.11 GB Multi-Processors: 132 Status: ✅ Hopper architecture - Supported
GPU[3]: Name: NVIDIA H200 Compute Capability: 9.0 Memory: 150.11 GB Multi-Processors: 132 Status: ✅ Hopper architecture - Supported
GPU[4]: Name: NVIDIA H200 Compute Capability: 9.0 Memory: 150.11 GB Multi-Processors: 132 Status: ✅ Hopper architecture - Supported
GPU[5]: Name: NVIDIA H200 Compute Capability: 9.0 Memory: 150.11 GB Multi-Processors: 132 Status: ✅ Hopper architecture - Supported
GPU[6]: Name: NVIDIA H200 Compute Capability: 9.0 Memory: 150.11 GB Multi-Processors: 132 Status: ✅ Hopper architecture - Supported
GPU[7]: Name: NVIDIA H200 Compute Capability: 9.0 Memory: 150.11 GB Multi-Processors: 132 Status: ✅ Hopper architecture - Supported
Total GPU Memory: 1200.88 GB
======================================================================
NVLINK STATUS
======================================================================
✅ NVLink detected - Multi-GPU performance will be optimal
======================================================================
CONFIGURATION RECOMMENDATIONS
======================================================================
✅ Sufficient GPU memory for DeepSeek-V3.2-Exp
Recommended mode: EP/DP (--dp 8 --enable-expert-parallel)
Here is another catch: the official vLLM recipes recommend Expert Parallelism + Data Parallelism (EP/DP), but I would not recommend it for H200 unless you have extra time to troubleshoot EP/DP issues.
For a single full H200 node, I recommend Tensor Parallel mode (the fallback) instead:
vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8
Downloading the model (what to expect)
DeepSeek-V3.2-Exp has a large number of shards (model-00001-of-000163.safetensors, …), each ~4.30 GB (some ~1.86 GB). With ~28–33 MB/s per stream and 8 parallel downloads, you get ~220–260 MB/s aggregate (sar showed ~239 MB/s).
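If you want to pre-fetch the shards before starting vLLM (so server start-up isn't blocked on the download), something like the snippet below works. It uses snapshot_download from huggingface_hub; the max_workers=8 choice simply mirrors the 8 parallel streams described above.

```python
# Optional: pre-download all 163 shards with 8 parallel streams into the local HF cache.
from huggingface_hub import snapshot_download

path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V3.2-Exp",
    max_workers=8,  # number of parallel download streams
)
print("Model cached at:", path)
```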
What the long warm-up logs mean
You’ll see long sequences like:
- DeepGemm(fp8_gemm_nt) warmup (...) 8192/8192
- DeepGemm(m_grouped_fp8_gemm_nt_contiguous) warmup (W=torch.Size([..., ..., ...]))
- Capturing CUDA graphs (mixed prefill-decode, PIECEWISE/FULL)

What's happening:
- vLLM and its kernels are profiling & compiling FP8 GEMMs for many layer shapes.
- MoE models do grouped GEMMs, hence the grouped-GEMM warmups.
- CUDA graphs are being captured for common prefill/decode paths to minimize runtime launch overhead.
- The first start is the slowest. Compiled graphs and torch.compile artifacts are cached under ~/.cache/vllm/torch_compile_cache//rank_*/backbon– subsequent restarts are much faster.

You'll also see a line like:
Maximum concurrency for 163,840 tokens per request: 5.04x
That's vLLM reporting its KV-cache math: roughly how many requests at that maximum context length its KV cache can serve concurrently.
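My understanding (an assumption; check vLLM's source for the exact formula) is that the reported multiplier is roughly the KV-cache capacity in tokens divided by the maximum tokens a single request can occupy. The capacity figure below is hypothetical, chosen only to reproduce a 5.04x-style number.

```python
# Assumed formula, not copied from vLLM source:
#   concurrency ≈ tokens the KV cache can hold / max tokens per request
max_model_len = 163_840        # tokens per request, from the log line
kv_cache_tokens = 825_000      # hypothetical KV-cache capacity found during profiling
print(f"{kv_cache_tokens / max_model_len:.2f}x")  # -> 5.04x
```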
Common bring-up errors & fixes
Symptoms: TCPStore sendBytes... Broken pipe, Failed to check the "should dump" flag, API returns HTTP 500, server shuts down.
Usual causes & fixes:
- A worker/rank died (OOM, kernel assert, unexpected shape) → all ranks try to talk to a dead TCPStore → broken-pipe spam.
- Mismatched parallelism vs GPU count → keep it simple: -tp 8 on 8 GPUs; only one form of parallelism while stabilizing.
- No IB on the host? → export NCCL_IB_DISABLE=1
- Kernel/driver hiccups → verify nvidia-smi is stable; check dmesg.
- Don't send traffic during warmup/graph capture; wait until you see the final "All ranks ready"/Uvicorn-up logs (see the readiness-check sketch below).
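To avoid the "traffic during warmup" failure mode, a small readiness gate helps. The sketch below polls the /health endpoint of vLLM's OpenAI-compatible server until it returns 200 before any load is sent; the URL and timeout are assumptions for this particular setup.

```python
# Poll vLLM's /health endpoint and only start sending traffic once the server is up.
import time
import requests

def wait_for_vllm(base="http://127.0.0.1:8000", timeout_s=3600):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(f"{base}/health", timeout=5).status_code == 200:
                print("vLLM is ready")
                return True
        except requests.RequestException:
            pass  # server still warming up / capturing CUDA graphs
        time.sleep(15)
    raise TimeoutError("vLLM did not become healthy in time")

if __name__ == "__main__":
    wait_for_vllm()
```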
Metrics: Prometheus & exporters
You can simply deploy the monitoring stack from the Git repo:
docker compose up -d
You should be able to access the Grafana UI with the default user/password (admin/admin):
http://<publicIP>:3000
You need to add the Prometheus data source (as the default) and then import the Grafana dashboard JSON customized for DeepSeek V3.2.
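As a quick sanity check that Prometheus has something to scrape, you can hit vLLM's /metrics endpoint directly (Prometheus text format). The host and port below are the defaults assumed throughout this walkthrough.

```python
# Print the vLLM-specific Prometheus metrics exposed by the running server.
import requests

text = requests.get("http://127.0.0.1:8000/metrics", timeout=10).text
for line in text.splitlines():
    if line.startswith("vllm:"):  # vLLM metric names are prefixed with "vllm:"
        print(line)
```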
Now — Show time
If you see the Uvicorn logs, you can start firing tests and validation; a minimal client smoke test is sketched below.
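Here's the kind of one-file OpenAI-compatible smoke test I mean. The model name and base_url match this deployment; the prompt and generation parameters are arbitrary.

```python
# Minimal OpenAI-compatible smoke test against the local vLLM server.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")  # vLLM ignores the key

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.2-Exp",
    messages=[{"role": "user", "content": "In one sentence, what is sparse attention?"}],
    max_tokens=128,
    temperature=0.2,
)
print(resp.choices[0].message.content)
```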
Zero-Shot Evaluation
lm-eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False
It can take a few minutes to load all the tests.
INFO 10-08 01:58:52 [__init__.py:224] Automatically detected platform cuda.
2025-10-08:01:58:55 INFO [__main__:446] Selected Tasks: ['gsm8k']
2025-10-08:01:58:55 INFO [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-10-08:01:58:55 INFO [evaluator:240] Initializing local-completions model, with arguments: {'model': 'deepseek-ai/DeepSeek-V3.2-Exp', 'base_url': 'http://127.0.0.1:8000/v1/completions', 'num_concurrent': 100, 'max_retries': 3, 'tokenized_requests': False}
2025-10-08:01:58:55 INFO [models.api_models:170] Using max length 2048 - 1
2025-10-08:01:58:55 INFO [models.api_models:189] Using tokenizer huggingface
README.md: 7.94kB [00:00, 18.2MB/s]
main/train-00000-of-00001.parquet: 100%|████████| 2.31M/2.31M [00:01<00:00, 1.86MB/s]
main/test-00000-of-00001.parquet: 100%|████████| 419k/419k [00:00<00:00, 1.38MB/s]
Generating train split: 100%|████████| 7473/7473 [00:00<00:00, 342925.03 examples/s]
Generating test split: 100%|████████| 1319/1319 [00:00<00:00, 212698.46 examples/s]
2025-10-08:01:59:02 INFO [evaluator:305] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2025-10-08:01:59:02 INFO [api.task:434] Building contexts for gsm8k on rank 0...
100%|████████| 1319/1319 [00:03<00:00, 402.50it/s]
2025-10-08:01:59:05 INFO [evaluator:574] Running generate_until requests
2025-10-08:01:59:05 INFO [models.api_models:692] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API: 100%|████████| 1319/1319 [04:55<00:00, 4.47it/s]
fatal: not a git repository (or any of the parent directories): .git
2025-10-08:02:04:03 INFO [loggers.evaluation_tracker:280] Output path not provided, skipping saving results aggregated
local-completions (model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
Final result — which matches the official doc:
|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.9507|±|0.0060|
| | |strict-match|5|exact_match|↑|0.9484|±|0.0061|
Few-Shot Evaluation (20 examples)
lm-eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False --num_fewshot 20
Result looks pretty good
You can observe the Grafana dashboard for analytics.

Grafana vLLM Dashboard - Deepseek V.3.2 Exp