
Deploying DeepSeek 3.2 Exp on NVIDIA H200 — Learning Lessons

Chandan Kumar

Founder, beCloudReady

October 7, 2025 · 10 min read


Technical walkthrough and lessons learned from deploying DeepSeek 3.2 Exp on high-end NVIDIA H200 GPU infrastructure.

This is a hands-on log of getting DeepSeek-V3.2-Exp (MoE) running on a single H200 box with vLLM. It covers what worked, what didn’t, how long things actually took, how to monitor it, and a repeatable runbook you can reuse.

GitHub repo: https://github.com/torontoai-hub/torontoai-llm-lab/tree/main/deepseek-3.2-Exp

Before we deploy the model, let's take a peek at the key features and why there is so much excitement about this new drop.

DeepSeek V3.2 (Exp) — Sparse Attention, Memory Efficiency

DSA (DeepSeek Sparse Attention) replaces full O(L²) attention with a two-stage pipeline (select, then attend), built from three pieces:

Lightning Indexer Head — low-precision (FP8) attention that scores relevance for each token.
Top-k Token Selection — retains a small subset (e.g. k = 64–128).
Sparse Core Attention — performs dense attention only on selected tokens.

Dense Attention — O(L²) Cost Growth

[Figure: LLM dense attention]

In dense attention, every token in the input sequence attends to every other token. For a context of length L, this requires computing and storing L × L attention weights.

Each token produces query (Q), key (K), and value (V) vectors.
The model calculates attention for every pair (Qᵢ, Kⱼ) in the sequence.
This leads to quadratic complexity in both compute and memory — O(L²).

At small context sizes (e.g. 2K–8K tokens), this is manageable, but at 128K tokens, both GPU memory and compute requirements explode. For example, a 128K context at FP16 precision can require ~67 GB for the KV cache alone, which, on top of the model weights, exceeds single-GPU capacity.

Dense attention is accurate but inefficient for long sequences, because it spends most of its compute attending to irrelevant tokens.
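As a rough back-of-the-envelope check on the KV-cache figure, the sketch below applies the usual dense-attention estimate (K and V tensors × layers × KV heads × head dimension × bytes per element × tokens). The model dimensions are hypothetical, picked only to resemble a mid-sized dense transformer; they are not DeepSeek's, so the result lands in the same ballpark as the ~67 GB quoted above rather than reproducing it exactly.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence under dense attention."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
    return seq_len * per_token


# Hypothetical dimensions (NOT DeepSeek's): 32 layers, 32 KV heads of size 128, FP16.
total = kv_cache_bytes(seq_len=128 * 1024, n_layers=32, n_kv_heads=32, head_dim=128)
print(f"{total / 2**30:.0f} GiB ({total / 1e9:.1f} GB)")
```

Note that the cache grows linearly with context length; it is the number of query-key pairs scored that grows quadratically.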

Sparse Attention (DSA) — O(L × k) Scaling

[Figure: DeepSeek Sparse Attention]

Sparse attention, as implemented in DeepSeek Sparse Attention (DSA), introduces a filtering stage that selects only the most relevant tokens before full attention is computed.

A Lightning Indexer Head (computed in FP8) evaluates the relevance of each token to the current query.
It then selects the Top-k tokens (e.g. 64–128) instead of all L tokens.
Full attention is computed only on this subset — effectively reducing compute from O(L²) → O(L × k).
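The select-then-attend idea in the steps above can be sketched for a single query in plain Python. This is a toy illustration, not DeepSeek's kernel: the `index_scores` list stands in for whatever the Lightning Indexer would produce, and the function and variable names are made up for this example.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(query, keys, values, index_scores, k):
    """Attend over only the k positions with the highest indexer score."""
    # Stage 1: cheap relevance scores pick the top-k positions.
    top = sorted(range(len(keys)), key=lambda i: index_scores[i], reverse=True)[:k]
    # Stage 2: ordinary softmax attention restricted to that subset.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in top])
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, top)) for d in range(dim)]
    return out, sorted(top)


# Toy example: 6 cached tokens, keep k=2.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0], [1.0, -1.0]]
values = [[float(i), 1.0] for i in range(6)]
index_scores = [0.1, 0.9, 0.8, 0.2, 0.0, 0.3]  # stand-in for the indexer output
out, kept = sparse_attention([1.0, 1.0], keys, values, index_scores, k=2)
print(kept, out)
```

Only the two kept positions contribute to the output; the other four keys are never multiplied against the query in stage 2.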

Because k ≪ L, the cost of core attention grows linearly with sequence length. At 128K context, the tokens actually attended to account for only ~0.125 GB of KV data (vs ~67 GB for dense attention), a ~99.8% reduction.
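To make the scaling concrete, here is a small arithmetic sketch. It counts only the query-key pairs scored by core attention, ignoring causal masking and the indexer's own (lightweight) pass over all tokens, and it uses k = 128 from the illustrative range above. The function name is hypothetical.

```python
def attention_pairs(seq_len, k=None):
    """Query-key pairs scored by core attention: L*L dense, L*k with top-k."""
    per_query = seq_len if k is None else min(k, seq_len)
    return seq_len * per_query


L = 128 * 1024
k = 128  # upper end of the illustrative range above
dense = attention_pairs(L)
sparse = attention_pairs(L, k)
print(f"dense={dense:,} sparse={sparse:,} ratio={dense // sparse}x")

# Reduction implied by the quoted memory figures (0.125 GB vs 67 GB).
reduction_pct = (1 - 0.125 / 67) * 100
print(f"reduction = {reduction_pct:.1f}%")
```

The ratio is simply L / k, which is why the savings grow as the context gets longer.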

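This brings the attention-side cost of long-context reasoning, multi-document summarization, or retrieval-augmented pipelines within reach of a single 80 GB GPU, with more stable latency and near-linear cost growth. (The full model weights still need a multi-GPU node, as the rest of this post shows.)

The result is a drastic reduction in GPU inference cost.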

Deployment TL;DR (what finally worked)

Model: deepseek-ai/DeepSeek-V3.2-Exp (MoE; 163 shards; several 4.30 GB + a few 1.86 GB)
Runtime: vLLM (OpenAI-compatible)
Parallelism: tried -dp 8 --enable-expert-parallel first → hit NCCL/TCPStore “broken pipe” issues
Stable bring-up: -tp 8 (Tensor Parallel across 8 H200s)
Warmup: Long FP8 GEMM warmups + CUDA graph capture on first run (subsequent restarts are much faster due to cache)
Metrics: vLLM /metrics + Prometheus + Grafana (node_exporter + dcgm-exporter recommended)
Client validation: One-file OpenAI-compatible Python script; plus lm-eval for GSM8K
Grafana: Dashboard parameterized with $model_name = deepseek-ai/DeepSeek-V3.2-Exp
Cloud provider: Shadeform / DataCrunch (Iceland)
Total cost: $54 for 2 hours

Details for Developers

Minimum Requirement

Per the vLLM recipe for DeepSeek, the recommended GPUs are B200 or H200.

You also need Python 3.12 with CUDA 13.

GPU Hunting Strategy

For quick and affordable GPU experiments, I usually rely on shadeform.ai or runpod.ai. Luckily, I had some shadeform.ai credits left, so I used them for this run, and the setup was surprisingly smooth.

B200 Trial

First I tried to get a B200 node, but either no bare-metal (BM) node was available or, in some cases, I could not get the NVIDIA driver working.

shadeform@dawvygtc:~$ sudo apt install cuda-drivers
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cuda-drivers is already the newest version (580.95.05-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 165 not upgraded.
shadeform@dawvygtc:~$ lspci | grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
3d:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
60:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
70:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
98:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
bb:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
dd:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
ed:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
shadeform@dawvygtc:~$ nvidia-smi
No devices were found
shadeform@dawvygtc:~$
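The failure mode in that session (devices visible on the PCI bus, but invisible to the driver) is easy to detect automatically before you start paying for debugging time. A minimal sketch, assuming you capture the text output of `lspci` and `nvidia-smi` yourself; both helper names are made up for this example.

```python
def count_nvidia_pci_devices(lspci_output):
    """Count PCI lines that mention an NVIDIA controller."""
    return sum(
        1
        for line in lspci_output.splitlines()
        if "nvidia" in line.lower() and "controller" in line.lower()
    )


def driver_sees_gpus(nvidia_smi_output):
    """True unless nvidia-smi printed its 'No devices were found' message."""
    return "no devices were found" not in nvidia_smi_output.lower()


# Sample text copied from the session above (two of the eight lspci lines).
lspci_text = """17:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
3d:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)"""
smi_text = "No devices were found"

if count_nvidia_pci_devices(lspci_text) > 0 and not driver_sees_gpus(smi_text):
    print("GPUs are on the PCI bus but the driver cannot see them")
```

In practice you would feed it the output of `subprocess.run(["lspci"], capture_output=True, text=True)` and the equivalent for `nvidia-smi`.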

I could have kept troubleshooting, but I didn’t want to pay $35/hour while struggling with environment issues, so I killed the node and looked for another one.

H200 + Ubuntu 24 + Nvidia Driver 580 — Worked

Because a full H200 node costs at least $25/hour, I didn’t want to spend time provisioning Ubuntu 22 and then upgrading to Python 3.12. Instead, I looked for an H200 image with Ubuntu 24 to reduce setup time. I ended up renting a DataCrunch H200 server in Iceland; on the first try, the Python and CUDA requirements lined up with minimal hassle, so I moved forward (it still wasn’t entirely smooth). I chose Ubuntu 24 with CUDA 12.8.

To get PyTorch working, you need to match version numbers exactly. For NVIDIA driver 580, use CUDA 13.

# --- Install Python & pip ---
sudo apt install -y python3 python3-pip
pip install --upgrade pip

# --- Install uv package manager (optional, faster) ---
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# --- Create and activate virtual environment ---
uv venv
source .venv/bin/activate

# --- Install PyTorch nightly build with CUDA 13.0 support ---
uv pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu130

# --- Confirm PyTorch can see the GPUs (should print "True") ---
python -c "import torch; print(torch.cuda.is_available())"

Once the commands above work, install vLLM:

# --- Install vLLM and dependencies ---
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
uv pip install https://wheels.vllm.ai/dsv32/deep_gemm-2.1.0%2B594953a-cp312-cp312-linux_x86_64.whl

# --- Install supporting Python libraries ---
uv pip install openai transformers accelerate numpy --quiet

# --- Verify vLLM environment ---
python -c "import torch, vllm, transformers, numpy; print('✅ Environment ready')"

System Validation script

python3 system_validation.py
======================================================================
SYSTEM INFORMATION
======================================================================

OS: Linux 6.8.0-79-generic
Python: 3.12.3
PyTorch: 2.8.0+cu128
CUDA available: True
CUDA version: 12.8
cuDNN version: 91002
Number of GPUs: 8

======================================================================
GPU DETAILS
======================================================================

GPU[0]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[1]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[2]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[3]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[4]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[5]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[6]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[7]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

Total GPU Memory: 1200.88 GB

======================================================================
NVLINK STATUS
======================================================================

✅ NVLink detected - Multi-GPU performance will be optimal

======================================================================
CONFIGURATION RECOMMENDATIONS
======================================================================

✅ Sufficient GPU memory for DeepSeek-V3.2-Exp
   Recommended mode: EP/DP (--dp 8 --enable-expert-parallel)

Here is another catch: the official vLLM recipes recommend Expert Parallelism + Data Parallelism (EP/DP). I would not recommend it on H200 unless you have extra time to troubleshoot EP/DP issues.

For a single full H200 node, I recommend Tensor Parallel mode (the fallback):

vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8
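Once the server has finished loading and warming up, a one-file smoke test against the OpenAI-compatible endpoint might look like the sketch below. It uses only the standard library; the port (8000) and the /v1/completions route are taken from the lm-eval command later in this post, and the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1"  # assumption: vLLM's default port on this host
MODEL = "deepseek-ai/DeepSeek-V3.2-Exp"


def build_completion_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=120):
    """Send the prompt and return the first completion's text."""
    request = build_completion_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Usage (the server must already be up):
#   print(complete("Q: What is 17 * 3? A:"))
```

Keeping request construction separate from the network call makes it easy to inspect the payload without a running server.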

Downloading the model (what to expect)

DeepSeek-V3.2-Exp has a large number of shards (model-00001-of-000163.safetensors …). Downloads ran 8 in parallel; each shard is ~4.30 GB (some ~1.86 GB). At ~28–33 MB/s per stream, 8 at once gives ~220–260 MB/s aggregate (sar showed ~239 MB/s).
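Those numbers are enough to budget the download before renting the node. A minimal sketch, assuming a steady aggregate rate and treating every shard as the larger size (so it is an upper bound); the function name is made up for this example.

```python
def download_minutes(n_shards, shard_gb, aggregate_mb_per_s):
    """Estimated wall-clock minutes to pull all shards at a steady rate."""
    total_mb = n_shards * shard_gb * 1000  # decimal GB -> MB
    return total_mb / aggregate_mb_per_s / 60


# Upper bound: treat all 163 shards as 4.30 GB (a few are really ~1.86 GB).
minutes = download_minutes(n_shards=163, shard_gb=4.30, aggregate_mb_per_s=239)
print(f"~{163 * 4.30:.0f} GB in about {minutes:.0f} minutes")
```

At $25/hour or more for the node, that is a meaningful share of the bill spent before the model serves a single token.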

What the long warm-up logs mean

You’ll see long sequences like:

DeepGemm(fp8_gemm_nt) warmup (...) 8192/8192
DeepGemm(m_grouped_fp8_gemm_nt_contiguous) warmup (W=torch.Size([..., ..., ...]))
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE/FULL
vLLM and its kernels are profiling and compiling FP8 GEMMs for many layer shapes.
MoE models use grouped GEMMs, which is why the m_grouped_* warmups appear.
CUDA Graphs are being captured for common prefill/decode paths to minimize runtime launch overhead.
The first start is the slowest. Compiled graphs and torch.compile artifacts are cached under:
~/.cache/vllm/torch_compile_cache//rank_*/backbon
Subsequent restarts are much faster.

Maximum concurrency for 163,840 tokens per request: 5.04x

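That is vLLM reporting its KV-cache capacity math: with the GPU memory left over for KV cache after loading weights, it can hold roughly five requests at the full 163,840-token context length at the same time. Shorter requests allow far more concurrency.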
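The figure is a plain ratio: tokens that fit in the KV cache divided by the maximum tokens per request. The sketch below backs the cache size out of the logged 5.04x rather than reading it from the logs, so treat the token count as an inferred value, not a measured one.

```python
def max_concurrency(kv_cache_tokens, max_model_len):
    """How many full-length requests fit in the KV cache at the same time."""
    return kv_cache_tokens / max_model_len


max_model_len = 163_840
# Assumption: back out the cache size from the logged 5.04x (not read from the logs).
kv_cache_tokens = round(5.04 * max_model_len)
ratio = max_concurrency(kv_cache_tokens, max_model_len)
print(f"{kv_cache_tokens:,} cached tokens -> {ratio:.2f}x")
```

If you cap the context with a smaller maximum length, the same cache supports proportionally more concurrent requests.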

Common bring-up errors & fixes

Symptoms: TCPStore sendBytes... Broken pipe, Failed to check the "should dump" flag, API returns HTTP 500, server shuts down.

Usual causes & fixes:

A worker/rank died (OOM, kernel assert, unexpected shape) → All ranks try to talk to a dead TCPStore → broken pipe spam.
Mismatched parallelism vs GPU count → keep it simple: -tp 8 on 8 GPUs; only 1 form of parallelism while stabilizing.
No IB on the host? → export NCCL_IB_DISABLE=1
Kernel/driver hiccups → verify nvidia-smi is stable; check dmesg.
Don’t send traffic during warmup/graph capture; wait until you see the final “All ranks ready”/Uvicorn up logs.
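Rather than watching the logs by hand, a client can poll the server until it answers. A minimal sketch, assuming vLLM's default port and its /health route; the function names, attempt count, and delay are arbitrary choices for illustration.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(url, attempts=120, delay_s=15, probe=http_ok, sleep=time.sleep):
    """Poll url until the probe succeeds; return the attempt number, or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay_s)
    return None


# Usage (assumes vLLM's default port and its /health route):
#   wait_until_ready("http://127.0.0.1:8000/health")
```

The `probe` and `sleep` parameters exist so the loop can be exercised without a live server or real delays.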

Metrics: Prometheus & exporters

You can deploy the monitoring stack straight from the git repo:

docker compose up -d

You should be able to access the Grafana UI with the default user/password (admin/admin):

http://<publicIP>:3000

You need to add the Prometheus data source (default) and then import the Grafana dashboard JSON customized for DeepSeek V3.2.
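If you want to sanity-check what Prometheus will scrape before wiring up Grafana, the text served at vLLM's /metrics endpoint is simple enough to parse by hand. The sketch below is a deliberately small parser (it ignores optional timestamps and does not split labels), and the sample text is illustrative only, not captured from this deployment.

```python
def parse_prometheus_text(text):
    """Parse Prometheus exposition text into {sample_name_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines this simple parser does not understand
    return samples


# Illustrative sample only; real output has many more series and labels.
sample = """# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="deepseek-ai/DeepSeek-V3.2-Exp"} 3.0
vllm:num_requests_waiting{model_name="deepseek-ai/DeepSeek-V3.2-Exp"} 0.0
"""
parsed = parse_prometheus_text(sample)
for name, value in parsed.items():
    print(name, value)
```

The `model_name` label is the same one the Grafana dashboard filters on through its `$model_name` variable.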

Now — Show time

If you see Uvicorn logs, you can start firing tests and validation.

Final Output

Default Evaluation (GSM8K, 5-shot)

lm-eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False

It can take a few minutes to load all the tests.

INFO 10-08 01:58:52 [__init__.py:224] Automatically detected platform cuda.
2025-10-08:01:58:55 INFO     [__main__:446] Selected Tasks: ['gsm8k']
2025-10-08:01:58:55 INFO     [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-10-08:01:58:55 INFO     [evaluator:240] Initializing local-completions model, with arguments: {'model': 'deepseek-ai/DeepSeek-V3.2-Exp', 'base_url': 'http://127.0.0.1:8000/v1/completions', 'num_concurrent': 100, 'max_retries': 3, 'tokenized_requests': False}
2025-10-08:01:58:55 INFO     [models.api_models:170] Using max length 2048 - 1
2025-10-08:01:58:55 INFO     [models.api_models:189] Using tokenizer huggingface
README.md: 7.94kB [00:00, 18.2MB/s]
main/train-00000-of-00001.parquet: 100%|██████████| 2.31M/2.31M [00:01<00:00, 1.86MB/s]
main/test-00000-of-00001.parquet: 100%|██████████| 419k/419k [00:00<00:00, 1.38MB/s]
Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 342925.03 examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 212698.46 examples/s]
2025-10-08:01:59:02 INFO     [evaluator:305] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2025-10-08:01:59:02 INFO     [api.task:434] Building contexts for gsm8k on rank 0...
100%|██████████| 1319/1319 [00:03<00:00, 402.50it/s]
2025-10-08:01:59:05 INFO     [evaluator:574] Running generate_until requests
2025-10-08:01:59:05 INFO     [models.api_models:692] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API: 100%|██████████| 1319/1319 [04:55<00:00,  4.47it/s]
fatal: not a git repository (or any of the parent directories): .git
2025-10-08:02:04:03 INFO     [loggers.evaluation_tracker:280] Output path not provided, skipping saving results aggregated
local-completions (model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

Final result, which matches the official docs:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9507|±  |0.0060|
|     |       |strict-match    |     5|exact_match|↑  |0.9484|±  |0.0061|
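The Stderr column can be sanity-checked from the accuracy and the test-set size alone. The sketch below uses the textbook binomial standard error, which is an assumption about how the number is derived rather than a description of lm-eval's internals, and adds a rough 95% interval.

```python
import math


def binomial_stderr(accuracy, n):
    """Standard error of a proportion measured on n independent items."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


n_test = 1319  # GSM8K test items, as shown in the lm-eval log
for name, acc in [("flexible-extract", 0.9507), ("strict-match", 0.9484)]:
    se = binomial_stderr(acc, n_test)
    low, high = acc - 1.96 * se, acc + 1.96 * se
    print(f"{name}: stderr={se:.4f}, 95% CI=({low:.4f}, {high:.4f})")
```

The two filters' intervals overlap heavily, so the gap between 0.9507 and 0.9484 is well within noise for a test set of this size.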

Few-Shot Evaluation (20 examples)

 lm-eval --model local-completions --tasks gsm8k   --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False --num_fewshot 20

The result looks pretty good.

You can watch the Grafana dashboard for analytics.

[Figure: Grafana vLLM dashboard for DeepSeek V3.2 Exp]
