Deploying DeepSeek 3.2 Exp on NVIDIA H200 — Learning Lessons

Chandan Kumar
Founder, beCloudReady
October 7, 2025 · 10 min read

Technical walkthrough and lessons learned from deploying DeepSeek 3.2 Exp on high-end NVIDIA H200 GPU infrastructure.

This is a hands-on log of getting DeepSeek-V3.2-Exp (MoE) running on a single H200 box with vLLM. It covers what worked, what didn’t, how long things actually took, how to monitor it, and a repeatable runbook you can reuse.

GitHub repo: https://github.com/torontoai-hub/torontoai-llm-lab/tree/main/deepseek-3.2-Exp

Before we deploy the model, let's take a peek at the key features and why there is so much excitement about this new drop.

DeepSeek V3.2 (Exp) — Sparse Attention, Memory Efficiency

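DeepSeek Sparse Attention (DSA) replaces full O(L²) attention with a two-stage pipeline: token selection (indexer plus top-k), followed by attention over the selected tokens only: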

  • Lightning Indexer Head — low-precision (FP8) attention that scores relevance for each token.

  • Top-k Token Selection — retains a small subset (e.g. k = 64–128).

  • Sparse Core Attention — performs dense attention only on selected tokens.

Dense Attention — O(L²) Cost Growth

Figure: Dense attention in an LLM

In dense attention, every token in the input sequence attends to every other token. For a context of length L, this requires computing and storing L × L attention weights.

  • Each token produces query (Q), key (K), and value (V) vectors.

  • The model calculates attention for every pair (Qᵢ, Kⱼ) in the sequence.

  • This leads to quadratic complexity in both compute and memory — O(L²).

At small context sizes (e.g. 2K–8K tokens), this is manageable, but at 128K tokens, both GPU memory and compute requirements explode. For example, 128K context at FP16 precision requires ~67 GB just for KV cache, exceeding single-GPU capacity.

Dense attention is accurate but inefficient for long sequences, because it spends most of its compute attending to irrelevant tokens.

Sparse Attention (DSA) — O(L × k) Scaling

Figure: DeepSeek Sparse Attention in an LLM

Sparse attention, as implemented in DeepSeek Sparse Attention (DSA), introduces a filtering stage that selects only the most relevant tokens before full attention is computed.

  • A Lightning Indexer Head (computed in FP8) evaluates the relevance of each token to the current query.

  • It then selects the Top-k tokens (e.g. 64–128) instead of all L tokens.

  • Full attention is computed only on this subset — effectively reducing compute from O(L²) → O(L × k).
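The three steps above can be sketched for a single query in plain Python. This is a toy illustration, not DeepSeek's implementation: the real indexer is a separate low-precision head, whereas here a plain dot product stands in for it, and the function names and the tiny vectors are made up for the example.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(query, keys, values, k):
    """Attend from one query to only the k highest-scoring keys."""
    # Stage 1 (stand-in for the indexer, an assumption): score every key cheaply.
    index_scores = [dot(query, key) for key in keys]
    # Top-k selection: keep the indices of the k best-scoring tokens.
    selected = heapq.nlargest(k, range(len(keys)), key=lambda i: index_scores[i])
    # Stage 2: scaled dot-product attention over the selected subset only.
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, selected))
        for d in range(dim)
    ]
    return selected, output


# Toy data: 4 tokens with 2-dimensional keys and values.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]]

selected, output = sparse_attention(query, keys, values, k=2)
print("selected token indices:", selected)
print("attention output:", output)
```

Only the two selected tokens contribute to the output; the other two are never touched in stage 2, which is where the savings come from as L grows.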

Because k ≪ L, the cost grows linearly with sequence length. At 128K context, DSA requires only ~0.125 GB of KV memory (vs 67 GB for dense attention), a ~99.8 % memory reduction.
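A quick back-of-the-envelope check of those numbers. The sketch assumes 128K means 128 × 1024 tokens and takes k = 128 from the upper end of the range quoted above; the 67 GB and 0.125 GB values are the figures from this post, not measurements produced by the code.

```python
L = 128 * 1024   # 128K-token context (assumed to mean 131,072 tokens)
k = 128          # upper end of the top-k range quoted above

dense_pairs = L * L       # query-key pairs scored by dense attention
sparse_pairs = L * k      # pairs scored once each query keeps only k tokens
speedup = dense_pairs / sparse_pairs   # simplifies to L / k

dense_kv_gb = 67.0        # figure quoted in this post
sparse_kv_gb = 0.125      # figure quoted in this post
reduction_pct = 100 * (1 - sparse_kv_gb / dense_kv_gb)

print(f"dense pairs:  {dense_pairs:,}")
print(f"sparse pairs: {sparse_pairs:,}")
print(f"attention work ratio: {speedup:.0f}x")
print(f"memory reduction: {reduction_pct:.1f} %")
```

The ratio is simply L / k, so it keeps improving as the context gets longer while k stays fixed.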

This allows long-context reasoning, multi-document summarization, or retrieval-augmented pipelines to run on a single 80 GB GPU with stable latency and flat cost growth.

The result is a drastic reduction in GPU inference cost.

Deployment TL;DR (what finally worked)

  • Model: deepseek-ai/DeepSeek-V3.2-Exp (MoE; 163 shards; several 4.30 GB + a few 1.86 GB)

  • Runtime: vLLM (OpenAI-compatible)

  • Parallelism:

    • Tried -dp 8 --enable-expert-parallel → hit NCCL/TCPStore “broken pipe” issues

    • Stable bring-up: -tp 8 (Tensor Parallel across 8 H200s)

  • Warmup: Long FP8 GEMM warmups + CUDA graph capture on first run (subsequent restarts are much faster due to cache)

  • Metrics: vLLM /metrics + Prometheus + Grafana (node_exporter + dcgm-exporter recommended)

  • Client validation: One-file OpenAI-compatible Python script; plus lm-eval for GSM8K

  • Grafana: Dashboard parameterized with $model_name = deepseek-ai/DeepSeek-V3.2-Exp

  • Cloud provider: Shadeform (DataCrunch node in Iceland)

  • Total cost: $54 for 2 hours

Details for Developers

Minimum Requirement

According to the vLLM recipe for DeepSeek, the recommended GPUs are B200 or H200.

You also need Python 3.12 with CUDA 13.

GPU Hunting Strategy

For quick and affordable GPU experiments, I usually rely on shadeform.ai or runpod.ai. Luckily, I had some shadeform.ai credits left, so I used them for this run, and the setup was surprisingly smooth.

B200 Trial

First I tried to get a B200 node, but either no bare-metal (BM) node was available or, in some cases, I could not get the NVIDIA driver working.

shadeform@dawvygtc:~$ sudo apt install cuda-drivers
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
cuda-drivers is already the newest version (580.95.05-0ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 165 not upgraded.
shadeform@dawvygtc:~$ lspci | grep -i nvidia
17:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
3d:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
60:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
70:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
98:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
bb:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
dd:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
ed:00.0 3D controller: NVIDIA Corporation Device 2901 (rev a1)
shadeform@dawvygtc:~$ nvidia-smi
No devices were found
shadeform@dawvygtc:~$

I could have kept troubleshooting, but I didn’t want to pay $35/hour while struggling with environment issues. I ended up killing the node and looking for another one.

H200 + Ubuntu 24 + Nvidia Driver 580 — Worked

Because a full H200 node costs at least $25/hour, I didn’t want to spend time provisioning Ubuntu 22 and then upgrading to Python 3.12. Instead, I looked for an H200 image with Ubuntu 24 to reduce setup time. I ended up renting a DataCrunch H200 server in Iceland; on the first try, the Python and CUDA requirements lined up with minimal hassle, so I moved forward (it still wasn’t entirely smooth). I chose Ubuntu 24 with CUDA 12.8.

To get PyTorch working, the version numbers have to match exactly. For NVIDIA driver 580, use CUDA 13.

# --- Install Python & pip ---
sudo apt install -y python3 python3-pip
pip install --upgrade pip

# --- Install uv package manager (optional, faster) ---
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# --- Create and activate virtual environment ---
uv venv
source .venv/bin/activate

# --- Install PyTorch nightly build with CUDA 13.0 support ---
uv pip install --pre torch torchvision --index-url https://download.pytorch.org/whl/nightly/cu130

# --- Verify that PyTorch sees the GPUs (this should print "True") ---
python -c "import torch; print(torch.cuda.is_available())"

Once the commands above work, install vLLM:

# --- Install vLLM and dependencies ---
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
uv pip install https://wheels.vllm.ai/dsv32/deep_gemm-2.1.0%2B594953a-cp312-cp312-linux_x86_64.whl

# --- Install supporting Python libraries ---
uv pip install openai transformers accelerate numpy --quiet

# --- Verify vLLM environment ---
python -c "import torch, vllm, transformers, numpy; print('✅ Environment ready')"

System Validation script

python3 system_validation.py
======================================================================
SYSTEM INFORMATION
======================================================================

OS: Linux 6.8.0-79-generic
Python: 3.12.3
PyTorch: 2.8.0+cu128
CUDA available: True
CUDA version: 12.8
cuDNN version: 91002
Number of GPUs: 8

======================================================================
GPU DETAILS
======================================================================

GPU[0]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[1]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[2]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[3]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[4]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[5]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[6]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

GPU[7]:  Name: NVIDIA H200  Compute Capability: 9.0  Memory: 150.11 GB  Multi-Processors: 132  Status: ✅ Hopper architecture - Supported

Total GPU Memory: 1200.88 GB

======================================================================
NVLINK STATUS
======================================================================

✅ NVLink detected - Multi-GPU performance will be optimal

======================================================================
CONFIGURATION RECOMMENDATIONS
======================================================================

✅ Sufficient GPU memory for DeepSeek-V3.2-Exp
   Recommended mode: EP/DP (--dp 8 --enable-expert-parallel)

Here is another catch: the official vLLM recipe recommends Expert Parallelism + Data Parallelism (EP/DP). I would not recommend it on H200 unless you have extra time to troubleshoot EP/DP issues.

For a single full H200 node, I recommend the fallback Tensor Parallel mode instead:

vllm serve deepseek-ai/DeepSeek-V3.2-Exp -tp 8

Downloading the model (what to expect)

DeepSeek-V3.2-Exp has a large number of shards (model-00001-of-000163.safetensors …). Most shards are ~4.30 GB, with some at ~1.86 GB. I ran 8 parallel downloads: at ~28–33 MB/s per stream, 8 at once gives ~220–260 MB/s aggregate (sar showed ~239 MB/s).
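To set expectations for the wait, here is a rough estimate built from those figures. It is an upper bound, since it treats every one of the 163 shards as 4.30 GB (some are only ~1.86 GB), and it assumes 1 GB = 1000 MB; the function name is just for illustration.

```python
def estimate_download_minutes(num_shards, shard_gb, aggregate_mb_per_s):
    """Minutes needed to pull num_shards shards at a steady aggregate rate."""
    total_mb = num_shards * shard_gb * 1000   # assumes 1 GB = 1000 MB
    return total_mb / aggregate_mb_per_s / 60


# Aggregate bandwidth from 8 streams at the per-stream rates seen above.
low_mb_s, high_mb_s = 8 * 28, 8 * 33

# Upper bound: pretend all 163 shards are the larger 4.30 GB size.
upper_bound_min = estimate_download_minutes(163, 4.30, 239)

print(f"aggregate bandwidth: {low_mb_s}-{high_mb_s} MB/s")
print(f"download time (upper bound): {upper_bound_min:.0f} minutes")
```

At roughly $25/hour for the node, that download alone is a noticeable slice of the bill, which is a good argument for keeping the weights on a persistent volume if your provider offers one.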

What the long warm-up logs mean

You’ll see long sequences like:

  • DeepGemm(fp8_gemm_nt) warmup (...) 8192/8192

  • DeepGemm(m_grouped_fp8_gemm_nt_contiguous) warmup (W=torch.Size([..., ..., ...]))

  • Capturing CUDA graphs (mixed prefill-decode, PIECEWISE/FULL)

What is happening:

  • vLLM and its kernels are profiling and compiling FP8 GEMMs for many layer shapes.

  • MoE models use grouped GEMMs, hence the m_grouped warmups.

  • CUDA Graphs are being captured for common prefill/decode paths to minimize runtime launch overhead.

  • The first start is the slowest. Compiled graphs and torch.compile artifacts are cached under ~/.cache/vllm/torch_compile_cache//rank_*/backbon…, so subsequent restarts are much faster.

Near the end of startup you will also see a line like:

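    Maximum concurrency for 163,840 tokens per request: 5.04x

That’s vLLM reporting its KV-cache capacity: with the memory left after loading the weights, the cache can hold roughly 5.04 requests at the full 163,840-token context length at the same time. Shorter requests leave room for many more.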
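A rough way to read that log line, assuming vLLM derives it as KV-cache token capacity divided by the maximum context length (that is my understanding of the log, not something I verified in the source code). The 8K-token request size is a made-up example.

```python
max_model_len = 163_840     # tokens per request, from the log line
max_concurrency = 5.04      # value printed by vLLM

# Implied number of tokens the KV cache can hold across all requests.
kv_cache_tokens = max_concurrency * max_model_len


def requests_that_fit(tokens_per_request):
    """How many requests of a given length fit in the cache at once."""
    return kv_cache_tokens / tokens_per_request


print(f"implied KV-cache capacity: {kv_cache_tokens:,.0f} tokens")
print(f"8K-token requests that fit: {requests_that_fit(8192):.0f}")
```

So the same cache that fits about five maximum-length requests fits on the order of a hundred 8K-token requests, which matters when you size num_concurrent for an evaluation run.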

Common bring-up errors & fixes

Symptoms: TCPStore sendBytes... Broken pipe, Failed to check the "should dump" flag, API returns HTTP 500, server shuts down.

Usual causes & fixes:

  • A worker/rank died (OOM, kernel assert, unexpected shape) → All ranks try to talk to a dead TCPStore → broken pipe spam.

  • Mismatched parallelism vs GPU count → keep it simple: -tp 8 on 8 GPUs; only 1 form of parallelism while stabilizing.

  • No IB on the host? → export NCCL_IB_DISABLE=1

  • Kernel/driver hiccups → verify nvidia-smi is stable; check dmesg.

  • Don’t send traffic during warmup/graph capture; wait until you see the final “All ranks ready”/Uvicorn startup logs.
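To avoid hitting the server too early, a small poller can block until it answers. This is a minimal sketch that assumes the server exposes a GET /health endpoint on port 8000 returning HTTP 200 once it is up; wait_until_ready and http_ok are hypothetical helper names, and the attempt count and delay are arbitrary.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, probe=http_ok, attempts=120, delay_s=15.0, sleep=time.sleep):
    """Poll probe(url) until it succeeds; return the attempt number or None."""
    for attempt in range(1, attempts + 1):
        if probe(url):
            return attempt
        sleep(delay_s)
    return None


# Example call (assumed URL), left commented out so importing this file
# never blocks on the network:
# wait_until_ready("http://127.0.0.1:8000/health")
```

The probe and sleep arguments are injectable so the loop can be exercised without a running server.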

Metrics: Prometheus & exporters

You can deploy the monitoring stack directly from the Git repo:

docker compose up -d

You should then be able to access the Grafana UI with the default user/password (admin/admin):

http://<publicIP>:3000

Add the Prometheus data source (default settings), then import the Grafana dashboard JSON customized for DeepSeek V3.2.
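If a Grafana panel stays empty, it helps to look at what the /metrics endpoint actually returns. Below is a minimal sketch of a parser for the Prometheus text format that keeps only series starting with vllm:. The SAMPLE text is illustrative, not captured output, and the parser assumes samples carry no trailing timestamp.

```python
def parse_metrics(text, prefix="vllm:"):
    """Parse Prometheus text-format lines into {series: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field (assumes no timestamp).
        series, _, value = line.rpartition(" ")
        if not series.startswith(prefix):
            continue
        try:
            samples[series] = float(value)
        except ValueError:
            continue
    return samples


# Illustrative sample only; real output has many more series.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="deepseek-ai/DeepSeek-V3.2-Exp"} 3.0
vllm:num_requests_waiting{model_name="deepseek-ai/DeepSeek-V3.2-Exp"} 0.0
process_cpu_seconds_total 12.5
"""

parsed = parse_metrics(SAMPLE)
for series, value in parsed.items():
    print(series, value)
```

Against a live server you would feed it the body of a GET to the /metrics endpoint instead of SAMPLE. The model_name label is what the dashboard's $model_name variable filters on.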

Now — Show time

If you see the Uvicorn logs, you can start firing tests and validation.

Final Output
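Before running a full evaluation, a quick smoke test confirms the endpoint answers at all. This is a minimal standard-library sketch, not the script from the repo. It assumes the server listens on 127.0.0.1:8000 and serves the OpenAI-compatible /v1/completions route used by the lm-eval commands in this post; build_request and complete are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000/v1/completions"   # assumed local endpoint
MODEL = "deepseek-ai/DeepSeek-V3.2-Exp"


def build_request(prompt, max_tokens=64, temperature=0.0):
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, timeout=120):
    """Send the prompt and return the first completion's text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Example call, left commented out so the file can be imported offline:
# print(complete("Q: What is 12 * 7?\nA:"))
```

If this returns text, the server is serving and the heavier benchmark run is worth starting.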

Baseline Evaluation (GSM8K default, 5-shot)

 lm-eval --model local-completions --tasks gsm8k   --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False

It can take a few minutes to load all the tests.

INFO 10-08 01:58:52 [__init__.py:224] Automatically detected platform cuda.
2025-10-08:01:58:55 INFO     [__main__:446] Selected Tasks: ['gsm8k']
2025-10-08:01:58:55 INFO     [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-10-08:01:58:55 INFO     [evaluator:240] Initializing local-completions model, with arguments: {'model': 'deepseek-ai/DeepSeek-V3.2-Exp', 'base_url': 'http://127.0.0.1:8000/v1/completions', 'num_concurrent': 100, 'max_retries': 3, 'tokenized_requests': False}
2025-10-08:01:58:55 INFO     [models.api_models:170] Using max length 2048 - 1
2025-10-08:01:58:55 INFO     [models.api_models:189] Using tokenizer huggingface
README.md: 7.94kB [00:00, 18.2MB/s]
main/train-00000-of-00001.parquet: 100%|██████████| 2.31M/2.31M [00:01<00:00, 1.86MB/s]
main/test-00000-of-00001.parquet: 100%|██████████| 419k/419k [00:00<00:00, 1.38MB/s]
Generating train split: 100%|██████████| 7473/7473 [00:00<00:00, 342925.03 examples/s]
Generating test split: 100%|██████████| 1319/1319 [00:00<00:00, 212698.46 examples/s]
2025-10-08:01:59:02 INFO     [evaluator:305] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2025-10-08:01:59:02 INFO     [api.task:434] Building contexts for gsm8k on rank 0...
100%|██████████| 1319/1319 [00:03<00:00, 402.50it/s]
2025-10-08:01:59:05 INFO     [evaluator:574] Running generate_until requests
2025-10-08:01:59:05 INFO     [models.api_models:692] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API: 100%|██████████| 1319/1319 [04:55<00:00,  4.47it/s]
fatal: not a git repository (or any of the parent directories): .git
2025-10-08:02:04:03 INFO     [loggers.evaluation_tracker:280] Output path not provided, skipping saving results aggregated
local-completions (model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1

Final result, which matches the official documentation:

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9507|±  |0.0060|
|     |       |strict-match    |     5|exact_match|↑  |0.9484|±  |0.0061|
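As a sanity check, the Stderr column and the request rate can be reproduced from the numbers in the log. The sketch assumes the harness reports the standard error of a mean of 0/1 scores using the sample standard deviation (n - 1); that is my assumption about how the figure is computed, and the helper name is made up.

```python
import math

n = 1319                  # GSM8K test examples, from the log
elapsed_s = 4 * 60 + 55   # "Requesting API" took 04:55


def exact_match_stderr(accuracy, n):
    """Standard error of a mean of 0/1 scores (sample std, assumed)."""
    return math.sqrt(accuracy * (1 - accuracy) / (n - 1))


flexible_stderr = exact_match_stderr(0.9507, n)
strict_stderr = exact_match_stderr(0.9484, n)
requests_per_s = n / elapsed_s

print(f"flexible-extract stderr: {flexible_stderr:.4f}")
print(f"strict-match stderr:     {strict_stderr:.4f}")
print(f"throughput: {requests_per_s:.2f} requests/s")
```

With 1,319 questions the standard error is about 0.6 percentage points, so differences of a few tenths of a point between runs are within noise.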

Few-Shot Evaluation (20 examples)

 lm-eval --model local-completions --tasks gsm8k   --model_args model=deepseek-ai/DeepSeek-V3.2-Exp,base_url=http://127.0.0.1:8000/v1/completions,num_concurrent=100,max_retries=3,tokenized_requests=False --num_fewshot 20

The results look good.

You can watch the Grafana dashboard for analytics while the evaluation runs.

Figure: Grafana vLLM dashboard for DeepSeek V3.2 Exp
