vLLM Benchmarking & LLM Inference Optimization with NVIDIA GenAI-Perf
A practical guide to setting up vLLM inference, benchmarking with NVIDIA GenAI-Perf, and building an observability stack using Prometheus and Grafana.
After spending hours dealing with ChatGPT hallucinations, I finally had to do a Google search to find the right tool for LLM inference benchmarking. It turns out NVIDIA has done a great job creating a robust tool that can be used across different platforms, including Triton and OpenAI-compatible APIs.
LLM benchmarking can be confusing, as people often mix up LLM performance testing with benchmarking. Performance testing validates the overall capacity of your server infrastructure, including network latency, CPU performance, and other system-level throughputs. Benchmarking tools, on the other hand, primarily focus on LLM inference engine–specific parameters, which are critical if you are planning to run your own inference platform — something most enterprises are now focusing on.
This post is the first in a series I will be writing as I go through the process of learning and experimenting with vLLM-based inference solutions, along with insights from real-world experience operating LLM inference platforms in enterprise environments.
Here are some of the most common inference use cases.
In this example, we will set up a single node that serves as both the inference and benchmarking host, for experimentation purposes. For production use, the benchmarking tool should run from a separate node.
Prerequisites
For decent benchmarking, you need the following to get started:
- NVIDIA GPU–powered compute platform. This can be your desktop, or you can use any of the Neo Cloud providers. My obvious preference is Denvr Cloud. Feel free to sign up: https://www.denvr.com/
- Hugging Face login. Sign up for a free Hugging Face account. You'll need it to download models and access gated models such as Meta Llama and others.
- LLM-labs repo: https://github.com/kchandan/llm-labs
Step-by-step guide
To install the necessary packages on the Linux VM (e.g., NVIDIA drivers, Docker, etc.), the easiest approach is to update the IP address in the Ansible inventory file and then let the playbook handle the full installation.
cat llmops/ansible/inventory/hosts.ini
; [vllm_server]
; server_name ansible_user=ubuntu
[llm_workers]
<IP Address> ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/<your_key_file>

Once the IP address is updated, fire the Ansible playbook to install the required packages:
(venv) ➜ llmops git:(main) ✗ ansible-playbook -i ansible/inventory/hosts.ini ansible/setup_worker.yml
PLAY [Setup worker nodes] **********************************************************************************************************************************************
TASK [Gathering Facts] *************************************************************************************************************************************************
[WARNING]: Host is using the discovered Python interpreter at '/usr/bin/python3.12', but future installation of another Python interpreter could cause a different interpreter to be discovered.
ok: [worker-node]
TASK [docker_install : Update apt and install prerequisites] ***********************************************************************************************************
ok: [worker-node]
TASK [docker_install : Create directory for Docker keyrings] ***********************************************************************************************************
ok: [worker-node]
TASK [docker_install : Download Docker GPG key] ************************************************************************************************************************
ok: [worker-node]
TASK [docker_install : Add Docker repository to apt sources] ***********************************************************************************************************
changed: [worker-node]
TASK [docker_install : Update apt cache after adding Docker repo] ******************************************************************************************************
changed: [worker-node]
TASK [docker_install : Install Docker packages] ************************************************************************************************************************
ok: [worker-node]
TASK [docker_install : Ensure Docker service is enabled and started] ***************************************************************************************************
ok: [worker-node]
TASK [docker_install : Add ubuntu user to docker group] ****************************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Download cuda-keyring deb] **********************************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Install cuda-keyring deb (dpkg)] ****************************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : apt update] *************************************************************************************************************************************
changed: [worker-node]
TASK [nvidia-toolkit : Install cuda-drivers] ***************************************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Install prerequisites] **************************************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Create keyring directory if missing] ************************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Download NVIDIA container toolkit GPG key] ******************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Convert GPG key to dearmor format] **************************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Add NVIDIA container toolkit apt repository] ****************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Enable experimental repository (optional)] ******************************************************************************************************
skipping: [worker-node]
TASK [nvidia-toolkit : Update apt cache after repo add] ****************************************************************************************************************
changed: [worker-node]
TASK [nvidia-toolkit : Install NVIDIA Container Toolkit packages] ******************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Configure NVIDIA Docker runtime] ****************************************************************************************************************
ok: [worker-node]
TASK [nvidia-toolkit : Restart Docker] *********************************************************************************************************************************
changed: [worker-node]
PLAY RECAP *************************************************************************************************************************************************************
worker-node : ok=22 changed=5 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
Post installation, verify that the NVIDIA driver looks good:
ubuntu@llmops:~/llm-labs$ nvidia-smi
Sun Jan 11 21:53:01 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01 Driver Version: 590.48.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-40GB Off | 00000000:0A:00.0 Off | 0 |
| N/A 47C P0 50W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Create the common docker bridge network so that all containers can talk to each other (default bridge driver):
docker network create llmops-net

Export the Hugging Face token:
export HF_TOKEN=hf_token

Now launch the vLLM Docker Compose stack; it will take some time to load the model:
ubuntu@llmops:~/llm-labs/llmops/vllm$ docker compose -f docker-compose-vllm-qwen3-0.6B.yml up -d
[+] up 1/1
✔ Container vllm Created 0.3s
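For reference, the compose file for the vLLM service looks roughly like the sketch below. This is an illustrative outline only; the actual docker-compose-vllm-qwen3-0.6B.yml in the repo may differ in details such as volume mounts and server flags, and the GPU reservation syntax shown assumes Docker Compose v2 with the NVIDIA runtime configured:

```yaml
# Sketch only -- check the repo for the authoritative file.
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm
    command: ["--model", "Qwen/Qwen3-0.6B"]
    environment:
      - HF_TOKEN=${HF_TOKEN}   # token exported earlier
    ports:
      - "8000:8000"            # OpenAI-compatible API + /metrics
    networks:
      - llmops-net
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

networks:
  llmops-net:
    external: true             # the bridge network created above
```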
ubuntu@llmops:~/llm-labs/llmops/vllm$ docker compose -f docker-compose.monitoring.yml up -d
WARN[0000] Found orphan containers ([vllm]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
✔ Container prometheus Created 0.5s
✔ Container dcgm-exporter Created 0.5s
✔ Container node-exporter Created 0.5s
✔ Container cadvisor Created 0.5s
✔ Container grafana                 Created

Ignore the orphan container warning. I have kept these two compose files as separate deliverables so that more model-specific compose files can be added to the same repo later.
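For the monitoring stack to work, Prometheus needs to scrape each exporter over the shared bridge network. A minimal scrape configuration along these lines should do it (the job names are my choice; the target names assume the container names shown in docker ps below, and note that vLLM itself exposes Prometheus metrics at /metrics on its API port):

```yaml
# prometheus.yml (sketch; the repo's copy may differ)
scrape_configs:
  - job_name: vllm
    static_configs:
      - targets: ["vllm:8000"]            # vLLM serves /metrics here
  - job_name: dcgm
    static_configs:
      - targets: ["dcgm-exporter:9400"]   # GPU metrics
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]   # host CPU/memory/disk
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]        # per-container metrics
```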
Once all containers are pulled and running, docker ps should look like this, with no containers stuck in a crash loop:
ubuntu@llmops:~/llm-labs/llmops/vllm$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
750f8e14201d grafana/grafana:latest "/run.sh" 58 seconds ago Up 58 seconds 0.0.0.0:3000->3000/tcp, [::]:3000->3000/tcp grafana
270c865726e9 prom/prometheus:latest "/bin/prometheus --c…" 59 seconds ago Up 58 seconds 0.0.0.0:9090->9090/tcp, [::]:9090->9090/tcp prometheus
f679c2313fd2 gcr.io/cadvisor/cadvisor:latest "/usr/bin/cadvisor -…" 59 seconds ago Up 58 seconds (healthy) 0.0.0.0:8080->8080/tcp, [::]:8080->8080/tcp cadvisor
28873c028c0b prom/node-exporter:latest "/bin/node_exporter …" 59 seconds ago Up 58 seconds 0.0.0.0:9100->9100/tcp, [::]:9100->9100/tcp node-exporter
5e3f54b8f485 nvidia/dcgm-exporter:latest "/usr/local/dcgm/dcg…" 59 seconds ago Up 58 seconds 0.0.0.0:9400->9400/tcp, [::]:9400->9400/tcp dcgm-exporter
3b002c0b1d47 vllm/vllm-openai:latest "vllm serve --model …" About a minute ago Up About a minute 0.0.0.0:8000->8000/tcp, [::]:8000->8000/tcp vllm
Installing NVIDIA GenAI-Perf
Now that the vLLM inference base is set up, the next step is to install NVIDIA GenAI-Perf:
pip install genai-perf

Do a quick test run to see if everything is working:
genai-perf profile \
-m Qwen/Qwen3-0.6B \
--endpoint-type chat \
--synthetic-input-tokens-mean 200 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 100 \
--output-tokens-stddev 0 \
--streaming \
--request-count 50 \
--warmup-request-count 10

You should see output like this:
[2026-01-11 23:53:27] DEBUG Inferred tokenizer from model name: Qwen/Qwen3-0.6B
[2026-01-11 23:53:27] INFO Profiling these models: Qwen/Qwen3-0.6B
[2026-01-11 23:53:27] INFO Model name 'Qwen/Qwen3-0.6B' cannot be used to create artifact directory. Instead, 'Qwen_Qwen3-0.6B' will be used.
[2026-01-11 23:53:27] INFO Creating tokenizer for: Qwen/Qwen3-0.6B
[2026-01-11 23:53:29] INFO Running Perf Analyzer : 'perf_analyzer -m Qwen/Qwen3-0.6B --async --warmup-request-count 10 --stability-percentage 999 --request-count 50 -i http --concurrency-range 1 --service-kind openai --endpoint v1/chat/completions --input-data artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/inputs.json --profile-export-file artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/profile_export.json'
[2026-01-11 23:53:52] INFO Loading response data from 'artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/profile_export.json'
[2026-01-11 23:53:52] INFO Parsing total 50 requests.
Progress: 100%|████████████████████████████████████████████| 50/50 [00:00<00:00, 260.92requests/s]
NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Statistic ┃ avg ┃ min ┃ max ┃ p99 ┃ p90 ┃ p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ Time To First Token (ms) │ 12.79 │ 11.14 │ 16.74 │ 15.22 │ 13.30 │ 13.05 │
│ Time To Second Token (ms) │ 3.18 │ 3.06 │ 3.73 │ 3.57 │ 3.27 │ 3.24 │
│ Request Latency (ms) │ 336.79 │ 324.87 │ 348.00 │ 347.84 │ 346.32 │ 345.02 │
│ Inter Token Latency (ms) │ 3.27 │ 3.17 │ 3.39 │ 3.39 │ 3.37 │ 3.36 │
│ Output Token Throughput Per User │ 305.64 │ 295.21 │ 315.82 │ 315.69 │ 312.30 │ 311.15 │
│ (tokens/sec/user) │ │ │ │ │ │ │
│ Output Sequence Length (tokens) │ 99.98 │ 99.00 │ 100.00 │ 100.00 │ 100.00 │ 100.00 │
│ Input Sequence Length (tokens) │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
│ Output Token Throughput (tokens/sec) │ 296.71 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Throughput (per sec) │ 2.97 │ N/A │ N/A │ N/A │ N/A │ N/A │
│ Request Count (count) │ 50.00 │ N/A │ N/A │ N/A │ N/A │ N/A │
└──────────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
[2026-01-11 23:53:52] INFO Generating artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/profile_export_genai_perf.json
[2026-01-11 23:53:52] INFO Generating artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/profile_export_genai_perf.csv
If you are able to see these metrics from GenAI-Perf, it means your setup is complete.
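It is worth sanity-checking that these numbers hang together. For a streaming run, per-user throughput is roughly the inverse of inter-token latency, and request latency is roughly TTFT plus one decode step per remaining token. A quick back-of-the-envelope check against the table above:

```python
# Sanity-check the GenAI-Perf averages from the table above.
ttft_ms = 12.79    # Time To First Token (avg)
itl_ms = 3.27      # Inter Token Latency (avg)
out_tokens = 100   # --output-tokens-mean

# Per-user decode throughput ~ inverse of inter-token latency.
tokens_per_sec_per_user = 1000.0 / itl_ms
print(f"{tokens_per_sec_per_user:.1f} tok/s/user")  # ~305.8; table reports 305.64

# Request latency ~ TTFT + (n - 1) decode steps.
est_latency_ms = ttft_ms + (out_tokens - 1) * itl_ms
print(f"{est_latency_ms:.1f} ms")                   # ~336.5; table reports 336.79
```

If these estimates diverge badly from the reported averages, something is off in the run (for example, non-streaming mode or highly variable output lengths).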
Setting up the Grafana Dashboard
Now let's move on to setting up the Grafana dashboard.
First, ensure that you have configured the Prometheus backend in Grafana. By default, it points to localhost, so we need to switch it to prometheus, matching the service name used in the Docker Compose file.
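If you prefer provisioning over clicking through the UI, a datasource file along these lines does the same thing. The file path and datasource name are my assumptions; the repo may already ship an equivalent provisioning file:

```yaml
# grafana/provisioning/datasources/prometheus.yml (path is an assumption)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090   # compose service name, not localhost
    isDefault: true
```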
As part of the Docker Compose setup, Grafana should automatically pick up the dashboard (NVIDIA DCGM + vLLM).
You should now be able to see the metrics flowing into the Grafana dashboard.
What's Next
At this point, what we have achieved is a basic "hello-world" setup for our LLM benchmarking infrastructure. The next big challenge is to benchmark properly and identify how we can tweak vLLM parameters and GenAI-Perf settings to squeeze the maximum out of the hardware. In this example, I am using a single A100-40GB GPU. It may not sound like much, but these are very powerful cards and work extremely well for agentic workflows where small language models are heavily used.
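As a first step toward that, one simple experiment is sweeping GenAI-Perf's concurrency to find where throughput saturates and latency starts to degrade. A sketch of such a sweep, reusing the flags from the earlier test run (check genai-perf profile --help on your version, as flags can change between releases):

```shell
# Sweep concurrency against the vLLM server from earlier (localhost:8000).
# Each run writes its own artifacts/ directory for later comparison.
for c in 1 2 4 8 16 32; do
  genai-perf profile \
    -m Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --synthetic-input-tokens-mean 200 \
    --output-tokens-mean 100 \
    --streaming \
    --request-count 200 \
    --warmup-request-count 10 \
    --concurrency "$c"
done
```

Plotting output token throughput against p99 inter-token latency across these runs gives a first picture of the card's useful operating range.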
The next blog will focus more on capturing additional metrics and logs, and on how to get the best out of your hardware.
If you are looking to collaborate, please check out — https://www.becloudready.com/dev-rel-services