TL;DR — Ship an OpenAI-compatible vLLM service on K8s, flip on continuous batching + PagedAttention, scale with KEDA on business metrics, and keep SREs calm with real SLOs.
Table of Contents
- Production Diagram
- Prereqs
- Deploy vLLM (Helm)
- Probes & No Cold Starts
- Autoscaling with KEDA
- Tuning Knobs
- Observability
- Progressive Delivery
- Cost Math
- Client Example
- Pre-flight Checklist
- Appendix: Manifests
- FAQ
The (sane) production diagram
[ Clients ] ──HTTP/2 or gRPC──> [ Gateway/Ingress ]
                                        |  (route by model/SLA/region)
                                        v
                               [ vLLM Deployments ]
                               (continuous batching,
                                PagedAttention, TP if needed)
                                        |
                  +---------------------+----------------------+
                  |                                            |
           [ GPU Nodes ]                              [ Observability ]
  (NVIDIA device plugin, MIG)                  Prometheus + Grafana + OTel
  DCGM exporter ---> Prom                     (RED + GPU + queue metrics)
- vLLM gives you OpenAI-compatible endpoints, continuous batching & PagedAttention.
- NVIDIA device plugin exposes nvidia.com/gpu; enable MIG on A100/H100 if you want steady multi-tenant capacity.
- KEDA scales on Prometheus metrics you actually care about (RPS, queue time, tokens/sec).
- DCGM exporter sends GPU util & memory to Prometheus for Grafana dashboards.
Implementation steps
Step 0 — Prereqs (yes, the boring but critical part)
- Install NVIDIA device plugin (or GPU Operator to also manage drivers/DCGM/MIG).
- (Optional) Enable MIG to partition big GPUs for isolation and predictable concurrency.
- Install DCGM exporter for GPU metrics scraping.
# Device plugin (Helm)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin -n nvidia-device-plugin --create-namespace
# DCGM exporter (Helm)
helm repo add nvidia-dcgm https://nvidia.github.io/dcgm-exporter
helm upgrade -i dcgm-exporter nvidia-dcgm/dcgm-exporter -n gpu-telemetry --create-namespace
Step 1 — Deploy vLLM (Helm quick path)
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install vllm vllm/vllm-stack -f values-vllm.yaml -n vllm --create-namespace
values-vllm.yaml (starter)
vllm:
model: meta-llama/Llama-3.1-8B-Instruct
dtype: auto
tensorParallelSize: 1
gpuMemoryUtilization: 0.90
maxModelLen: 8192
service:
type: ClusterIP
port: 8000
resources:
limits: { nvidia.com/gpu: 1, cpu: "4", memory: "16Gi" }
requests: { nvidia.com/gpu: 1, cpu: "2", memory: "12Gi" }
Step 2 — Health, readiness, and “no cold starts, please”
Use a startup probe that tolerates model load time and a readiness probe that flips green only after the server is ready.
readinessProbe:
httpGet: { path: /health, port: 8000 }
periodSeconds: 5
initialDelaySeconds: 20
startupProbe:
httpGet: { path: /health, port: 8000 }
periodSeconds: 5
failureThreshold: 60
Step 3 — Scale on business signals (not just CPU)
Scale replicas based on RPS/queue time/tokens per second via KEDA → Prometheus scaler.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: vllm-rps }
spec:
scaleTargetRef: { name: vllm }
pollingInterval: 5
cooldownPeriod: 60
minReplicaCount: 1
maxReplicaCount: 40
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: http_requests_total
threshold: "8" # target RPS per pod
query: rate(http_requests_total{app="vllm"}[1m])
One scaler per workload; don’t let HPA and KEDA arm-wrestle over the same Deployment.
Step 4 — Tuning the vLLM engine (knobs that actually matter)
- --gpu-memory-utilization=0.90 as a starting point; back off if you hit OOMs (flag sketch after this list).
- Continuous batching is your best friend under bursty traffic.
- Route short vs. long prompts separately to avoid head-of-line blocking.
- Multi-tenant? Use MIG for isolation.
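A minimal sketch of how those knobs land on the container args, assuming the appendix Deployment; the scheduler values here are illustrative, and exact defaults vary by vLLM version:
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct
      - --dtype=auto
      - --gpu-memory-utilization=0.90   # headroom for KV cache; back off if you see OOMs
      - --max-model-len=8192            # cap context; longer contexts eat more KV cache per request
      - --max-num-seqs=256              # upper bound on concurrently batched sequences (illustrative)
      - --max-num-batched-tokens=8192   # per-step token budget for the batching scheduler (illustrative)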
Step 5 — Observability that pays rent
Scrape vLLM /metrics + DCGM exporter. Key queries (Grafana):
- RPS
  sum(rate(http_requests_total{app="vllm"}[5m]))
- P95 Latency
  histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="vllm"}[5m])) by (le))
- GPU Utilization
  avg(DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"}) by (gpu)
- GPU Memory (GiB)
  avg(DCGM_FI_DEV_FB_USED{job="dcgm-exporter"}) by (gpu) / 1024
If GPU util < 40% and P95 is grumpy, your batching or routing is slacking.
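To get the vLLM metrics above into Prometheus with the Prometheus Operator, a ServiceMonitor is the usual glue. A minimal sketch, assuming the vllm Service carries an app: vllm label, names its 8000 port http, and your Prometheus selects ServiceMonitors labeled release: prometheus (adjust all three to your setup):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: vllm
  labels:
    release: prometheus        # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels: { app: vllm }
  endpoints:
    - port: http               # Service port name for 8000
      path: /metrics
      interval: 15s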
Step 6 — Progressive delivery (so rollouts don’t wake up on-call)
Use your traffic manager (Ingress/Service Mesh/Argo Rollouts) to do 5% → 25% → 50% → 100% canaries with guards on error rate and P95. Keep maxUnavailable: 0.
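If Argo Rollouts is your traffic manager, the 5% → 25% → 50% → 100% ladder looks roughly like the sketch below. This is a sketch, not a drop-in: vllm-slo-check is a hypothetical AnalysisTemplate you'd write to query Prometheus for error rate and P95, and a Rollout either replaces the Deployment or references it via spec.workloadRef.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: vllm
spec:
  replicas: 4
  selector:
    matchLabels: { app: vllm }
  template:
    metadata:
      labels: { app: vllm }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # full container spec as in Appendix A
  strategy:
    canary:
      maxSurge: 1
      maxUnavailable: 0
      steps:
        - setWeight: 5
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: vllm-slo-check   # hypothetical: error-rate + P95 guards
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }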
Step 7 — Cost math that fits on a napkin
Let C_gpu = $/hour per GPU pod, T_out = tokens/sec per pod.
Cost per 1K tokens ≈ (C_gpu / (T_out * 3600)) * 1000.
Push T_out up with safe batching; watch P95 and GPU mem.
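Worked example (illustrative numbers): with C_gpu = $4.00/hr and T_out = 1,200 tokens/sec per pod, cost per 1K tokens ≈ (4.00 / (1200 × 3600)) × 1000 ≈ $0.0009. Double T_out through better batching and the per-token cost halves, as long as P95 holds.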
Client: call it like OpenAI (but it’s your cluster)
from openai import OpenAI
client = OpenAI(base_url="https://your.vllm.endpoint/v1", api_key="not-used-or-any-string")
resp = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role":"user","content":"Write a haiku about gophers and GPUs."}],
temperature=0.6,
stream=True
)
for chunk in resp:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
Pre-flight checklist (tape this to your monitor)
- NVIDIA device plugin healthy; pods can request nvidia.com/gpu.
- (Optional) MIG slices configured for isolation.
- Startup vs. readiness probes separate; no cold traffic.
- Only one autoscaler controls a given Deployment.
- Prometheus scrapes vLLM + DCGM; Grafana shows RPS/P95/GPU util/mem.
- SLOs defined (e.g., P95 ≤ 60 ms, error rate < 0.5%) with alerts wired up (example rule below).
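A sketch of those SLO alerts as a PrometheusRule, assuming the Prometheus Operator, the metric names from Step 5, and that your gateway attaches a status label to failed requests (adjust selectors and thresholds to your own metrics):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-slo-alerts
  namespace: monitoring
spec:
  groups:
    - name: vllm-slo
      rules:
        - alert: VllmErrorRateHigh
          expr: |
            sum(rate(http_requests_total{app="vllm",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="vllm"}[5m])) > 0.005
          for: 10m
          labels: { severity: page }
          annotations:
            summary: vLLM error rate above the 0.5% SLO
        - alert: VllmP95LatencyHigh
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="vllm"}[5m])) by (le)) > 0.06
          for: 10m
          labels: { severity: page }
          annotations:
            summary: vLLM P95 latency above the 60 ms SLO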
Appendix — Drop-in manifests (edit & apply)
A) Bare-bones vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm
labels: { app: vllm }
spec:
replicas: 1
strategy:
type: RollingUpdate
rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }
selector:
matchLabels: { app: vllm }
template:
metadata:
labels: { app: vllm }
spec:
nodeSelector:
nvidia.com/gpu.present: "true"
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model=meta-llama/Llama-3.1-8B-Instruct
- --dtype=auto
- --gpu-memory-utilization=0.90
- --max-model-len=8192
ports: [{ containerPort: 8000, name: http }]
resources:
limits: { "nvidia.com/gpu": 1, cpu: "4", memory: "16Gi" }
requests: { "nvidia.com/gpu": 1, cpu: "2", memory: "12Gi" }
readinessProbe:
httpGet: { path: /health, port: http }
periodSeconds: 5
initialDelaySeconds: 20
startupProbe:
httpGet: { path: /health, port: http }
periodSeconds: 5
failureThreshold: 60
---
apiVersion: v1
kind: Service
metadata: { name: vllm-svc }
spec:
selector: { app: vllm }
ports: [{ port: 8000, targetPort: http }]
type: ClusterIP
B) KEDA scale by Prometheus RPS
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: vllm-rps }
spec:
scaleTargetRef: { name: vllm }
pollingInterval: 5
cooldownPeriod: 60
minReplicaCount: 1
maxReplicaCount: 40
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: http_requests_total
threshold: "8"
query: rate(http_requests_total{app="vllm"}[1m])
FAQ
Q1. Why vLLM over other servers?
Continuous batching + PagedAttention give great throughput while keeping tail latency sane. OpenAI-compatible API is icing on the cake.
Q2. Do I need MIG?
If you’re multi-tenant or want steady concurrency on A100/H100, MIG is your friend. Single-tenant, steady load? Optional.
Q3. What should I scale on—CPU or RPS?
RPS/queue time/tokens per second. CPU rarely tells the real story for GPU-bound inference.
Q4. How do I avoid cold starts?
Startup vs. readiness probes, pre-pull weights, keep a small warm pool during rollouts.
Q5. Any quick cost lever?
Batching and routing. Route short/long prompts separately; keep GPU util healthy without sacrificing P95.