TL;DR — Ship an OpenAI-compatible vLLM service on K8s, flip on continuous batching + PagedAttention, scale with KEDA on business metrics, and keep SREs calm with real SLOs.
Table of Contents
- Production Diagram
- Prereqs
- Deploy vLLM (Helm)
- Probes & No Cold Starts
- Autoscaling with KEDA
- Tuning Knobs
- Observability
- Progressive Delivery
- Cost Math
- Client Example
- Pre-flight Checklist
- Appendix: Manifests
- FAQ
The (sane) production diagram
[ Clients ] ──HTTP/2 or gRPC──> [ Gateway/Ingress ]
                                        |  (route by model/SLA/region)
                                        v
                               [ vLLM Deployments ]
                               (continuous batching,
                                PagedAttention, TP if needed)
                                        |
                  +---------------------+----------------------+
                  |                                            |
           [ GPU Nodes ]                              [ Observability ]
  (NVIDIA device plugin, MIG)                  Prometheus + Grafana + OTel
  DCGM exporter ---> Prom                     (RED + GPU + queue metrics)
- vLLM gives you OpenAI-compatible endpoints, continuous batching & PagedAttention.
- NVIDIA device plugin exposes nvidia.com/gpu; enable MIG on A100/H100 if you want steady multi-tenant capacity.
- KEDA scales on Prometheus metrics you actually care about (RPS, queue time, tokens/sec).
- DCGM exporter sends GPU util & memory to Prometheus for Grafana dashboards.
Implementation steps
Step 0 — Prereqs (yes, the boring but critical part)
- Install NVIDIA device plugin (or GPU Operator to also manage drivers/DCGM/MIG).
- (Optional) Enable MIG to partition big GPUs for isolation and predictable concurrency.
- Install DCGM exporter for GPU metrics scraping.
# Device plugin (Helm)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin -n nvidia-device-plugin --create-namespace
# DCGM exporter (Helm)
helm repo add nvidia-dcgm https://nvidia.github.io/dcgm-exporter
helm upgrade -i dcgm-exporter nvidia-dcgm/dcgm-exporter -n gpu-telemetry --create-namespace
Step 1 — Deploy vLLM (Helm quick path)
helm repo add vllm https://vllm-project.github.io/production-stack
helm repo update
helm install vllm vllm/vllm-stack -f values-vllm.yaml -n vllm --create-namespace
values-vllm.yaml (starter)
vllm:
model: meta-llama/Llama-3.1-8B-Instruct
dtype: auto
tensorParallelSize: 1
gpuMemoryUtilization: 0.90
maxModelLen: 8192
service:
type: ClusterIP
port: 8000
resources:
limits: { nvidia.com/gpu: 1, cpu: "4", memory: "16Gi" }
requests: { nvidia.com/gpu: 1, cpu: "2", memory: "12Gi" }
Step 2 — Health, readiness, and “no cold starts, please”
Use a startup probe that tolerates model load time and a readiness probe that flips green only after the server is ready.
readinessProbe:
httpGet: { path: /health, port: 8000 }
periodSeconds: 5
initialDelaySeconds: 20
startupProbe:
httpGet: { path: /health, port: 8000 }
periodSeconds: 5
failureThreshold: 60
Step 3 — Scale on business signals (not just CPU)
Scale replicas based on RPS/queue time/tokens per second via KEDA → Prometheus scaler.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: vllm-rps }
spec:
scaleTargetRef: { name: vllm }
pollingInterval: 5
cooldownPeriod: 60
minReplicaCount: 1
maxReplicaCount: 40
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: http_requests_total
threshold: "8" # target RPS per pod
query: rate(http_requests_total{app="vllm"}[1m])
One scaler per workload; don’t let HPA and KEDA arm-wrestle over the same Deployment.
Step 4 — Tuning the vLLM engine (knobs that actually matter)
- --gpu-memory-utilization=0.90 as a starting point; back off if you hit OOMs (flag sketch after this list).
- Continuous batching is your best friend under bursty traffic.
- Route short vs. long prompts separately to avoid head-of-line blocking.
- Multi-tenant? Use MIG for isolation.
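A minimal sketch of how those knobs land on the container args, assuming the appendix Deployment; the scheduler values here are illustrative, and exact defaults vary by vLLM version:
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - --model=meta-llama/Llama-3.1-8B-Instruct
      - --dtype=auto
      - --gpu-memory-utilization=0.90   # headroom for KV cache; back off if you see OOMs
      - --max-model-len=8192            # cap context; longer contexts eat more KV cache per request
      - --max-num-seqs=256              # upper bound on concurrently batched sequences (illustrative)
      - --max-num-batched-tokens=8192   # per-step token budget for the batching scheduler (illustrative)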
Step 5 — Observability that pays rent
Scrape vLLM /metrics + DCGM exporter. Key queries (Grafana):
- RPS
  sum(rate(http_requests_total{app="vllm"}[5m]))
- P95 Latency
  histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{app="vllm"}[5m])) by (le))
- GPU Utilization
  avg(DCGM_FI_DEV_GPU_UTIL{job="dcgm-exporter"}) by (gpu)
- GPU Memory (GiB)
  avg(DCGM_FI_DEV_FB_USED{job="dcgm-exporter"}) by (gpu) / 1024
If GPU util < 40% and P95 is grumpy, your batching or routing is slacking.
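To get the vLLM metrics above into Prometheus with the Prometheus Operator, a ServiceMonitor is the usual glue. A minimal sketch, assuming the vllm Service carries an app: vllm label, names its 8000 port http, and your Prometheus selects ServiceMonitors labeled release: prometheus (adjust all three to your setup):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: vllm
  labels:
    release: prometheus        # must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels: { app: vllm }
  endpoints:
    - port: http               # Service port name for 8000
      path: /metrics
      interval: 15s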
Step 6 — Progressive delivery (so rollouts don’t wake up on-call)
Use your traffic manager (Ingress/Service Mesh/Argo Rollouts) to do 5% → 25% → 50% → 100% canaries with guards on error rate and P95. Keep maxUnavailable: 0.
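If Argo Rollouts is your traffic manager, the 5% → 25% → 50% → 100% ladder looks roughly like the sketch below. This is a sketch, not a drop-in: vllm-slo-check is a hypothetical AnalysisTemplate you'd write to query Prometheus for error rate and P95, and a Rollout either replaces the Deployment or references it via spec.workloadRef.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: vllm
spec:
  replicas: 4
  selector:
    matchLabels: { app: vllm }
  template:
    metadata:
      labels: { app: vllm }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # full container spec as in Appendix A
  strategy:
    canary:
      maxSurge: 1
      maxUnavailable: 0
      steps:
        - setWeight: 5
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: vllm-slo-check   # hypothetical: error-rate + P95 guards
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 10m }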
Step 7 — Cost math that fits on a napkin
Let C_gpu = $/hour per GPU pod, T_out = tokens/sec per pod.
Cost per 1K tokens ≈ (C_gpu / (T_out * 3600)) * 1000.
Push T_out up with safe batching; watch P95 and GPU mem.
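Worked example (illustrative numbers): with C_gpu = $4.00/hr and T_out = 1,200 tokens/sec per pod, cost per 1K tokens ≈ (4.00 / (1200 × 3600)) × 1000 ≈ $0.0009. Double T_out through better batching and the per-token cost halves, as long as P95 holds.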
Client: call it like OpenAI (but it’s your cluster)
from openai import OpenAI
client = OpenAI(base_url="https://your.vllm.endpoint/v1", api_key="not-used-or-any-string")
resp = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[{"role":"user","content":"Write a haiku about gophers and GPUs."}],
temperature=0.6,
stream=True
)
for chunk in resp:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="")
Pre-flight checklist (tape this to your monitor)
- NVIDIA device plugin healthy; pods can request nvidia.com/gpu.
- (Optional) MIG slices configured for isolation.
- Startup vs. readiness probes separate; no cold traffic.
- Only one autoscaler controls a given Deployment.
- Prometheus scrapes vLLM + DCGM; Grafana shows RPS/P95/GPU util/mem.
- SLOs defined (e.g., P95 ≤ 60 ms, error rate < 0.5%) with alerts wired up (example rule below).
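A sketch of those SLO alerts as a PrometheusRule, assuming the Prometheus Operator, the metric names from Step 5, and that your gateway attaches a status label to failed requests (adjust selectors and thresholds to your own metrics):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-slo-alerts
  namespace: monitoring
spec:
  groups:
    - name: vllm-slo
      rules:
        - alert: VllmErrorRateHigh
          expr: |
            sum(rate(http_requests_total{app="vllm",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{app="vllm"}[5m])) > 0.005
          for: 10m
          labels: { severity: page }
          annotations:
            summary: vLLM error rate above the 0.5% SLO
        - alert: VllmP95LatencyHigh
          expr: |
            histogram_quantile(0.95,
              sum(rate(http_request_duration_seconds_bucket{app="vllm"}[5m])) by (le)) > 0.06
          for: 10m
          labels: { severity: page }
          annotations:
            summary: vLLM P95 latency above the 60 ms SLO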
Appendix — Drop-in manifests (edit & apply)
A) Bare-bones vLLM Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm
labels: { app: vllm }
spec:
replicas: 1
strategy:
type: RollingUpdate
rollingUpdate: { maxUnavailable: 0, maxSurge: 1 }
selector:
matchLabels: { app: vllm }
template:
metadata:
labels: { app: vllm }
spec:
nodeSelector:
nvidia.com/gpu.present: "true"
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- --model=meta-llama/Llama-3.1-8B-Instruct
- --dtype=auto
- --gpu-memory-utilization=0.90
- --max-model-len=8192
ports: [{ containerPort: 8000, name: http }]
resources:
limits: { "nvidia.com/gpu": 1, cpu: "4", memory: "16Gi" }
requests: { "nvidia.com/gpu": 1, cpu: "2", memory: "12Gi" }
readinessProbe:
httpGet: { path: /health, port: http }
periodSeconds: 5
initialDelaySeconds: 20
startupProbe:
httpGet: { path: /health, port: http }
periodSeconds: 5
failureThreshold: 60
---
apiVersion: v1
kind: Service
metadata: { name: vllm-svc }
spec:
selector: { app: vllm }
ports: [{ port: 8000, targetPort: http }]
type: ClusterIP
B) KEDA scale by Prometheus RPS
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: vllm-rps }
spec:
scaleTargetRef: { name: vllm }
pollingInterval: 5
cooldownPeriod: 60
minReplicaCount: 1
maxReplicaCount: 40
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: http_requests_total
threshold: "8"
query: rate(http_requests_total{app="vllm"}[1m])
FAQ
Q1. Why vLLM over other servers?
Continuous batching + PagedAttention give great throughput while keeping tail latency sane. OpenAI-compatible API is icing on the cake.
Q2. Do I need MIG?
If you’re multi-tenant or want steady concurrency on A100/H100, MIG is your friend. Single-tenant, steady load? Optional.
Q3. What should I scale on—CPU or RPS?
RPS/queue time/tokens per second. CPU rarely tells the real story for GPU-bound inference.
Q4. How do I avoid cold starts?
Startup vs. readiness probes, pre-pull weights, keep a small warm pool during rollouts.
Q5. Any quick cost lever?
Batching and routing. Route short/long prompts separately; keep GPU util healthy without sacrificing P95.