vLLM vs TGI vs Triton on Kubernetes: Production LLM Serving Benchmark (2026)
Honest comparison of vLLM, Hugging Face TGI, and NVIDIA Triton with TensorRT-LLM for self-hosted LLM serving on Kubernetes: throughput, latency, GPU efficiency, operational complexity, and decision matrix for Llama 3.1, Qwen, and DeepSeek deployments.
The three serious choices for self-hosted LLM serving on Kubernetes in 2026 are vLLM, Hugging Face TGI (Text Generation Inference), and NVIDIA Triton with TensorRT-LLM. They’re aimed at different parts of the cost/complexity curve, and picking the wrong one wastes weeks of engineering time and GPU hours.
This post benchmarks all three on identical hardware, explains why their relative performance varies by workload, and gives you a decision matrix based on what we’ve seen in production.
TL;DR decision matrix
| If you need… | Pick |
|---|---|
| Fastest path to production, broadest model support | vLLM |
| Tight integration with Hugging Face Hub, Inference Endpoints ergonomics | TGI |
| Highest throughput on NVIDIA H100/H200 with engineering budget to match | Triton + TRT-LLM |
| AMD MI300X or sovereign GPU | vLLM (only real option) |
| Simple OpenAI-compatible API for LiteLLM integration | vLLM or TGI (both native; Triton needs adapter) |
| Multi-model serving on shared GPU | Triton or vLLM with LoRA adapters |
| You’re experimenting and want to iterate | vLLM |
If you can’t articulate why you need Triton, use vLLM.
Benchmark setup
Hardware: 2 × NVIDIA H100 80GB on an EKS node (p5.4xlarge), NVMe local storage, 100 Gbps networking. Model: Llama 3.1 70B Instruct. Input: 1024 tokens. Output: 256 tokens. Concurrency sweep: 1, 8, 32, 128.
All three engines were tuned per their respective best-practice guides, with warm-up requests excluded. Numbers below are from our runs in Q1 2026; check the links to each project’s reference benchmarks for official numbers.
Throughput (tokens/sec, higher is better)
| Concurrency | vLLM 0.6.x | TGI 2.4.x | Triton + TRT-LLM 0.14 |
|---|---|---|---|
| 1 | 55 | 52 | 63 |
| 8 | 410 | 385 | 510 |
| 32 | 2,850 | 2,620 | 3,950 |
| 128 | 4,300 | 3,980 | 6,200 |
Time-to-first-token (ms, lower is better)
| Concurrency | vLLM | TGI | Triton |
|---|---|---|---|
| 1 | 180 | 195 | 150 |
| 8 | 220 | 240 | 175 |
| 32 | 340 | 380 | 260 |
| 128 | 620 | 700 | 410 |
Inter-token latency (ms/token, lower is better)
| Concurrency | vLLM | TGI | Triton |
|---|---|---|---|
| 1 | 17.8 | 18.2 | 15.1 |
| 32 | 21.5 | 23.0 | 18.4 |
| 128 | 26.0 | 28.5 | 21.5 |
Triton leads on every axis - by roughly 15% at concurrency 1 and up to ~45% at concurrency 128. That’s the honest headline. But the setup time for Triton was ~12 hours (TRT-LLM engine compile, config tuning, model repo setup) vs. under an hour for vLLM.
Operational complexity trade-off
The decision isn’t “fastest tokens per second” - it’s “fastest tokens per engineering-hour at your required scale”. Rough complexity index:
| Concern | vLLM | TGI | Triton + TRT-LLM |
|---|---|---|---|
| Initial deploy (YAML-to-serving) | 1 hour | 1 hour | 8-16 hours |
| Model swap | Pod restart | Pod restart | Recompile TRT engine (~30-60 min) |
| Multi-tenant / LoRA | Built-in (LoRA adapters) | Built-in | Complex |
| Python in the hot path | Yes (~10% overhead) | Rust core | C++ core |
| Observability | Prometheus native | Prometheus native | Prometheus + custom metrics |
| Community velocity | Very high | High | Moderate |
| Documentation clarity | Good | Good | Variable |
| NVIDIA support commitment | Community | Hugging Face paid option | Enterprise contracts |
If your team has strong C++/CUDA expertise and headcount to babysit the compile pipeline, Triton wins. If not, the engineering-hours cost exceeds the throughput savings until you’re at very high scale.
Deploying vLLM on Kubernetes
The simplest production-quality manifest:
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-llama-70b
namespace: llm-serving
spec:
replicas: 1 # scale via KEDA
selector:
matchLabels: {app: vllm-llama-70b}
template:
metadata:
labels: {app: vllm-llama-70b}
spec:
nodeSelector:
node.kubernetes.io/gpu-family: h100
tolerations:
- key: "nvidia.com/gpu"
operator: Exists
containers:
- name: vllm
image: vllm/vllm-openai:v0.6.4
args:
- "--model=meta-llama/Llama-3.1-70B-Instruct"
- "--tensor-parallel-size=2"
- "--dtype=bfloat16"
- "--max-model-len=16384"
- "--gpu-memory-utilization=0.90"
- "--enable-prefix-caching"
- "--enable-chunked-prefill"
- "--max-num-batched-tokens=8192"
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: token
ports:
- containerPort: 8000
resources:
requests:
cpu: "8"
memory: "64Gi"
"nvidia.com/gpu": "2"
limits:
memory: "64Gi"
"nvidia.com/gpu": "2"
readinessProbe:
httpGet: {path: /health, port: 8000}
periodSeconds: 10
failureThreshold: 30 # generous - model load takes time
startupProbe:
httpGet: {path: /health, port: 8000}
periodSeconds: 10
failureThreshold: 60 # up to 10 min for initial load
volumeMounts:
- name: model-cache
mountPath: /root/.cache/huggingface
- name: dshm
mountPath: /dev/shm
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: vllm-model-cache
- name: dshm
emptyDir: {medium: Memory, sizeLimit: 8Gi}
---
apiVersion: v1
kind: Service
metadata:
name: vllm-llama-70b
namespace: llm-serving
spec:
selector: {app: vllm-llama-70b}
ports:
- port: 80
targetPort: 8000
protocol: TCP
Key knobs:
- --tensor-parallel-size - number of GPUs the model is sharded across
- --gpu-memory-utilization=0.90 - gives the KV cache room; 0.95 works but breaks on memory spikes
- --enable-prefix-caching - helps RAG workloads with shared system prompts
- --max-model-len - trades KV cache memory for supported context length
- --max-num-batched-tokens=8192 - caps how much prefill can be packed into a single iteration
Model cache PVC prevents re-downloading 140 GB from Hugging Face on every pod restart.
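A matching PVC is simple; a sketch (storage class and size are illustrative - pick whatever in-region class gives enough read throughput for cold starts):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteOnce   # use ReadOnlyMany on a shared filesystem if several replicas should mount the same cache
  storageClassName: gp3   # illustrative
  resources:
    requests:
      storage: 300Gi    # ~140 GB of bf16 weights plus headroom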
Deploying TGI on Kubernetes
Near-identical topology:
apiVersion: apps/v1
kind: Deployment
metadata:
name: tgi-llama-70b
namespace: llm-serving
spec:
replicas: 1
selector:
matchLabels: {app: tgi-llama-70b}
template:
metadata:
labels: {app: tgi-llama-70b}
spec:
nodeSelector:
node.kubernetes.io/gpu-family: h100
tolerations:
- key: "nvidia.com/gpu"
operator: Exists
containers:
- name: tgi
image: ghcr.io/huggingface/text-generation-inference:2.4.1
args:
- "--model-id=meta-llama/Llama-3.1-70B-Instruct"
- "--num-shard=2"
- "--dtype=bfloat16"
- "--max-input-length=8192"
- "--max-total-tokens=16384"
- "--max-batch-prefill-tokens=8192"
env:
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: huggingface-token
key: token
ports:
- containerPort: 80
resources:
requests: {cpu: 8, memory: 64Gi, "nvidia.com/gpu": 2}
limits: {memory: 64Gi, "nvidia.com/gpu": 2}
volumeMounts:
- {name: model-cache, mountPath: /data}
- {name: dshm, mountPath: /dev/shm}
volumes:
- name: model-cache
persistentVolumeClaim: {claimName: tgi-model-cache}
- name: dshm
emptyDir: {medium: Memory, sizeLimit: 8Gi}
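One thing this manifest omits relative to the vLLM one: probes. TGI serves a /health endpoint on its HTTP port, so the same pattern applies; a sketch to add under the tgi container (thresholds are illustrative):

readinessProbe:
  httpGet: {path: /health, port: 80}
  periodSeconds: 10
  failureThreshold: 30
startupProbe:
  httpGet: {path: /health, port: 80}
  periodSeconds: 10
  failureThreshold: 60   # 70B weights can take several minutes to load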
TGI quirks to know:
- /v1/chat/completions is OpenAI-compatible, but a couple of OpenAI-specific fields (e.g., logit_bias pre-2.3) behave differently
- The Messages API and the Completions API are both supported; apps coupled tightly to the Completions API may need work
- Hugging Face Text Generation Inference 2.0+ added significant ROCm support for MI300X
Deploying Triton + TensorRT-LLM on Kubernetes
This is where complexity spikes. You need two steps: build the engine (CI job), then serve it (deployment).
Build step (runs on a GPU worker with the same GPU class as production):
apiVersion: batch/v1
kind: Job
metadata:
name: build-trt-llm-engine-llama70b
namespace: llm-build
spec:
template:
spec:
restartPolicy: Never
nodeSelector: {node.kubernetes.io/gpu-family: h100}
tolerations:
- key: "nvidia.com/gpu"
operator: Exists
containers:
- name: build
image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
command: ["/bin/bash", "-c"]
args:
- |
set -euo pipefail
cd /workspace
# 1. Convert HF checkpoint to TRT-LLM format
python3 /opt/trtllm/examples/llama/convert_checkpoint.py \
--model_dir /models/Llama-3.1-70B-Instruct \
--output_dir /workspace/trtllm-ckpt \
--dtype bfloat16 \
--tp_size 2
# 2. Build the engine
trtllm-build \
--checkpoint_dir /workspace/trtllm-ckpt \
--output_dir /workspace/engines/llama70b \
--gemm_plugin auto \
--max_batch_size 64 \
--max_input_len 8192 \
--max_seq_len 16384 \
--workers 2
# 3. Copy to shared storage for the serving deployment
aws s3 sync /workspace/engines/llama70b \
s3://llm-engines/llama-3.1-70b-h100-tp2/
resources:
requests: {cpu: 16, memory: 120Gi, "nvidia.com/gpu": 2}
limits: {memory: 120Gi, "nvidia.com/gpu": 2}
volumeMounts:
- {name: model-src, mountPath: /models}
volumes:
- name: model-src
persistentVolumeClaim: {claimName: hf-model-source}
Serving step:
apiVersion: apps/v1
kind: Deployment
metadata:
name: triton-llama-70b
namespace: llm-serving
spec:
replicas: 1
template:
spec:
nodeSelector: {node.kubernetes.io/gpu-family: h100}
tolerations:
- key: "nvidia.com/gpu"
operator: Exists
initContainers:
- name: pull-engine
image: amazon/aws-cli:2.17
command: ["aws", "s3", "sync", "s3://llm-engines/llama-3.1-70b-h100-tp2/", "/engines/llama70b/"]
volumeMounts:
- {name: engine-vol, mountPath: /engines}
containers:
- name: triton
image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
command: ["tritonserver"]
args:
- "--model-repository=/models"
- "--grpc-port=8001"
- "--http-port=8000"
- "--metrics-port=8002"
- "--log-verbose=1"
ports:
- {containerPort: 8000, name: http}
- {containerPort: 8001, name: grpc}
- {containerPort: 8002, name: metrics}
resources:
requests: {cpu: 8, memory: 64Gi, "nvidia.com/gpu": 2}
limits: {memory: 64Gi, "nvidia.com/gpu": 2}
volumeMounts:
- {name: engine-vol, mountPath: /models}
- {name: dshm, mountPath: /dev/shm}
volumes:
- name: engine-vol
emptyDir: {sizeLimit: 200Gi}
- name: dshm
emptyDir: {medium: Memory, sizeLimit: 8Gi}
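One thing the manifest glosses over: --model-repository must point at a full Triton model repository, not just the raw engine files, so the S3 prefix you sync should contain the whole thing. With the tensorrtllm_backend templates the layout looks roughly like this (directory names follow the upstream examples; adjust to your setup):

/models/
  ensemble/            # wires pre/post-processing and the engine into one endpoint
    config.pbtxt
    1/
  preprocessing/       # tokenization, Python backend
    config.pbtxt
    1/model.py
  tensorrt_llm/        # the engine built by the Job above
    config.pbtxt
    1/<engine files>
  postprocessing/      # detokenization, Python backend
    config.pbtxt
    1/model.py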
Key gotchas:
- Engine is GPU-family-specific. Building on H100 produces an engine that runs only on H100. Plan a CI matrix if you have mixed-GPU fleets.
- Add the Triton OpenAI frontend (tensorrtllm_backend/examples/openai) or a LiteLLM custom-provider wrapper to get an OpenAI-compatible API.
- Model loading is fast (the engine is pre-compiled) - readiness within ~30 seconds after the init container finishes.
Autoscaling all three
Use KEDA with a Prometheus scaler on LLM-specific metrics:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-llama-70b
namespace: llm-serving
spec:
scaleTargetRef:
name: vllm-llama-70b
minReplicaCount: 1
maxReplicaCount: 8
pollingInterval: 30
cooldownPeriod: 300
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus-operated.monitoring:9090
query: sum(vllm:num_requests_waiting)
threshold: "32"
- type: prometheus
metadata:
serverAddress: http://prometheus-operated.monitoring:9090
query: histogram_quantile(0.95, sum(rate(vllm:e2e_request_latency_seconds_bucket[2m])) by (le))
threshold: "5"
Metric names differ: TGI exposes tgi_request_inference_duration_seconds; Triton uses nv_inference_request_count and nv_inference_request_duration_us.
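The same ScaledObject pattern works for TGI by swapping the queries; a sketch using the request-duration histogram TGI exposes (threshold as above):

  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring:9090
      # p95 inference latency from TGI's request-duration histogram
      query: histogram_quantile(0.95, sum(rate(tgi_request_inference_duration_seconds_bucket[2m])) by (le))
      threshold: "5"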
Don’t use CPU-based HPA. LLM serving pods idle at low CPU while the GPU is saturated.
GPU efficiency and cost
A useful lens: tokens per dollar at a given latency SLO. Rough numbers for Llama 3.1 70B at p95 inter-token latency ≤ 25ms on AWS me-central-1 EKS (p5.4xlarge, 2 × H100 80GB):
| Engine | Max concurrency at SLO | Tokens/sec | Cost/hour (AED) | Tokens per AED |
|---|---|---|---|---|
| vLLM | 96 | 3,800 | ~150 | ~91,000 |
| TGI | 88 | 3,500 | ~150 | ~84,000 |
| Triton | 160 | 5,700 | ~150 | ~137,000 |
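The last column is straightforward arithmetic: tokens per AED = tokens/sec × 3,600 ÷ cost per hour. For vLLM, 3,800 × 3,600 ÷ 150 ≈ 91,000; for Triton, 5,700 × 3,600 ÷ 150 ≈ 137,000.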
Triton wins throughput-per-dollar at steady state. vLLM wins when engineering time is costed in. TGI sits between.
Serving multi-model with LoRA adapters
For teams fine-tuning the same base model with multiple LoRA adapters:
- vLLM supports multi-LoRA serving natively (--enable-lora --max-loras N). A single vLLM pod can serve many adapter variants, making it the best choice for LoRA-heavy workloads (args sketch after this list).
- TGI supports multi-LoRA since 2.2, but with a lower default maximum adapter count.
- Triton supports adapter merging, but the deployment pattern is more involved; typically one engine per adapter.
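A sketch of the vLLM container args for multi-LoRA serving - adapter names and paths are illustrative, and the adapters are assumed to already be present on the mounted cache volume:

args:
  - "--model=meta-llama/Llama-3.1-70B-Instruct"
  - "--tensor-parallel-size=2"
  - "--enable-lora"
  - "--max-loras=8"
  # each entry maps an adapter name (what clients send as "model") to its path
  - "--lora-modules"
  - "support-ar=/root/.cache/huggingface/adapters/support-ar"
  - "billing-en=/root/.cache/huggingface/adapters/billing-en"

Clients then select an adapter by setting the model field to the adapter name; the base weights are loaded once and shared across all adapters.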
GCC data sovereignty notes
All three engines run fine in sovereign GPU environments:
- NVIDIA GPU families in-region: H100 on Azure UAE North, H200 on Core42, L40S/L4 for smaller models across multiple GCC providers
- AMD MI300X in sovereign cloud: Core42’s UAE footprint and some KSA providers. vLLM only.
- Model weights: mirror Hugging Face models to an in-region S3/Blob with SSE-KMS before deploying. Don’t let production pods pull from huggingface.co directly - both for availability and for traffic residency.
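A minimal sketch of that mirroring step as a one-off Job, in the same spirit as the engine-build Job above - image, bucket, and KMS key alias are placeholders:

apiVersion: batch/v1
kind: Job
metadata:
  name: mirror-llama-70b
  namespace: llm-build
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: mirror
          image: python:3.12-slim   # illustrative; any image with pip and bash works
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -euo pipefail
              pip install --quiet "huggingface_hub[cli]" awscli
              # HF_TOKEN from the env is picked up automatically by huggingface-cli
              huggingface-cli download meta-llama/Llama-3.1-70B-Instruct --local-dir /work/llama-70b
              # push to an in-region bucket with SSE-KMS
              aws s3 sync /work/llama-70b "s3://llm-models-in-region/llama-3.1-70b/" \
                --sse aws:kms --sse-kms-key-id "alias/llm-models"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef: {name: huggingface-token, key: token}
          volumeMounts:
            - {name: work, mountPath: /work}
      volumes:
        - name: work
          emptyDir: {sizeLimit: 200Gi}   # ~140 GB of weights plus headroom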
Integration with LiteLLM
Regardless of the engine, register it in your LiteLLM config as an OpenAI-compatible provider:
model_list:
- model_name: llama-3.1-70b-selfhosted
litellm_params:
model: openai/meta-llama/Llama-3.1-70B-Instruct
api_base: http://vllm-llama-70b.llm-serving.svc.cluster.local
api_key: "dummy" # self-hosted, but field is required
model_info:
mode: chat
Dify, orchestration services, or any OpenAI SDK client then calls llama-3.1-70b-selfhosted via LiteLLM with virtual-key enforcement. See our LiteLLM guide.
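If a cloud model is registered in the same config, LiteLLM can also fail over to it when the self-hosted deployment is down or saturated. A sketch, assuming a model_list entry named gpt-4o-cloud exists and that your LiteLLM version supports router-level fallbacks:

router_settings:
  fallbacks:
    - llama-3.1-70b-selfhosted: ["gpt-4o-cloud"]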
Observability
All three expose Prometheus metrics; the names differ. Unified dashboards we ship:
- Throughput - tokens per second per replica (vLLM: rate of vllm:generation_tokens_total; TGI: tgi_generated_tokens counter; Triton: custom metrics from the TRT-LLM backend)
- Queue depth - the *_waiting / *_queued equivalents; alert when > 100 for 1 minute (alert rule example after this list)
- GPU utilization - DCGM exporter: DCGM_FI_DEV_GPU_UTIL. Anything under 70% at peak means the engine isn’t saturating the hardware; tune batch size.
- Request latency - p50/p95/p99 TTFT and end-to-end
- OOM / retry events - watch for KV cache exhaustion; if frequent, raise gpu_memory_utilization or reduce max_model_len
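With the Prometheus Operator, the queue-depth alert above maps directly onto a PrometheusRule; a sketch using the vLLM metric from the KEDA example (names and namespace are ours - adjust to your stack):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-serving-alerts
  namespace: monitoring
spec:
  groups:
    - name: llm-serving
      rules:
        - alert: LLMQueueBacklog
          expr: sum(vllm:num_requests_waiting) > 100
          for: 1m
          labels: {severity: warning}
          annotations:
            summary: vLLM request queue above 100 for 1 minute
        - alert: GPUUnderutilized
          expr: avg(DCGM_FI_DEV_GPU_UTIL) < 70
          for: 30m
          labels: {severity: info}
          annotations:
            summary: GPU utilization under 70% - check batch-size tuning if this coincides with peak traffic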
Common pitfalls
- Pod scheduled to wrong GPU class - pod runs but at 3x expected latency. Use node labels and strict node selectors per engine deployment.
- Model cache PVC not persisted - every pod restart pulls 140 GB from Hugging Face. Use a persistent cache volume, ideally one shared (ReadOnlyMany if supported).
- max_model_len set too high - the KV cache runs out of memory at moderate batch sizes. Tune based on your real prompt-length distribution, not the model’s theoretical max.
- Tensor-parallel size mismatch - TP=2 on a node exposing a single GPU may appear to start but falls back internally; check startup logs. Match TP to the GPU count per node.
- Triton engine versus runtime version skew - a TRT-LLM 0.12 engine won’t run on 0.14 runtime. Pin versions tightly.
- Streaming responses dropping at the ingress - some ingress controllers buffer SSE. Set nginx.ingress.kubernetes.io/proxy-buffering: "off" on LLM service routes (example below).
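A minimal ingress-nginx example for the streaming pitfall above (host and path are placeholders; the Service is the one defined earlier):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llm-gateway
  namespace: llm-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"   # long generations need generous read timeouts
spec:
  ingressClassName: nginx
  rules:
    - host: llm.internal.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-llama-70b
                port:
                  number: 80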
When to use which: our recommendation
For most GCC enterprise teams running self-hosted LLMs:
- Start with vLLM. Fastest to production, best community momentum, easiest to tune.
- Reach for TGI if you’re standardized on Hugging Face Enterprise or need Hugging Face’s evaluation ecosystem tightly integrated.
- Switch to Triton + TRT-LLM only when all three hold: (a) you’re past 2,000 sustained RPS on a single model, (b) you have a platform team that can own the compile pipeline, and (c) the throughput difference meaningfully reduces GPU cost at your scale.
We’ve migrated clients in both directions. For a 2,500-RPS customer-service workload on H100, Triton saved ~40% GPU cost at the price of 2 extra engineering weeks to set up properly. For a low-traffic internal assistant, migrating to Triton was a waste.
What this connects to
Self-hosted LLM serving is the generation layer in a broader stack. See:
- Production RAG Stack on Kubernetes - how LLM serving fits with vector DB, gateway, and observability
- Deploy LiteLLM Proxy on Kubernetes - centralized routing, virtual keys, fallback from self-hosted to cloud
- Deploy Langfuse on Kubernetes - tracing and cost attribution for generated tokens
Getting help
NomadX operates self-hosted LLM serving for GCC enterprise teams on both NVIDIA and AMD sovereign clouds. If you want a benchmark against your real workload, a capacity model for your traffic pattern, or a production cutover from managed to self-hosted, our AI/ML Infrastructure on Kubernetes engagement is the starting point. Typical scope: 3-6 weeks, including model validation and load-test sign-off.
Frequently Asked Questions
Which LLM serving framework should I use on Kubernetes?
For most production deployments, vLLM is the right default: fastest to deploy, best OpenAI API compatibility, actively developed, broad model support. Use Hugging Face TGI when you're already in the Hugging Face Enterprise ecosystem or need their tooling around model validation and safety. Use NVIDIA Triton with TensorRT-LLM when you need the absolute highest throughput on NVIDIA hardware and can afford a substantial operational-complexity premium - typically justified only at a few thousand sustained requests per second, or when a specific latency SLO requires TRT-LLM's optimizations.
How does TensorRT-LLM compare to vLLM on throughput?
TensorRT-LLM typically delivers 20-50% higher throughput than vLLM on equivalent NVIDIA hardware for dense models like Llama 3.1 70B, and can be 2x faster on specific workloads with speculative decoding or FP8 quantization. The cost is a model-compilation step that takes 10-60 minutes per model/GPU combination, and the resulting engine is locked to that specific GPU family and TensorRT-LLM version. vLLM requires no compilation and can switch models on restart.
Can I run vLLM, TGI, or Triton on AMD MI300X GPUs?
vLLM has first-class ROCm support on AMD MI300X since v0.4, and production deployments on MI300X are increasingly common in sovereign cloud (including GCC). TGI has ROCm support since v2.0 but fewer optimizations than vLLM on AMD. Triton with TensorRT-LLM is NVIDIA-only. If you're committed to AMD or a sovereign cloud with MI300X, vLLM is the practical choice.
How do I autoscale LLM serving pods on Kubernetes?
LLM serving pods cannot scale on CPU because the workload is GPU-bound. Use KEDA with a Prometheus scaler on a custom metric: request queue length, in-flight token count, or p95 latency. A working pattern: scale up when queue length exceeds 32 for 30 seconds, scale down when queue is empty for 5 minutes. Because pod startup takes 30-120 seconds (model load from disk or S3), over-provision by 20-40% rather than relying on reactive scaling for bursty traffic.
How much GPU memory does a 70B model need?
Llama 3.1 70B requires roughly 140 GB of GPU memory in FP16 for weights alone, plus 20-40 GB for the KV cache at moderate batch sizes, totaling ~160-180 GB. That fits tightly on 2 × H100 80GB (with a constrained KV-cache budget), comfortably on 2 × H200 141GB, or on 1 × H200 141GB with aggressive quantization. INT8 quantization halves the weight memory to ~70 GB, allowing 1 × H100 80GB deployments with reduced context length. FP8 on Hopper-class GPUs gives similar memory savings with better quality preservation.
How do I serve self-hosted LLMs to a LiteLLM gateway?
vLLM and TGI both expose OpenAI-compatible APIs out of the box. Deploy them with a ClusterIP service, then register them as OpenAI-compatible models in your LiteLLM proxy config. Triton requires a small adapter (or the Triton OpenAI-compatible frontend released in 2024) to match the OpenAI schema. Once registered, LiteLLM handles routing, fallback, virtual keys, and spend tracking across self-hosted and cloud models uniformly.
Get Started for Free
We would be happy to speak with you and arrange a free consultation with our Kubernetes Expert in Dubai, UAE. 30-minute call, actionable results in days.
Talk to an Expert