April 22, 2026 · 10 min read

vLLM vs TGI vs Triton on Kubernetes: Production LLM Serving Benchmark (2026)

Honest comparison of vLLM, Hugging Face TGI, and NVIDIA Triton with TensorRT-LLM for self-hosted LLM serving on Kubernetes: throughput, latency, GPU efficiency, operational complexity, and decision matrix for Llama 3.1, Qwen, and DeepSeek deployments.

The three serious choices for self-hosted LLM serving on Kubernetes in 2026 are vLLM, Hugging Face TGI (Text Generation Inference), and NVIDIA Triton with TensorRT-LLM. They’re aimed at different parts of the cost/complexity curve, and picking the wrong one burns weeks of GPU hours.

This post benchmarks all three on identical hardware, explains why their relative performance varies by workload, and gives you a decision matrix based on what we’ve seen in production.

TL;DR decision matrix

If you need… | Pick
Fastest path to production, broadest model support | vLLM
Tight integration with Hugging Face Hub, Inference Endpoints ergonomics | TGI
Highest throughput on NVIDIA H100/H200 with engineering budget to match | Triton + TRT-LLM
AMD MI300X or sovereign GPU | vLLM (only real option)
Simple OpenAI-compatible API for LiteLLM integration | vLLM or TGI (both native; Triton needs adapter)
Multi-model serving on shared GPU | Triton, or vLLM with LoRA adapters
You’re experimenting and want to iterate | vLLM

If you can’t articulate why you need Triton, use vLLM.

Benchmark setup

Hardware: 2 × NVIDIA H100 80GB on an EKS node (p5.4xlarge), NVMe local storage, 100 Gbps networking. Model: Llama 3.1 70B Instruct. Input: 1024 tokens. Output: 256 tokens. Concurrency sweep: 1, 8, 32, 128.

All three engines tuned by their respective best-practice guides, warm-up excluded. Numbers below are from our runs in Q1 2026; check the links to each project’s reference benchmarks for official numbers.
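For reference, the three metrics reported below fall out of per-token arrival timestamps on a streaming response. A minimal sketch (not our actual harness; function and field names are ours):

```python
def stream_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, mean inter-token latency, and throughput from
    per-token arrival timestamps (all times in seconds)."""
    ttft_ms = (token_times[0] - request_start) * 1000
    gaps_ms = [(b - a) * 1000 for a, b in zip(token_times, token_times[1:])]
    total_s = token_times[-1] - request_start
    return {
        "ttft_ms": ttft_ms,
        "itl_ms_mean": sum(gaps_ms) / len(gaps_ms),
        "tokens_per_sec": len(token_times) / total_s,
    }

# 4 tokens: first arrives at 180 ms, then one every 20 ms
m = stream_metrics(0.0, [0.180, 0.200, 0.220, 0.240])
```

Run that per request across the concurrency sweep, then aggregate p50/p95 over requests.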

Throughput (tokens/sec, higher is better)

Concurrency | vLLM 0.6.x | TGI 2.4.x | Triton + TRT-LLM 0.14
1 | 55 | 52 | 63
8 | 410 | 385 | 510
32 | 2,850 | 2,620 | 3,950
128 | 4,300 | 3,980 | 6,200

Time-to-first-token (ms, lower is better)

Concurrency | vLLM | TGI | Triton
1 | 180 | 195 | 150
8 | 220 | 240 | 175
32 | 340 | 380 | 260
128 | 620 | 700 | 410

Inter-token latency (ms/token, lower is better)

Concurrency | vLLM | TGI | Triton
1 | 17.8 | 18.2 | 15.1
32 | 21.5 | 23.0 | 18.4
128 | 26.0 | 28.5 | 21.5

Triton leads on every axis by 20-45%. That’s the honest headline. But the setup time for Triton was ~12 hours (TRT-LLM engine compile, config tuning, model repo setup) vs. under an hour for vLLM.

Operational complexity trade-off

The decision isn’t “fastest tokens per second” - it’s “fastest tokens per engineering-hour at your required scale”. Rough complexity index:

Concern | vLLM | TGI | Triton + TRT-LLM
Initial deploy (YAML-to-serving) | 1 hour | 1 hour | 8-16 hours
Model swap | Pod restart | Pod restart | Recompile TRT engine (~30-60 min)
Multi-tenant / LoRA | Built-in (LoRA adapters) | Built-in | Complex
Python in the hot path | Yes (~10% overhead) | No (Rust core) | No (C++ core)
Observability | Prometheus native | Prometheus native | Prometheus + custom metrics
Community velocity | Very high | High | Moderate
Documentation clarity | Good | Good | Variable
Vendor support | Community | Hugging Face paid option | Enterprise contracts

If your team has strong C++/CUDA expertise and headcount to babysit the compile pipeline, Triton wins. If not, the engineering-hours cost exceeds the throughput savings until you’re at very high scale.

Deploying vLLM on Kubernetes

The simplest production-quality manifest:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-70b
  namespace: llm-serving
spec:
  replicas: 1                      # scale via KEDA
  selector:
    matchLabels: {app: vllm-llama-70b}
  template:
    metadata:
      labels: {app: vllm-llama-70b}
    spec:
      nodeSelector:
        node.kubernetes.io/gpu-family: h100
      tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
      containers:
        - name: vllm
          image: vllm/vllm-openai:v0.6.4
          args:
            - "--model=meta-llama/Llama-3.1-70B-Instruct"
            - "--tensor-parallel-size=2"
            - "--dtype=bfloat16"
            - "--max-model-len=16384"
            - "--gpu-memory-utilization=0.90"
            - "--enable-prefix-caching"
            - "--enable-chunked-prefill"
            - "--max-num-batched-tokens=8192"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "8"
              memory: "64Gi"
              "nvidia.com/gpu": "2"
            limits:
              memory: "64Gi"
              "nvidia.com/gpu": "2"
          readinessProbe:
            httpGet: {path: /health, port: 8000}
            periodSeconds: 10
            failureThreshold: 30     # generous - model load takes time
          startupProbe:
            httpGet: {path: /health, port: 8000}
            periodSeconds: 10
            failureThreshold: 60     # up to 10 min for initial load
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: vllm-model-cache
        - name: dshm
          emptyDir: {medium: Memory, sizeLimit: 8Gi}
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama-70b
  namespace: llm-serving
spec:
  selector: {app: vllm-llama-70b}
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP

Key knobs:

  • --tensor-parallel-size = number of GPUs the model is sharded across
  • --gpu-memory-utilization=0.90 gives KV cache room; 0.95 works but breaks on memory spikes
  • --enable-prefix-caching helps RAG workloads with shared system prompts
  • --max-model-len trades KV cache memory for supported context length
  • --max-num-batched-tokens=8192 caps how much prefill can pack into a single iteration

Model cache PVC prevents re-downloading 140 GB from Hugging Face on every pod restart.
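A matching claim for that cache might look like the following (storage class is an assumption; size to the models you actually cache):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: llm-serving
spec:
  accessModes: ["ReadWriteOnce"]   # ReadOnlyMany if your CSI driver supports it
  storageClassName: gp3            # hypothetical; use your cluster's class
  resources:
    requests:
      storage: 300Gi               # ~140 GB of weights plus headroom for revisions
```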

Deploying TGI on Kubernetes

Near-identical topology:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tgi-llama-70b
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels: {app: tgi-llama-70b}
  template:
    metadata:
      labels: {app: tgi-llama-70b}
    spec:
      nodeSelector:
        node.kubernetes.io/gpu-family: h100
      tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
      containers:
        - name: tgi
          image: ghcr.io/huggingface/text-generation-inference:2.4.1
          args:
            - "--model-id=meta-llama/Llama-3.1-70B-Instruct"
            - "--num-shard=2"
            - "--dtype=bfloat16"
            - "--max-input-length=8192"
            - "--max-total-tokens=16384"
            - "--max-batch-prefill-tokens=8192"
          env:
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: huggingface-token
                  key: token
          ports:
            - containerPort: 80
          resources:
            requests: {cpu: 8, memory: 64Gi, "nvidia.com/gpu": 2}
            limits:   {memory: 64Gi, "nvidia.com/gpu": 2}
          volumeMounts:
            - {name: model-cache, mountPath: /data}
            - {name: dshm, mountPath: /dev/shm}
      volumes:
        - name: model-cache
          persistentVolumeClaim: {claimName: tgi-model-cache}
        - name: dshm
          emptyDir: {medium: Memory, sizeLimit: 8Gi}

TGI quirks to know:

  • /v1/chat/completions is OpenAI-compatible, but a few OpenAI-specific fields (e.g., logit_bias before 2.3) behave differently
  • Both the Messages API and the native generate endpoint are supported; apps coupled tightly to OpenAI Completions semantics may need adjustment
  • TGI 2.0+ added significant ROCm support for MI300X

Deploying Triton + TensorRT-LLM on Kubernetes

This is where complexity spikes. You need two steps: build the engine (CI job), then serve it (deployment).

Build step (runs on a GPU worker with the same GPU class as production):

apiVersion: batch/v1
kind: Job
metadata:
  name: build-trt-llm-engine-llama70b
  namespace: llm-build
spec:
  template:
    spec:
      restartPolicy: Never
      nodeSelector: {node.kubernetes.io/gpu-family: h100}
      tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
      containers:
        - name: build
          image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
          command: ["/bin/bash", "-c"]
          args:
            - |
              set -euo pipefail
              cd /workspace
              # 1. Convert HF checkpoint to TRT-LLM format
              python3 /opt/trtllm/examples/llama/convert_checkpoint.py \
                --model_dir /models/Llama-3.1-70B-Instruct \
                --output_dir /workspace/trtllm-ckpt \
                --dtype bfloat16 \
                --tp_size 2

              # 2. Build the engine
              trtllm-build \
                --checkpoint_dir /workspace/trtllm-ckpt \
                --output_dir /workspace/engines/llama70b \
                --gemm_plugin auto \
                --max_batch_size 64 \
                --max_input_len 8192 \
                --max_seq_len 16384 \
                --workers 2

              # 3. Copy to shared storage for the serving deployment
              aws s3 sync /workspace/engines/llama70b \
                s3://llm-engines/llama-3.1-70b-h100-tp2/
          resources:
            requests: {cpu: 16, memory: 120Gi, "nvidia.com/gpu": 2}
            limits:   {memory: 120Gi, "nvidia.com/gpu": 2}
          volumeMounts:
            - {name: model-src, mountPath: /models}
      volumes:
        - name: model-src
          persistentVolumeClaim: {claimName: hf-model-source}

Serving step:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-llama-70b
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels: {app: triton-llama-70b}
  template:
    metadata:
      labels: {app: triton-llama-70b}
    spec:
      nodeSelector: {node.kubernetes.io/gpu-family: h100}
      tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
      initContainers:
        - name: pull-engine
          image: amazon/aws-cli:2.17
          command: ["aws", "s3", "sync", "s3://llm-engines/llama-3.1-70b-h100-tp2/", "/engines/llama70b/"]
          volumeMounts:
            - {name: engine-vol, mountPath: /engines}
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.10-trtllm-python-py3
          command: ["tritonserver"]
          args:
            - "--model-repository=/models"
            - "--grpc-port=8001"
            - "--http-port=8000"
            - "--metrics-port=8002"
            - "--log-verbose=1"
          ports:
            - {containerPort: 8000, name: http}
            - {containerPort: 8001, name: grpc}
            - {containerPort: 8002, name: metrics}
          resources:
            requests: {cpu: 8, memory: 64Gi, "nvidia.com/gpu": 2}
            limits:   {memory: 64Gi, "nvidia.com/gpu": 2}
          volumeMounts:
            - {name: engine-vol, mountPath: /models}
            - {name: dshm, mountPath: /dev/shm}
      volumes:
        - name: engine-vol
          emptyDir: {sizeLimit: 200Gi}
        - name: dshm
          emptyDir: {medium: Memory, sizeLimit: 8Gi}

Key gotchas:

  • Engine is GPU-family-specific. Building on H100 produces an engine that runs only on H100. Plan a CI matrix if you have mixed-GPU fleets.
  • Add the Triton OpenAI frontend (tensorrtllm_backend/examples/openai) or a LiteLLM-custom-provider wrapper to get OpenAI-compatible APIs.
  • Model loading is fast (engine is pre-compiled) - readiness within ~30 seconds post-init.
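One gap in the serving manifest above: no probes. Triton exposes KServe-v2 health endpoints on its HTTP port, so a sketch (thresholds are illustrative) would be:

```yaml
readinessProbe:
  httpGet: {path: /v2/health/ready, port: 8000}
  periodSeconds: 10
livenessProbe:
  httpGet: {path: /v2/health/live, port: 8000}
  periodSeconds: 30
startupProbe:
  httpGet: {path: /v2/health/ready, port: 8000}
  periodSeconds: 5
  failureThreshold: 24    # engine load is fast, but the S3 pull can take minutes
```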

Autoscaling all three

Use KEDA with a Prometheus scaler on LLM-specific metrics:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-llama-70b
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-llama-70b
  minReplicaCount: 1
  maxReplicaCount: 8
  pollingInterval: 30
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring:9090
        query: sum(vllm:num_requests_waiting)
        threshold: "32"
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring:9090
        query: histogram_quantile(0.95, sum(rate(vllm:e2e_request_latency_seconds_bucket[2m])) by (le))
        threshold: "5"

Metric names differ: TGI exposes tgi_request_inference_duration_seconds; Triton uses nv_inference_request_count and nv_inference_request_duration_us.

Don’t use CPU-based HPA. LLM serving pods idle at low CPU while the GPU is saturated.
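The same ScaledObject pattern carries over to the other engines by swapping the query. A sketch for TGI using the latency histogram named above (threshold is illustrative):

```yaml
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-operated.monitoring:9090
      # p95 inference latency in seconds, from TGI's native histogram
      query: histogram_quantile(0.95, sum(rate(tgi_request_inference_duration_seconds_bucket[2m])) by (le))
      threshold: "5"
```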

GPU efficiency and cost

A useful lens: tokens per dollar at a given latency SLO. Rough numbers for Llama 3.1 70B at p95 inter-token latency ≤ 25ms on AWS me-central-1 EKS (p5.4xlarge, 2 × H100 80GB):

Engine | Max concurrency at SLO | Tokens/sec | Cost/hour (AED) | Tokens per AED
vLLM | 96 | 3,800 | ~150 | ~91,000
TGI | 88 | 3,500 | ~150 | ~84,000
Triton | 160 | 5,700 | ~150 | ~137,000

Triton wins throughput-per-dollar at steady state. vLLM wins when engineering time is costed in. TGI sits between.
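The tokens-per-AED column is just sustained throughput times 3,600 seconds divided by hourly cost:

```python
def tokens_per_unit_cost(tokens_per_sec: float, cost_per_hour: float) -> float:
    """Sustained tokens generated per unit of currency."""
    return tokens_per_sec * 3600 / cost_per_hour

# Figures from the table above (AED ~150/hour for 2 x H100)
vllm = tokens_per_unit_cost(3_800, 150)    # 91,200 -> ~91,000 per AED
triton = tokens_per_unit_cost(5_700, 150)  # 136,800 -> ~137,000 per AED
```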

Serving multi-model with LoRA adapters

For teams fine-tuning the same base model with multiple LoRA adapters:

  • vLLM supports multi-LoRA serving natively (--enable-lora --max-loras N). A single vLLM pod can serve many adapter variants. Best choice for LoRA-heavy workloads.
  • TGI supports multi-LoRA since 2.2 but with a lower max-adapter count by default.
  • Triton supports adapter merging but the deployment pattern is more involved; typically one engine per adapter.
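For vLLM, enabling multi-LoRA is a handful of flags on the deployment shown earlier. Adapter names and paths below are hypothetical, and the adapters must live on a mounted volume:

```yaml
args:
  - "--model=meta-llama/Llama-3.1-70B-Instruct"
  - "--tensor-parallel-size=2"
  - "--enable-lora"
  - "--max-loras=8"
  - "--lora-modules"
  - "support-v2=/adapters/support-v2"   # hypothetical adapter name and path
  - "billing-v1=/adapters/billing-v1"
```

Clients then select an adapter by passing its name (e.g., `support-v2`) as the `model` field of the OpenAI-compatible request.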

GCC data sovereignty notes

All three engines run fine in sovereign GPU environments:

  • NVIDIA GPU families in-region: H100 on Azure UAE North, H200 on Core42, L40S/L4 for smaller models across multiple GCC providers
  • AMD MI300X in sovereign cloud: Core42’s UAE footprint and some KSA providers. vLLM only.
  • Model weights: mirror Hugging Face models to an in-region S3/Blob with SSE-KMS before deploying. Don’t let production pods pull from huggingface.co directly - both for availability and for traffic residency.

Integration with LiteLLM

Regardless of the engine, register it in your LiteLLM config as an OpenAI-compatible provider:

model_list:
  - model_name: llama-3.1-70b-selfhosted
    litellm_params:
      model: openai/meta-llama/Llama-3.1-70B-Instruct
      api_base: http://vllm-llama-70b.llm-serving.svc.cluster.local
      api_key: "dummy"         # self-hosted, but field is required
    model_info:
      mode: chat

Dify, orchestration services, or any OpenAI SDK client then calls llama-3.1-70b-selfhosted via LiteLLM with virtual-key enforcement. See our LiteLLM guide.

Observability

All three expose Prometheus metrics; the names differ. Unified dashboards we ship:

  • Throughput - tokens per second per replica (vllm: vllm:generation_tokens_total rate, tgi: tgi_generated_tokens counter, triton: custom from TRT-LLM backend)
  • Queue depth - *_waiting / *_queued equivalents; alert when > 100 for 1 minute
  • GPU utilization - DCGM exporter: DCGM_FI_DEV_GPU_UTIL. Anything under 70% at peak means the engine isn’t saturating the hardware; tune batch size.
  • Request latency - p50/p95/p99 TTFT and end-to-end
  • OOM / retry events - watch for KV cache exhaustion; if frequent, raise gpu_memory_utilization or reduce max_model_len
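The queue-depth alert above can be expressed as a standard Prometheus rule (rule group name and severity label are ours):

```yaml
groups:
  - name: llm-serving
    rules:
      - alert: LLMQueueBacklog
        expr: sum(vllm:num_requests_waiting) > 100
        for: 1m
        labels: {severity: warning}
        annotations:
          summary: "vLLM request queue above 100 for 1 minute"
```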

Common pitfalls

  • Pod scheduled to wrong GPU class - pod runs but at 3x expected latency. Use node labels and strict node selectors per engine deployment.
  • Model cache PVC not persisted - every pod restart pulls 140 GB from Hugging Face. Use a persistent cache volume, ideally one shared (ReadOnlyMany if supported).
  • max_model_len set too high - KV cache runs out of memory at moderate batch sizes. Tune based on your real prompt-length distribution, not the model’s theoretical max.
  • Tensor-parallel size mismatch - requesting TP=2 on a node exposing a single GPU fails at startup or degrades badly; check startup logs and match TP to the GPU count per pod.
  • Triton engine versus runtime version skew - a TRT-LLM 0.12 engine won’t run on 0.14 runtime. Pin versions tightly.
  • Streaming responses dropping on ingress - some ingress controllers buffer SSE. Configure nginx.ingress.kubernetes.io/proxy-buffering: "off" for LLM service routes.
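For the SSE-buffering pitfall, the fix as a full ingress-nginx resource (hostname is a placeholder):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-llama-70b
  namespace: llm-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-buffering: "off"    # don't buffer SSE streams
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300" # long generations
spec:
  ingressClassName: nginx
  rules:
    - host: llm.example.internal
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-llama-70b
                port: {number: 80}
```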

When to use which: our recommendation

For most GCC enterprise teams running self-hosted LLMs:

  1. Start with vLLM. Fastest to production, best community momentum, easiest to tune.
  2. Reach for TGI if you’re standardized on Hugging Face Enterprise or need Hugging Face’s evaluation ecosystem tightly integrated.
  3. Switch to Triton + TRT-LLM only when all three hold: (a) you’re past 2,000 sustained RPS on a single model, (b) you have a platform team that can own the compile pipeline, and (c) the throughput difference meaningfully reduces GPU cost at your scale.

We’ve migrated clients in both directions. For a 2,500-RPS customer-service workload on H100, Triton saved ~40% GPU cost at the price of 2 extra engineering weeks to set up properly. For a low-traffic internal assistant, migrating to Triton was a waste.

What this connects to

Self-hosted LLM serving is the generation layer in a broader stack: a gateway (LiteLLM) sits in front for routing, keys, and spend tracking, and orchestration tools like Dify sit on top.

Getting help

NomadX operates self-hosted LLM serving for GCC enterprise teams on both NVIDIA and AMD sovereign clouds. If you want a benchmark against your real workload, a capacity model for your traffic pattern, or a production cutover from managed to self-hosted, our AI/ML Infrastructure on Kubernetes engagement is the starting point. Typical scope: 3-6 weeks, including model validation and load-test sign-off.

Frequently Asked Questions

Which LLM serving framework should I use on Kubernetes?

For most production deployments, vLLM is the right default: fastest to deploy, best OpenAI API compatibility, actively developed, broad model support. Use Hugging Face TGI when you're already in the Hugging Face Enterprise ecosystem or need their tooling around model validation and safety. Use NVIDIA Triton with TensorRT-LLM when you need the absolute highest throughput on NVIDIA hardware and can absorb a substantial operational complexity premium - typically justified only at thousands of sustained requests per second, or when a specific latency SLO requires TRT-LLM's optimizations.

How does TensorRT-LLM compare to vLLM on throughput?

TensorRT-LLM typically delivers 20-50% higher throughput than vLLM on equivalent NVIDIA hardware for dense models like Llama 3.1 70B, and can be 2x faster on specific workloads with speculative decoding or FP8 quantization. The cost is a model-compilation step that takes 10-60 minutes per model/GPU combination, and the resulting engine is locked to that specific GPU family and TensorRT-LLM version. vLLM requires no compilation and can switch models on restart.

Can I run vLLM, TGI, or Triton on AMD MI300X GPUs?

vLLM has first-class ROCm support on AMD MI300X since v0.4, and production deployments on MI300X are increasingly common in sovereign cloud (including GCC). TGI has ROCm support since v2.0 but fewer optimizations than vLLM on AMD. Triton with TensorRT-LLM is NVIDIA-only. If you're committed to AMD or a sovereign cloud with MI300X, vLLM is the practical choice.

How do I autoscale LLM serving pods on Kubernetes?

LLM serving pods cannot scale on CPU because the workload is GPU-bound. Use KEDA with a Prometheus scaler on a custom metric: request queue length, in-flight token count, or p95 latency. A working pattern: scale up when queue length exceeds 32 for 30 seconds, scale down when queue is empty for 5 minutes. Because pod startup takes 30-120 seconds (model load from disk or S3), over-provision by 20-40% rather than relying on reactive scaling for bursty traffic.

How much GPU memory does a 70B model need?

Llama 3.1 70B requires roughly 140 GB of GPU memory in FP16 for weights alone, plus 20-40 GB for the KV cache at moderate batch sizes, totaling ~160-180 GB. That fits on 2 × H100 80GB, 2 × H200 141GB (with room to spare), or 1 × H200 141GB with aggressive quantization. INT8 quantization halves the weight memory to ~70 GB, allowing 1 × H100 80GB deployments with reduced context length. FP8 on Hopper-class GPUs gives similar memory savings with better quality preservation.
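These figures are straightforward to sanity-check. A back-of-envelope sketch using Llama 3.1 70B's architecture (80 layers, 8 KV heads under GQA, head dim 128):

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weights-only GPU memory in GB (excludes KV cache and activations)."""
    return params_billions * bytes_per_param

def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                bytes_per_elem: int, tokens: int) -> float:
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim per token."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1e9

fp16_weights = weight_memory_gb(70, 2)               # 140.0 GB
int8_weights = weight_memory_gb(70, 1)               # 70.0 GB
kv_per_16k_seq = kv_cache_gb(80, 8, 128, 2, 16_384)  # ~5.4 GB per full-context sequence
```

A few concurrent full-context sequences thus account for the 20-40 GB KV cache budget quoted above.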

How do I serve self-hosted LLMs to a LiteLLM gateway?

vLLM and TGI both expose OpenAI-compatible APIs out of the box. Deploy them with a ClusterIP service, then register them as OpenAI-compatible models in your LiteLLM proxy config. Triton requires a small adapter (or the Triton OpenAI-compatible frontend released in 2024) to match the OpenAI schema. Once registered, LiteLLM handles routing, fallback, virtual keys, and spend tracking across self-hosted and cloud models uniformly.
