Deploy LiteLLM Proxy on Kubernetes: Enterprise LLM Gateway Guide (2026)
Run LiteLLM as a production LLM gateway on Kubernetes: virtual keys, per-team budgets, provider fallbacks, Redis caching, Postgres persistence, Langfuse tracing, and GCC-sovereign multi-provider routing across OpenAI, Azure, Bedrock, and Vertex.
Every enterprise rolling out an LLM program hits the same problem around month three: provider accounts have proliferated across engineering teams. One team is on OpenAI with a corporate card, another got Azure OpenAI via an Enterprise Agreement, a third is hitting Bedrock directly from a Lambda, and nobody can tell finance which team spent what. Add GCC data-residency requirements and the situation compounds fast.
LiteLLM proxy on Kubernetes is the fix. It’s a single OpenAI-compatible endpoint that fronts every provider, enforces virtual-key budgets, and hands clean cost data to finance. This guide is the production deployment we use with clients.
Architecture
                        ┌────────────────────────────────┐
  Client App ──────────▶│  LiteLLM Proxy (FastAPI)       │
  (OpenAI SDK)          │  Deployment: 3+ replicas       │
  sk-litellm-abc123     └───┬────────────────────┬───────┘
                            │                    │
                Virtual key │   Model lookup, budget check,
                lookup      │   rate limit, rewrite to provider
                            ▼                    │
                     ┌────────────┐              │
                     │  Postgres  │              │
                     │  (keys,    │              │
                     │  budgets,  │              │
                     │  spend)    │              │
                     └────────────┘              │
                                                 │
                     ┌────────────┐              │
                     │   Redis    │◀─────────────┘
                     │  (cache,   │  Cache hits, rate limits,
                     │  RL state) │  streaming coordination
                     └────────────┘
                            │
     ┌──────────────┬───────┴──────┬──────────────┬──────────────┐
     ▼              ▼              ▼              ▼              ▼
  Azure OpenAI   Bedrock        OpenAI       Anthropic     Self-hosted
  (UAE North)   (me-south-1)     (US)           (US)      vLLM (cluster)
     │              │              │              │              │
     └──────────────┴──────────────┴──────────────┴──────────────┘
                                   │
                                   ▼
                             ┌──────────┐
                             │ Langfuse │  Traces, costs, evals
                             └──────────┘
Invariants:
- LiteLLM proxy is stateless. Scale horizontally with plain HPA on request rate or CPU.
- Postgres is the policy database. Holds virtual keys, teams, budgets, spend aggregates. Must be HA.
- Redis is optional but recommended. Required for prompt caching and distributed rate limiting.
- Provider keys live in Kubernetes secrets, mounted into the proxy as env vars.
- Callbacks to Langfuse/DataDog/S3 run async after the response returns, so they don’t add user-facing latency.
Prerequisites
kubectl version --client # 1.28+
helm version # 3.14+
Dependencies you need provisioned:
- Postgres 14+ with a dedicated litellm database
- Redis 7+ (single instance or Sentinel)
- Provider API keys in your secrets backend (AWS Secrets Manager, Azure Key Vault, Vault)
- A Langfuse instance if you want integrated tracing - see our Langfuse guide
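Provisioning the database is a one-off. A minimal sketch against your Postgres - the role name and password here are placeholders, and LiteLLM runs its own Prisma migrations on startup, so the role just needs full rights on this one database:

```sql
-- Placeholder role/password: substitute your own managed credentials.
CREATE ROLE litellm WITH LOGIN PASSWORD 'change-me';
CREATE DATABASE litellm OWNER litellm;
```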
Helm install
Add the official chart:
helm repo add litellm https://berriai.github.io/litellm-helm
helm repo update
kubectl create namespace llm-gateway
Production values.yaml:
# values.prod.yaml
image:
  repository: ghcr.io/berriai/litellm-database
  tag: "v1.55.3-stable"  # use the -stable tag in prod
  pullPolicy: IfNotPresent

replicaCount: 3

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    memory: "4Gi"

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

podDisruptionBudget:
  enabled: true
  minAvailable: 2

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: litellm

masterkey: null  # must come from secret
masterkeySecretName: litellm-master
masterkeySecretKey: master-key

db:
  deployStandalone: false
  useExisting: true
  endpoint: "litellm-pg-rw.data.svc.cluster.local"
  database: "litellm"
  url: null  # built from env secret
  useStackgresCluster: false

redis:
  enabled: false  # using external Redis
  host: "litellm-redis-master.data.svc.cluster.local"
  port: 6379

envVars:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-db-creds
        key: database-url
  - name: REDIS_HOST
    value: "litellm-redis-master.data.svc.cluster.local"
  - name: REDIS_PORT
    value: "6379"
  - name: REDIS_PASSWORD
    valueFrom:
      secretKeyRef:
        name: litellm-redis-creds
        key: password
  - name: STORE_MODEL_IN_DB
    value: "True"
  - name: UI_USERNAME
    value: "admin"
  - name: UI_PASSWORD
    valueFrom:
      secretKeyRef:
        name: litellm-ui-creds
        key: password
  - name: LITELLM_LOG
    value: "INFO"
  - name: LITELLM_DISABLE_VERSION_CHECK
    value: "true"
  # Provider keys (examples - adapt to your providers)
  - name: AZURE_OPENAI_API_KEY_UAE
    valueFrom:
      secretKeyRef:
        name: llm-provider-keys
        key: azure-uae-north
  - name: AZURE_OPENAI_ENDPOINT_UAE
    value: "https://mycompany-uae.openai.azure.com"
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: llm-provider-keys
        key: aws-access-key
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: llm-provider-keys
        key: aws-secret-key
  - name: AWS_REGION_NAME
    value: "me-south-1"
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-provider-keys
        key: anthropic
  # Langfuse integration
  - name: LANGFUSE_PUBLIC_KEY
    valueFrom:
      secretKeyRef:
        name: langfuse-integration
        key: public-key
  - name: LANGFUSE_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: langfuse-integration
        key: secret-key
  - name: LANGFUSE_HOST
    value: "https://langfuse.example.ae"

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  hosts:
    - host: llm.example.ae
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: litellm-tls
      hosts: [llm.example.ae]

serviceMonitor:
  enabled: true
  namespace: monitoring
  labels:
    release: kube-prometheus-stack

# The config.yaml passed to the proxy. This is where the real policy lives.
proxy_config:
  model_list:
    - model_name: gpt-4o-uae-primary
      litellm_params:
        model: azure/gpt-4o
        api_base: os.environ/AZURE_OPENAI_ENDPOINT_UAE
        api_key: os.environ/AZURE_OPENAI_API_KEY_UAE
        api_version: "2024-08-01-preview"
      model_info:
        mode: chat
    - model_name: claude-sonnet-bedrock-me
      litellm_params:
        model: bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
        aws_region_name: os.environ/AWS_REGION_NAME
    - model_name: claude-sonnet-fallback
      litellm_params:
        model: anthropic/claude-3-5-sonnet-20241022
        api_key: os.environ/ANTHROPIC_API_KEY

  # Model groups with fallbacks
  router_settings:
    routing_strategy: latency-based-routing
    fallbacks:
      - gpt-4o-uae-primary: [claude-sonnet-bedrock-me]
      - claude-sonnet-bedrock-me: [claude-sonnet-fallback]
    redis_host: os.environ/REDIS_HOST
    redis_port: os.environ/REDIS_PORT
    redis_password: os.environ/REDIS_PASSWORD
    num_retries: 2
    timeout: 60
    allowed_fails: 3
    cooldown_time: 30

  # Global settings
  litellm_settings:
    drop_params: true
    success_callback: ["langfuse"]
    failure_callback: ["langfuse"]
    cache: true
    cache_params:
      type: redis
      host: os.environ/REDIS_HOST
      port: os.environ/REDIS_PORT
      password: os.environ/REDIS_PASSWORD
      ttl: 600
      # Only cache prompts marked explicitly
      supported_call_types: ["acompletion", "aembedding"]

  general_settings:
    master_key: os.environ/LITELLM_MASTER_KEY
    database_url: os.environ/DATABASE_URL
    store_model_in_db: true
    alerting: ["slack"]
    alerting_threshold: 300
    budget_duration: 30d
    max_budget: 50000  # USD per month org-wide hard cap
    # Prevents key leakage in logs
    redact_messages_in_exceptions: true
Install:
helm upgrade --install litellm litellm/litellm \
--namespace llm-gateway \
--values values.prod.yaml \
--version 0.5.0 \
--wait --timeout 10m
Check the proxy is healthy:
kubectl exec -n llm-gateway deploy/litellm -- \
curl -s http://localhost:4000/health/liveliness
Virtual key management
Once the proxy is up, the master key unlocks the admin API. Create a team and issue virtual keys:
# Create a team with a monthly budget
curl -X POST https://llm.example.ae/team/new \
-H "Authorization: Bearer $MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"team_alias": "customer-support",
"max_budget": 2000,
"budget_duration": "30d",
"tpm_limit": 100000,
"rpm_limit": 1000,
"models": ["gpt-4o-uae-primary", "claude-sonnet-bedrock-me"]
}'
# Issue a key for that team
curl -X POST https://llm.example.ae/key/generate \
-H "Authorization: Bearer $MASTER_KEY" \
-H "Content-Type: application/json" \
-d '{
"team_id": "team_cx_abc123",
"key_alias": "customer-support-prod",
"max_budget": 500,
"budget_duration": "30d",
"metadata": {"environment": "production", "owner": "ai-platform"}
}'
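The same key-issuance call, sketched from Python. This is a hypothetical helper (not part of LiteLLM) that only assembles the `/key/generate` request shown above, so payloads can be reviewed or logged before a key is actually issued; the base URL and master key are placeholders.

```python
import json

ADMIN_BASE = "https://llm.example.ae"  # placeholder: your proxy ingress host

def build_key_generate(master_key, team_id, alias,
                       max_budget=500, duration="30d"):
    """Assemble the /key/generate request without sending it."""
    return {
        "url": f"{ADMIN_BASE}/key/generate",
        "headers": {
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "team_id": team_id,
            "key_alias": alias,
            "max_budget": max_budget,
            "budget_duration": duration,
        }),
    }

req = build_key_generate("sk-master-placeholder", "team_cx_abc123",
                         "customer-support-prod")
# send with: requests.post(req["url"], headers=req["headers"], data=req["body"])
```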
The returned key string is the only thing the team’s application ever sees. If it leaks, rotate it without touching provider keys:
curl -X POST https://llm.example.ae/key/regenerate \
-H "Authorization: Bearer $MASTER_KEY" \
-d '{"key": "sk-litellm-old-leaked-key"}'
Finance can now pull real spend data from Postgres:
SELECT
  team_alias,
  date_trunc('day', created_at) AS day,
  SUM(spend)                    AS daily_spend_usd,
  COUNT(*)                      AS requests
FROM "LiteLLM_SpendLogs" s
JOIN "LiteLLM_TeamTable" t ON t.team_id = s.team_id
WHERE created_at >= NOW() - INTERVAL '30 days'
GROUP BY 1, 2
ORDER BY 2 DESC, 3 DESC;
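If you would rather do the chargeback rollup in code (say, from a CSV export of the spend logs), the same aggregation is a few lines of Python. Assumes rows shaped as (team_alias, spend_usd) pairs, mirroring the query above:

```python
from collections import defaultdict

def spend_by_team(rows):
    """Sum spend per team from (team_alias, spend_usd) rows."""
    totals = defaultdict(float)
    for team, spend in rows:
        totals[team] += spend
    return dict(totals)

rows = [("customer-support", 12.5), ("search", 3.0), ("customer-support", 7.5)]
print(spend_by_team(rows))  # {'customer-support': 20.0, 'search': 3.0}
```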
This query is the one that closes the governance gap everyone complains about.
Prompt caching
For RAG or chatbot workloads with repeated prompts, Redis-backed caching saves both money and latency. The config above enables it globally. To opt a specific call in or out:
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.ae",
    api_key="sk-litellm-abc123"
)

# Cached - same prompt returns cached response for 600s
r = client.chat.completions.create(
    model="gpt-4o-uae-primary",
    messages=[{"role": "user", "content": "Summarize doc X"}],
    extra_body={"cache": {"no-cache": False}}
)

# Force-bypass cache (e.g., user explicitly re-ran)
r = client.chat.completions.create(
    model="gpt-4o-uae-primary",
    messages=[{"role": "user", "content": "Summarize doc X"}],
    extra_body={"cache": {"no-cache": True}}
)
Typical cache hit rates we see in production:
- Deterministic agent tool calls: 40-70%
- RAG with stable document sets: 15-30%
- Free-form chatbot: < 5%
Even a 15% hit rate at GPT-4o-class pricing saves meaningful money on a busy cluster.
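Back-of-envelope math for that claim. The per-request cost below is a placeholder, not a price quote:

```python
def monthly_cache_savings(requests_per_day, hit_rate, avg_cost_per_request_usd):
    """Every cache hit avoids one provider call, so monthly savings
    are (hits per month) x (average provider cost per request)."""
    hits_per_month = requests_per_day * 30 * hit_rate
    return hits_per_month * avg_cost_per_request_usd

# 100k requests/day at a 15% hit rate and $0.01/request
print(round(monthly_cache_savings(100_000, 0.15, 0.01), 2))  # 4500.0
```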
Rate limiting and budget enforcement
LiteLLM enforces three tiers of limits: org-level (set in general_settings.max_budget), team-level (per-team budget and RPM/TPM), and key-level (per-key budget and limits). Limits are evaluated in that order - the most restrictive wins.
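The "most restrictive wins" rule is easiest to reason about as a min() over whichever tiers are set. A toy sketch of the idea (not LiteLLM's actual enforcement code):

```python
def effective_rpm(org_rpm, team_rpm, key_rpm):
    """A request must fit every configured tier, so the effective
    limit is the smallest limit that is actually set."""
    limits = [l for l in (org_rpm, team_rpm, key_rpm) if l is not None]
    return min(limits) if limits else None

print(effective_rpm(None, 1000, 500))   # 500: the key limit is tightest
print(effective_rpm(None, None, None))  # None: no limit configured
```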
Common patterns:
- Dev/staging isolation - create a team with team_alias: dev and max_budget: 200, restricted to cheap models. Hand out keys freely.
- Customer-tier enforcement - one virtual key per customer tier, with RPM limits matching the SLA. Free tier: 10 RPM, Pro: 500 RPM, Enterprise: 5000 RPM.
- Cost-runaway circuit breaker - keep alerting: ["slack"] wired up so budget crossings page the oncall (note that alerting_threshold is the slow-request threshold in seconds, not a budget percentage). We've caught two serious prompt-injection loops this way.
Langfuse trace enrichment
With the callbacks configured, every request becomes a Langfuse trace. But the default metadata is thin. Enrich it at the call site:
import uuid

r = client.chat.completions.create(
    model="gpt-4o-uae-primary",
    messages=[...],
    extra_body={
        "metadata": {
            "generation_name": "rag-answer",
            "generation_id": str(uuid.uuid4()),
            "trace_id": request_id,
            "trace_user_id": user_id,
            "session_id": session_id,
            "tags": ["prod", "rag", feature_flag_version]
        }
    }
)
In Langfuse you can then filter traces by user, session, feature flag, or virtual key - which is the view product teams actually use during incident reviews.
Security hardening
- NetworkPolicy - default-deny in the llm-gateway namespace; allow ingress only from client namespaces, plus traffic to Postgres/Redis and egress to provider IPs
- Egress filtering - restrict egress to the specific provider hostnames (*.openai.azure.com, bedrock-runtime.*.amazonaws.com, etc.) via a service mesh or egress gateway
- Master key rotation - quarterly, via external-secrets; the proxy reads the secret on pod start, so rotation is a rolling restart
- UI access - put the built-in UI behind your SSO (e.g., oauth2-proxy as a sidecar); don't rely only on the basic-auth fallback in production
- Redact messages in logs - redact_messages_in_exceptions: true keeps prompts out of exception traces. Add LITELLM_REDACT_UI_MESSAGES=true if the UI shouldn't display raw prompts to non-admins.
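A starting point for the default-deny ingress policy. The namespace label and port are assumptions - label your client namespaces however your cluster convention dictates, and pair this with an egress policy or mesh rules for the provider hostnames:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: litellm-allow-clients
  namespace: llm-gateway
spec:
  podSelector: {}          # applies to all pods in llm-gateway
  policyTypes: [Ingress]   # anything not matched below is denied
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              llm-client: "true"   # assumed label on client namespaces
      ports:
        - protocol: TCP
          port: 4000               # LiteLLM proxy port
```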
Observability
ServiceMonitor-scraped metrics worth alerting on:
- litellm_total_tokens and litellm_spend_metric - track budgets
- litellm_request_total_latency{model=...} - p95 and p99 per model; if the primary's p99 is climbing, fallback is about to activate
- litellm_deployment_state{deployment=...} - any sustained non-zero value means a provider is degraded or in cooldown
- litellm_cache_hits_total / litellm_total_requests - cache hit rate
- Postgres pg_stat_activity - spend logging can saturate the database if RPS outruns log-writer throughput; watch idle in transaction counts
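Those metrics translate into Prometheus Operator rules like the following sketch. The threshold and durations are starting points, not recommendations:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: litellm-alerts
  namespace: monitoring
spec:
  groups:
    - name: litellm
      rules:
        - alert: LiteLLMProviderDegraded
          # Sustained non-zero deployment state = upstream degraded/cooldown
          expr: max_over_time(litellm_deployment_state[10m]) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "A LiteLLM upstream deployment is degraded or in cooldown"
```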
Sizing tiers
| Tier | Requests/sec | Proxy replicas | Postgres | Redis | Est. monthly cost (AED, EKS me-central-1) |
|---|---|---|---|---|---|
| Small | <50 | 3 × 500m CPU / 1 GB | db.t3.medium, 50 GB | cache.t3.small | ~4,000 |
| Medium | 50-500 | 6 × 1 CPU / 2 GB | db.r6g.large, 200 GB | cache.r6g.medium | ~18,000 |
| Large | 500-5000 | 20 × 2 CPU / 4 GB | db.r6g.2xlarge, 500 GB | cache.r6g.xlarge cluster | ~75,000 |
The proxy itself is cheap; most cost is the downstream providers. Track upstream LLM spend separately.
Common failure modes we’ve debugged
- Requests hang for 60s then time out - one provider in a fallback chain has a wedged connection and the chart's default retry config doesn't close it. Set num_retries: 2, timeout: 30, allowed_fails: 3, cooldown_time: 30.
- Budget enforcement lags by minutes - the proxy caches budget state per pod. When a key exceeds its limit, other pods may still accept requests until the cache refreshes. Force tighter consistency with general_settings.disable_spend_logs: false and a shorter cache TTL, or accept the lag (usually fine for daily budgets).
- Postgres hits its connection limit under load - the default per-pod connection pool is too large. Set DATABASE_CONNECTION_LIMIT=20 and size Postgres for pod_count × 20 connections plus overhead.
- "Model not found" errors for models that are in the config - the proxy reads config.yaml only on boot. After changing the config via the UI, do a rolling restart. With STORE_MODEL_IN_DB=True this is less of an issue, but you still need a rolling restart for router_settings changes.
- Langfuse traces missing batches of requests - the Langfuse callback is async with a batched flush, so in-flight batches can be lost on pod termination. Set terminationGracePeriodSeconds: 60 and a preStop sleep of 30s to let the flush complete.
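The termination fix from the last bullet, as a pod-spec fragment. A sketch: adjust the sleep to your Langfuse flush interval, and merge it into whatever mechanism your chart exposes for lifecycle hooks:

```yaml
spec:
  terminationGracePeriodSeconds: 60
  containers:
    - name: litellm
      lifecycle:
        preStop:
          exec:
            # Keep the pod alive long enough for the async Langfuse
            # batch flush to drain before SIGTERM handling begins.
            command: ["sh", "-c", "sleep 30"]
```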
What this connects to
LiteLLM is the policy and routing layer of a production LLM stack. Pair it with:
- Langfuse for tracing and evaluation - see Deploy Langfuse on Kubernetes
- Qdrant as the retrieval layer for RAG - see Deploy Qdrant on Kubernetes
- vLLM as a self-hosted provider option, registered in the LiteLLM model list like any other OpenAI-compatible endpoint
- KEDA if you need request-rate-based autoscaling beyond plain HPA
Our pillar post stitches this into a complete production RAG reference architecture.
Getting help
We deploy LiteLLM as the gateway layer for GCC AI platforms with mixed Azure UAE / Bedrock Bahrain / self-hosted topologies - the kind of multi-provider setup that regulated industries actually need. If you want help with sizing, a virtual-key taxonomy for your teams, or a cutover from direct provider usage to gateway-mediated access, AI/ML Infrastructure on K8s is the engagement. Typical rollout is 2-3 weeks.
Frequently Asked Questions
What problem does LiteLLM actually solve?
LiteLLM proxy sits between your applications and LLM providers (OpenAI, Anthropic, Azure OpenAI, Bedrock, Vertex, Cohere, self-hosted vLLM). It centralizes four things that every enterprise LLM program needs: (1) a single OpenAI-compatible endpoint so apps don't couple to any one provider, (2) virtual API keys with per-team budgets and rate limits, (3) automatic fallback and load balancing across providers, and (4) uniform logging and cost attribution. Without it, every team builds their own half-baked version in application code.
Is LiteLLM production-ready for enterprise?
Yes. The proxy is used in production by companies including Lemonade, Adobe, Rocket Money, and Netflix at millions of requests per day. The Python FastAPI service scales horizontally and is stateless when backed by Postgres for key/budget state and Redis for caching. The operational risks are standard for any high-throughput API gateway: connection pool tuning, Postgres write amplification from spend logging, and Redis sizing for the cache hit rate you want.
How do virtual keys in LiteLLM work?
A virtual key is a LiteLLM-issued token mapped to a team, user, or application in the proxy's Postgres database. Each key can have an allowed model list, a monthly or daily budget in USD, a requests-per-minute limit, and custom metadata. When a request comes in with Authorization: Bearer sk-litellm-abc123, the proxy looks up the key, enforces the limits, rewrites the request to the right provider with the real provider API key, and logs the spend. Applications never see provider keys.
How do I integrate LiteLLM with Langfuse for observability?
LiteLLM has native Langfuse integration. Set success_callback: ["langfuse"] in the proxy config and provide LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST as env vars. Every request becomes a Langfuse trace with model, tokens, latency, cost, and virtual key attribution. This is the standard pattern we deploy: LiteLLM owns policy and routing, Langfuse owns traces and evaluation.
Can LiteLLM handle provider fallback when one region is down?
Yes, and this is one of the best reasons to deploy it. Define a model group with multiple litellm_params entries - for example, Azure OpenAI UAE North as primary with OpenAI US as fallback. If the primary returns a 5xx or times out, the proxy retries against the next provider in the group. Combine with router_settings.routing_strategy: latency-based-routing to always prefer the fastest healthy upstream. We use this pattern for GCC workloads that require UAE residency during normal operation but tolerate regional failover for availability.
How do I deploy LiteLLM in a UAE-sovereign setup?
Deploy the proxy into an in-region Kubernetes cluster and point it at in-region LLM providers: Azure OpenAI in UAE North, Bedrock in Middle East Bahrain, or self-hosted vLLM on your own GPU cluster. Postgres and Redis also stay in-region. The key design choice is keeping all success_callback destinations (Langfuse, S3 log sink) in the same region so trace data doesn't egress. LiteLLM itself has no home-phone telemetry - disable update check via LITELLM_DISABLE_VERSION_CHECK=true.
Get Started for Free
We would be happy to speak with you and arrange a free consultation with our Kubernetes Expert in Dubai, UAE. 30-minute call, actionable results in days.
Talk to an Expert