April 22, 2026 · 9 min read

Deploy LiteLLM Proxy on Kubernetes: Enterprise LLM Gateway Guide (2026)

Run LiteLLM as a production LLM gateway on Kubernetes: virtual keys, per-team budgets, provider fallbacks, Redis caching, Postgres persistence, Langfuse tracing, and GCC-sovereign multi-provider routing across OpenAI, Azure, Bedrock, and Vertex.

Every enterprise rolling out an LLM program hits the same problem around month three: provider accounts have proliferated across engineering teams. One team is on OpenAI with a corporate card, another got Azure OpenAI via an Enterprise Agreement, a third is hitting Bedrock directly from a Lambda, and nobody can tell finance which team spent what. Add GCC data-residency requirements and the situation compounds fast.

LiteLLM proxy on Kubernetes is the fix. It’s a single OpenAI-compatible endpoint that fronts every provider, enforces virtual-key budgets, and hands clean cost data to finance. This guide is the production deployment we use with clients.

Architecture

                       ┌────────────────────────────────┐
   Client App  ────────▶│  LiteLLM Proxy (FastAPI)      │
   (OpenAI SDK)         │  Deployment: 3+ replicas       │
   sk-litellm-abc123    └───┬────────────────────┬───────┘
                            │                    │
                 Virtual key│              Model lookup, budget check,
                 lookup     │              rate limit, rewrite to provider
                            ▼                    │
                     ┌────────────┐               │
                     │  Postgres  │               │
                     │  (keys,    │               │
                     │  budgets,  │               │
                     │  spend)    │               │
                     └────────────┘               │
                                                  │
                     ┌────────────┐               │
                     │   Redis    │◀──────────────┘
                     │  (cache,   │     Cache hits, rate limits,
                     │  RL state) │     streaming coordination
                     └────────────┘
                                                  │
              ┌───────────────────┬──────────────┼──────────────┬──────────────┐
              ▼                   ▼              ▼              ▼              ▼
        Azure OpenAI         Bedrock        OpenAI        Anthropic       Self-hosted
        (UAE North)          (me-south-1)   (US)          (US)            vLLM (cluster)
              │                   │              │              │              │
              └───────────────────┴──────────────┴──────────────┴──────────────┘
                                       │
                                       ▼
                                 ┌──────────┐
                                 │ Langfuse │  Traces, costs, evals
                                 └──────────┘

Invariants:

  • LiteLLM proxy is stateless. Scale horizontally with plain HPA on request rate or CPU.
  • Postgres is the policy database. Holds virtual keys, teams, budgets, spend aggregates. Must be HA.
  • Redis is optional but recommended. Required for prompt caching and distributed rate limiting.
  • Provider keys live in Kubernetes secrets, mounted into the proxy as env vars.
  • Callbacks to Langfuse/DataDog/S3 run async after the response returns, so they don’t add user-facing latency.

Prerequisites

kubectl version --client    # 1.28+
helm version                # 3.14+

Dependencies you need provisioned:

  • Postgres 14+ with a dedicated litellm database
  • Redis 7+ (single instance or Sentinel)
  • Provider API keys in your secrets backend (AWS Secrets Manager, Azure Key Vault, Vault)
  • A Langfuse instance if you want integrated tracing - see our Langfuse guide

Helm install

Add the official chart:

helm repo add litellm https://berriai.github.io/litellm-helm
helm repo update

kubectl create namespace llm-gateway
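
Before running the install, create the secrets the values file below references. A minimal sketch with placeholder literals - in production, sync these from your secrets backend via external-secrets or similar:

kubectl -n llm-gateway create secret generic litellm-master \
  --from-literal=master-key="sk-$(openssl rand -hex 24)"

kubectl -n llm-gateway create secret generic litellm-db-creds \
  --from-literal=database-url="postgresql://litellm:CHANGE_ME@litellm-pg-rw.data.svc.cluster.local:5432/litellm"

kubectl -n llm-gateway create secret generic litellm-redis-creds \
  --from-literal=password="CHANGE_ME"

kubectl -n llm-gateway create secret generic litellm-ui-creds \
  --from-literal=password="CHANGE_ME"

kubectl -n llm-gateway create secret generic llm-provider-keys \
  --from-literal=azure-uae-north="<azure-openai-key>" \
  --from-literal=aws-access-key="<bedrock-access-key>" \
  --from-literal=aws-secret-key="<bedrock-secret-key>" \
  --from-literal=anthropic="<anthropic-key>"

kubectl -n llm-gateway create secret generic langfuse-integration \
  --from-literal=public-key="<pk-lf-...>" \
  --from-literal=secret-key="<sk-lf-...>"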

Production values.yaml:

# values.prod.yaml
image:
  repository: ghcr.io/berriai/litellm-database
  tag: "v1.55.3-stable"      # use the -stable tag in prod
  pullPolicy: IfNotPresent

replicaCount: 3

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    memory: "4Gi"

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70

podDisruptionBudget:
  enabled: true
  minAvailable: 2

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: litellm

masterkey: null                # must come from secret
masterkeySecretName: litellm-master
masterkeySecretKey: master-key

db:
  deployStandalone: false
  useExisting: true
  endpoint: "litellm-pg-rw.data.svc.cluster.local"
  database: "litellm"
  url: null                    # built from env secret
  useStackgresCluster: false

redis:
  enabled: false               # using external Redis
  host: "litellm-redis-master.data.svc.cluster.local"
  port: 6379

envVars:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: litellm-db-creds
        key: database-url
  - name: REDIS_HOST
    value: "litellm-redis-master.data.svc.cluster.local"
  - name: REDIS_PORT
    value: "6379"
  - name: REDIS_PASSWORD
    valueFrom:
      secretKeyRef:
        name: litellm-redis-creds
        key: password
  - name: STORE_MODEL_IN_DB
    value: "True"
  - name: UI_USERNAME
    value: "admin"
  - name: UI_PASSWORD
    valueFrom:
      secretKeyRef:
        name: litellm-ui-creds
        key: password
  - name: LITELLM_LOG
    value: "INFO"
  - name: LITELLM_DISABLE_VERSION_CHECK
    value: "true"
  # Provider keys (examples - adapt to your providers)
  - name: AZURE_OPENAI_API_KEY_UAE
    valueFrom:
      secretKeyRef:
        name: llm-provider-keys
        key: azure-uae-north
  - name: AZURE_OPENAI_ENDPOINT_UAE
    value: "https://mycompany-uae.openai.azure.com"
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: llm-provider-keys
        key: aws-access-key
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: llm-provider-keys
        key: aws-secret-key
  - name: AWS_REGION_NAME
    value: "me-south-1"
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: llm-provider-keys
        key: anthropic
  # Langfuse integration
  - name: LANGFUSE_PUBLIC_KEY
    valueFrom:
      secretKeyRef:
        name: langfuse-integration
        key: public-key
  - name: LANGFUSE_SECRET_KEY
    valueFrom:
      secretKeyRef:
        name: langfuse-integration
        key: secret-key
  - name: LANGFUSE_HOST
    value: "https://langfuse.example.ae"

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
  hosts:
    - host: llm.example.ae
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: litellm-tls
      hosts: [llm.example.ae]

serviceMonitor:
  enabled: true
  namespace: monitoring
  labels:
    release: kube-prometheus-stack

# The config.yaml passed to the proxy. This is where the real policy lives.
proxy_config:
  model_list:
    - model_name: gpt-4o-uae-primary
      litellm_params:
        model: azure/gpt-4o
        api_base: os.environ/AZURE_OPENAI_ENDPOINT_UAE
        api_key: os.environ/AZURE_OPENAI_API_KEY_UAE
        api_version: "2024-08-01-preview"
      model_info:
        mode: chat
    - model_name: claude-sonnet-bedrock-me
      litellm_params:
        model: bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0
        aws_region_name: os.environ/AWS_REGION_NAME
    - model_name: claude-sonnet-fallback
      litellm_params:
        model: anthropic/claude-3-5-sonnet-20241022
        api_key: os.environ/ANTHROPIC_API_KEY

  # Model groups with fallbacks
  router_settings:
    routing_strategy: latency-based-routing
    fallbacks:
      - gpt-4o-uae-primary: [claude-sonnet-bedrock-me]
      - claude-sonnet-bedrock-me: [claude-sonnet-fallback]
    redis_host: os.environ/REDIS_HOST
    redis_port: os.environ/REDIS_PORT
    redis_password: os.environ/REDIS_PASSWORD
    num_retries: 2
    timeout: 60
    allowed_fails: 3
    cooldown_time: 30

  # Global settings
  litellm_settings:
    drop_params: true
    success_callback: ["langfuse"]
    failure_callback: ["langfuse"]
    cache: true
    cache_params:
      type: redis
      host: os.environ/REDIS_HOST
      port: os.environ/REDIS_PORT
      password: os.environ/REDIS_PASSWORD
      ttl: 600
      # Cache only these call types; per-request opt-out is shown later
      supported_call_types: ["acompletion", "aembedding"]

  general_settings:
    master_key: os.environ/LITELLM_MASTER_KEY
    database_url: os.environ/DATABASE_URL
    store_model_in_db: true
    alerting: ["slack"]
    alerting_threshold: 300
    budget_duration: 30d
    max_budget: 50000           # USD per month org-wide hard cap
    # Prevents key leakage in logs
    redact_messages_in_exceptions: true

Install:

helm upgrade --install litellm litellm/litellm \
  --namespace llm-gateway \
  --values values.prod.yaml \
  --version 0.5.0 \
  --wait --timeout 10m

Check the proxy is healthy:

kubectl exec -n llm-gateway deploy/litellm -- \
  curl -s http://localhost:4000/health/liveliness
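
Then push a request end to end through the gateway. A quick smoke test using the master key and the gpt-4o-uae-primary alias defined in the config above (swap in a virtual key once you've issued one):

curl -s https://llm.example.ae/v1/chat/completions \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-uae-primary",
    "messages": [{"role": "user", "content": "ping"}]
  }'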

Virtual key management

Once the proxy is up, the master key unlocks the admin API. Create a team and issue virtual keys:

# Create a team with a monthly budget
curl -X POST https://llm.example.ae/team/new \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "customer-support",
    "max_budget": 2000,
    "budget_duration": "30d",
    "tpm_limit": 100000,
    "rpm_limit": 1000,
    "models": ["gpt-4o-uae-primary", "claude-sonnet-bedrock-me"]
  }'

# Issue a key for that team
curl -X POST https://llm.example.ae/key/generate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "team_id": "team_cx_abc123",
    "key_alias": "customer-support-prod",
    "max_budget": 500,
    "budget_duration": "30d",
    "metadata": {"environment": "production", "owner": "ai-platform"}
  }'

The returned key string is the only thing the team’s application ever sees. If it leaks, rotate it without touching provider keys:

curl -X POST https://llm.example.ae/key/regenerate \
  -H "Authorization: Bearer $MASTER_KEY" \
  -d '{"key": "sk-litellm-old-leaked-key"}'

Finance can now pull real spend data from Postgres:

SELECT
  t.team_alias,
  date_trunc('day', s."startTime") AS day,
  SUM(s.spend) AS daily_spend_usd,
  COUNT(*) AS requests
FROM "LiteLLM_SpendLogs" s
JOIN "LiteLLM_TeamTable" t ON t.team_id = s.team_id
WHERE s."startTime" >= NOW() - INTERVAL '30 days'
GROUP BY 1, 2
ORDER BY 2 DESC, 3 DESC;

This query is the one that closes the governance gap everyone complains about.

Prompt caching

For RAG or chatbot workloads with repeated prompts, Redis-backed caching saves both money and latency. The config above enables it globally. To opt a specific call in or out:

from openai import OpenAI

client = OpenAI(
    base_url="https://llm.example.ae",
    api_key="sk-litellm-abc123"
)

# Cached - same prompt returns cached response for 600s
r = client.chat.completions.create(
    model="gpt-4o-uae-primary",
    messages=[{"role": "user", "content": "Summarize doc X"}],
    extra_body={"cache": {"no-cache": False}}
)

# Force-bypass cache (e.g., user explicitly re-ran)
r = client.chat.completions.create(
    model="gpt-4o-uae-primary",
    messages=[{"role": "user", "content": "Summarize doc X"}],
    extra_body={"cache": {"no-cache": True}}
)

Typical cache hit rates we see in production:

  • Deterministic agent tool calls: 40-70%
  • RAG with stable document sets: 15-30%
  • Free-form chatbot: < 5%

Even a 15% hit rate at GPT-4o-class pricing saves meaningful money on a busy cluster.
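
A back-of-envelope sketch of what that looks like, with assumed traffic and GPT-4o-class list pricing - swap in your own volumes and contract rates:

# Assumptions: 1M requests/month, ~1,500 input + 500 output tokens per request,
# list pricing of $2.50 per 1M input tokens and $10.00 per 1M output tokens.
requests_per_month = 1_000_000
input_tokens, output_tokens = 1_500, 500
price_in, price_out = 2.50 / 1_000_000, 10.00 / 1_000_000
hit_rate = 0.15

cost_per_request = input_tokens * price_in + output_tokens * price_out
monthly_cost = requests_per_month * cost_per_request   # ~$8,750
savings = monthly_cost * hit_rate                       # ~$1,300 saved
print(f"without cache: ${monthly_cost:,.0f}, saved by caching: ${savings:,.0f}")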

Rate limiting and budget enforcement

LiteLLM enforces three tiers of limits: org-level (set in general_settings.max_budget), team-level (per-team budget and RPM/TPM), and key-level (per-key budget and limits). Limits are evaluated in that order - the most restrictive wins.

Common patterns:

  • Dev/staging isolation - create a team_alias: dev with max_budget: 200 and restrict to cheap models, then hand out keys freely (see the sketch after this list).
  • Customer-tier enforcement - one virtual key per customer tier, RPM limits match SLA. Free tier: 10 RPM, Pro: 500 RPM, Enterprise: 5000 RPM.
  • Cost-runaway circuit breaker - set alerting_threshold: 300 so anything that blows past its budget by 300% pages the oncall. We’ve caught two serious prompt-injection loops this way.
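
A sketch of the dev/staging pattern from the first bullet, reusing the /team/new endpoint from earlier (the model aliases are placeholders for whichever cheap models you expose):

curl -X POST https://llm.example.ae/team/new \
  -H "Authorization: Bearer $MASTER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "team_alias": "dev",
    "max_budget": 200,
    "budget_duration": "30d",
    "models": ["gpt-4o-mini-uae", "claude-haiku-bedrock-me"]
  }'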

Langfuse trace enrichment

With the callbacks configured, every request becomes a Langfuse trace. But the default metadata is thin. Enrich it at the call site:

import uuid

# request_id, user_id, session_id, and feature_flag_version come from your application context
r = client.chat.completions.create(
    model="gpt-4o-uae-primary",
    messages=[...],
    extra_body={
        "metadata": {
            "generation_name": "rag-answer",
            "generation_id": str(uuid.uuid4()),
            "trace_id": request_id,
            "trace_user_id": user_id,
            "session_id": session_id,
            "tags": ["prod", "rag", feature_flag_version]
        }
    }
)

In Langfuse you can then filter traces by user, session, feature flag, or virtual key - which is the view product teams actually use during incident reviews.

Security hardening

  • NetworkPolicy - default-deny in the llm-gateway namespace; allow only from client namespaces, to Postgres/Redis, and egress to provider IPs (a sketch follows this list)
  • Egress filtering - restrict egress to the specific provider hostnames (*.openai.azure.com, bedrock-runtime.*.amazonaws.com, etc.) via a service mesh or egress gateway
  • Master key rotation - quarterly, via external-secrets; the proxy reads from the secret on pod start, so rotation is a rolling restart
  • UI access - put the built-in UI behind your SSO (e.g., oauth2-proxy as a sidecar); don’t rely only on the basic-auth fallback in production
  • Redact messages in logs - redact_messages_in_exceptions: true keeps prompts out of exception traces. Add LITELLM_REDACT_UI_MESSAGES=true if the UI shouldn’t display raw prompts to non-admins.
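
A minimal sketch of the default-deny NetworkPolicy from the first bullet. The namespace labels and the data namespace are assumptions based on this guide's layout; DNS and provider egress still need to go through your egress gateway or mesh:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: litellm-default-deny
  namespace: llm-gateway
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: litellm
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              llm-gateway-client: "true"    # label your client namespaces
      ports:
        - port: 4000
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: data    # Postgres + Redis
      ports:
        - port: 5432
        - port: 6379
    - ports:                                       # DNS
        - port: 53
          protocol: UDP
    - ports:                                       # HTTPS to providers; tighten with an egress gateway
        - port: 443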

Observability

ServiceMonitor-scraped metrics worth alerting on:

  • litellm_total_tokens and litellm_spend_metric - track budgets
  • litellm_request_total_latency{model=...} - p95 and p99 per model; fallback is activating if primary p99 is climbing
  • litellm_deployment_state{deployment=...} - 0 = healthy, 1 = partial outage, 2 = complete outage. Any non-zero value sustained means a provider is degraded (alert sketch after this list).
  • litellm_cache_hits_total / litellm_total_requests - cache hit rate
  • Postgres pg_stat_activity - spend-logging can saturate if RPS outruns log-writer throughput; watch "idle in transaction" counts
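
A hedged PrometheusRule sketch for the provider-degradation case, assuming the kube-prometheus-stack CRDs the ServiceMonitor above already depends on (verify the exact label names your LiteLLM version emits):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: litellm-provider-degraded
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: litellm
      rules:
        - alert: LiteLLMProviderDegraded
          expr: max by (litellm_model_name) (litellm_deployment_state) > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "LiteLLM upstream {{ $labels.litellm_model_name }} is in partial or complete outage"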

Sizing tiers

Tier     Requests/sec   Proxy replicas         Postgres                 Redis                      Est. monthly cost (AED, EKS me-central-1)
Small    <50            3 × 500m CPU / 1 GB    db.t3.medium, 50 GB      cache.t3.small             ~4,000
Medium   50-500         6 × 1 CPU / 2 GB       db.r6g.large, 200 GB     cache.r6g.medium           ~18,000
Large    500-5000       20 × 2 CPU / 4 GB      db.r6g.2xlarge, 500 GB   cache.r6g.xlarge cluster   ~75,000

The proxy itself is cheap; most of the cost sits with the downstream providers. Track upstream LLM spend separately.

Common failure modes we’ve debugged

  • Requests hang for 60s then time out - one provider in a fallback chain has a wedged connection and the chart’s default retry config doesn’t close it. Set num_retries: 2, timeout: 30, allowed_fails: 3, cooldown_time: 30.
  • Budget enforcement lags by minutes - the proxy caches budget state per pod. When a key exceeds the limit, other pods may still accept requests until the cache refreshes. Force tighter consistency with general_settings.disable_spend_logs: false and shorter cache TTL, or accept the lag (usually fine for daily budgets).
  • Postgres hits connection limit under load - the default per-pod connection pool is too large. Set DATABASE_CONNECTION_LIMIT=20 and size Postgres for pod_count × 20 connections plus overhead.
  • “Model not found” errors for models in the config - the proxy reads config.yaml on boot only. After changing the config via the UI, reload pods. With STORE_MODEL_IN_DB=True this is less of an issue, but you still need a rolling restart for router_settings changes.
  • Langfuse traces missing batches of requests - the Langfuse callback is async with a batched flush. On pod termination, in-flight batches can be lost. Set terminationGracePeriodSeconds: 60 and a preStop sleep of 30s to let the flush complete (sketch below).
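
A sketch of that graceful-shutdown tweak from the last bullet, expressed as extra Helm values. Whether the chart exposes these pod-spec fields depends on the chart version - if yours doesn't, apply the same fields with a post-render patch or kustomize overlay:

# values.prod.yaml additions (if your chart version supports them)
terminationGracePeriodSeconds: 60
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]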

What this connects to

LiteLLM is the policy and routing layer of a production LLM stack. Pair it with:

  • Langfuse for tracing and evaluation - see Deploy Langfuse on Kubernetes
  • Qdrant as the retrieval layer for RAG - see Deploy Qdrant on Kubernetes
  • vLLM as a self-hosted provider option, registered in the LiteLLM model list like any other OpenAI-compatible endpoint
  • KEDA if you need request-rate-based autoscaling beyond plain HPA

Our pillar post stitches this into a complete production RAG reference architecture.

Getting help

We deploy LiteLLM as the gateway layer for GCC AI platforms with mixed Azure UAE / Bedrock Bahrain / self-hosted topologies - the kind of multi-provider setup that regulated industries actually need. If you want help with sizing, a virtual-key taxonomy for your teams, or a cutover from direct provider usage to gateway-mediated access, AI/ML Infrastructure on K8s is the engagement. Typical rollout is 2-3 weeks.

Frequently Asked Questions

What problem does LiteLLM actually solve?

LiteLLM proxy sits between your applications and LLM providers (OpenAI, Anthropic, Azure OpenAI, Bedrock, Vertex, Cohere, self-hosted vLLM). It centralizes four things that every enterprise LLM program needs: (1) a single OpenAI-compatible endpoint so apps don't couple to any one provider, (2) virtual API keys with per-team budgets and rate limits, (3) automatic fallback and load balancing across providers, and (4) uniform logging and cost attribution. Without it, every team builds their own half-baked version in application code.

Is LiteLLM production-ready for enterprise?

Yes. The proxy is used in production by companies including Lemonade, Adobe, Rocket Money, and Netflix at millions of requests per day. The Python FastAPI service scales horizontally and is stateless when backed by Postgres for key/budget state and Redis for caching. The operational risks are standard for any high-throughput API gateway: connection pool tuning, Postgres write amplification from spend logging, and Redis sizing for the cache hit rate you want.

How do virtual keys in LiteLLM work?

A virtual key is a LiteLLM-issued token mapped to a team, user, or application in the proxy's Postgres database. Each key can have an allowed model list, a monthly or daily budget in USD, a requests-per-minute limit, and custom metadata. When a request comes in with Authorization: Bearer sk-litellm-abc123, the proxy looks up the key, enforces the limits, rewrites the request to the right provider with the real provider API key, and logs the spend. Applications never see provider keys.

How do I integrate LiteLLM with Langfuse for observability?

LiteLLM has native Langfuse integration. Set success_callback: ["langfuse"] in the proxy config and provide LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST as env vars. Every request becomes a Langfuse trace with model, tokens, latency, cost, and virtual key attribution. This is the standard pattern we deploy: LiteLLM owns policy and routing, Langfuse owns traces and evaluation.

Can LiteLLM handle provider fallback when one region is down?

Yes, and this is one of the best reasons to deploy it. Define a model group with multiple litellm_params entries - for example, Azure OpenAI UAE North primary with OpenAI US as fallback. If the primary returns a 5xx or times out, the proxy retries against the next provider in the group. Combine with router_settings.routing_strategy: latency-based-routing to always prefer the fastest healthy upstream. We use this pattern for GCC workloads that require UAE residency during normal operation but tolerate regional failover for availability.

How do I deploy LiteLLM in a UAE-sovereign setup?

Deploy the proxy into an in-region Kubernetes cluster and point it at in-region LLM providers: Azure OpenAI in UAE North, Bedrock in Middle East Bahrain, or self-hosted vLLM on your own GPU cluster. Postgres and Redis also stay in-region. The key design choice is keeping all success_callback destinations (Langfuse, S3 log sink) in the same region so trace data doesn't egress. LiteLLM itself has no phone-home telemetry - disable the update check via LITELLM_DISABLE_VERSION_CHECK=true.

Get Started for Free

We would be happy to speak with you and arrange a free consultation with our Kubernetes Expert in Dubai, UAE. 30-minute call, actionable results in days.

Talk to an Expert