April 22, 2026 · 10 min read

Deploy Qdrant on Kubernetes: Production HA Guide (2026)

Run Qdrant vector database in production on Kubernetes: HA cluster topology, sharding and replication, memory sizing for HNSW, snapshots to S3, API key security, NetworkPolicy, and GCC data-sovereign deployment patterns.

Qdrant overtook Weaviate, Milvus, and Pinecone on new-project adoption in 2025 because it nails the trade-off RAG teams care about: sub-50ms filtered vector search at 100M+ vectors, with a deployment story simple enough that one engineer can operate it. Running Qdrant on Kubernetes in production is straightforward if you respect one rule: it’s a stateful, memory-bound workload, not a stateless web service.

This guide covers the topology we deploy for clients - HA cluster, per-collection sharding, snapshots to S3, and the security controls GCC audits look for.

When Qdrant is the right vector DB choice

The vector DB market has converged on three serious options: Qdrant, Milvus, and pgvector. Quick orientation:

  • Qdrant - best general-purpose choice. Rust-native, excellent filtered search, simplest ops. Go-to for RAG applications under 1B vectors.
  • Milvus - higher ceiling at extreme scale (10B+ vectors) but operationally heavier. Uses etcd, Pulsar/Kafka, MinIO - a full distributed system in a box.
  • pgvector - correct if you already operate Postgres, don’t need sub-50ms latency, and are under 10M vectors.

If you’re here, you’ve picked Qdrant. The rest of this post assumes you know why.

Architecture refresher

                    Client (your RAG app)
                          │
                  HTTP/gRPC (TLS + API key)
                          │
                          ▼
                  ┌────────────────┐
                  │    Ingress     │  (cert-manager, ingress-nginx)
                  └───────┬────────┘
                          │
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
    ┌──────────┐    ┌──────────┐    ┌──────────┐
    │ qdrant-0 │◄──►│ qdrant-1 │◄──►│ qdrant-2 │   Raft consensus on :6335
    │  peer    │    │  peer    │    │  peer    │
    │ shard A1 │    │ shard A2 │    │ shard B1 │   Collection shards distributed
    │ shard B2 │    │ shard B1 │    │ shard A1 │   across peers with
    └─────┬────┘    └─────┬────┘    └─────┬────┘   replication_factor=2
          │               │               │
          ▼               ▼               ▼
       PVC gp3         PVC gp3         PVC gp3      One PVC per peer
          │               │               │
          └───────────────┼───────────────┘
                          ▼
                    ┌──────────┐
                    │    S3    │   Snapshots (backup only)
                    └──────────┘

Invariants:

  • Peers are symmetric. Every Qdrant pod runs the same binary and participates in Raft. Any peer can serve reads and writes; Raft elects a leader only to order cluster metadata changes, not to route data traffic.
  • Shards are per-collection. Collection A can have 4 shards, Collection B can have 12 shards, distributed independently.
  • Replication is per-collection. A collection’s replication_factor determines how many peers hold a copy of each shard.
  • Storage is local. Each peer has its own PVC. Data is not shared via network filesystems.
  • Consensus is on port 6335. Client traffic is on 6333 (HTTP) and 6334 (gRPC). Firewall them independently.
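
Once you have a collection (creation is covered below), you can verify the shard-placement invariants directly: the per-collection cluster endpoint lists which shards live on which peer and in what state. A quick check, assuming a collection named documents and the pod/namespace names from the Helm install later in this guide:

kubectl exec -n vectordb qdrant-0 -- \
  curl -s -H "api-key: $QDRANT_KEY" \
  "http://localhost:6333/collections/documents/cluster" | jq '.result'
# .result.local_shards / .result.remote_shards show shard ids, owning peers, and replica state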

Prerequisites

kubectl version --client            # 1.28+
helm version                        # 3.14+

Cluster add-ons:

  • cert-manager for TLS
  • ingress-nginx (or a gRPC-capable gateway)
  • external-secrets-operator to sync API keys from your secrets backend
  • prometheus-operator for ServiceMonitor scraping
  • A fast SSD StorageClass (gp3, pd-ssd, Premium_LRS) - this is non-negotiable

Sizing: the one thing that matters

Before you write a line of YAML, compute your RAM budget. Qdrant’s HNSW index lives in memory for any collection that is queried frequently. The rough formula for a float32 vector collection:

RAM per vector ≈ (vector_dim × 4 bytes) + (HNSW graph overhead ≈ m × 12 bytes) + payload_index_overhead
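
As a back-of-envelope check of that formula (illustrative numbers only: 1M vectors, 768 dimensions, m=16, float32, no quantization):

python3 -c "v, dim, m = 1_000_000, 768, 16; print(f'{v * (dim*4 + m*12) / 2**30:.1f} GiB')"
# ≈ 3.0 GiB per replica, before payload indexes and query working memory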

Practical heuristics we use for sizing quotes:

Collection size (vectors) | Dimension | Default HNSW | Quantization | RAM needed per replica
1M                        | 768       | m=16         | none         | ~3 GB
10M                       | 768       | m=16         | none         | ~15 GB
100M                      | 768       | m=16         | scalar int8  | ~15 GB (with quantization)
100M                      | 768       | m=16         | none         | ~120 GB (split across shards)
1B                        | 1536      | m=32         | product      | ~60 GB per shard, 10 shards minimum

Above 50M vectors, always turn on quantization. Scalar int8 gives ~4x compression with roughly 1-2% recall loss. Product quantization gives up to 32x compression for larger accuracy trade-offs. Measure on your real data, not the defaults.
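
Quantization can also be switched on for an existing collection without recreating it, via the update-collection endpoint. A hedged sketch for int8 scalar quantization kept in RAM (endpoint shown against the ingress host configured later in this guide; re-measure recall on your own queries afterwards):

curl -X PATCH "https://qdrant.example.ae/collections/documents" \
  -H "api-key: $QDRANT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "quantization_config": {
      "scalar": {"type": "int8", "quantile": 0.99, "always_ram": true}
    }
  }'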

Helm install: production values

Add the repo:

helm repo add qdrant https://qdrant.to/helm
helm repo update

kubectl create namespace vectordb
kubectl label namespace vectordb \
  pod-security.kubernetes.io/enforce=restricted

Production values.yaml:

# values.prod.yaml
image:
  tag: "v1.12.4"                 # pin exact version

replicaCount: 3

resources:
  requests:
    cpu: "4"
    memory: "32Gi"
  limits:
    memory: "32Gi"               # hard limit == request for predictable scheduling

persistence:
  size: 500Gi
  storageClassName: gp3
  accessModes: [ReadWriteOnce]

podDisruptionBudget:
  enabled: true
  minAvailable: 2                # cluster of 3 tolerates 1 disruption

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: qdrant

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values: [qdrant]
        topologyKey: kubernetes.io/hostname

config:
  service:
    api_key: null                # set via env from secret
    read_only_api_key: null
  tls:
    cert: /qdrant/tls/tls.crt
    key: /qdrant/tls/tls.key
    ca_cert: /qdrant/tls/ca.crt
  cluster:
    enabled: true
    p2p:
      port: 6335
    consensus:
      tick_period_ms: 100
  storage:
    # write-ahead log sizing
    wal:
      wal_capacity_mb: 64          # size of a single WAL segment
      wal_segments_ahead: 0        # extra segments pre-allocated ahead of need
    performance:
      max_search_threads: 0       # 0 = use all available
      max_optimization_runtime_threads: 2
    optimizers:
      default_segment_number: 0   # auto
      indexing_threshold: 20000

env:
  - name: QDRANT__SERVICE__API_KEY
    valueFrom:
      secretKeyRef:
        name: qdrant-api-keys
        key: api-key
  - name: QDRANT__SERVICE__READ_ONLY_API_KEY
    valueFrom:
      secretKeyRef:
        name: qdrant-api-keys
        key: read-only-api-key

service:
  type: ClusterIP
  # separate p2p service handled by chart

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
    nginx.ingress.kubernetes.io/grpc-backend: "true"
  hosts:
    - host: qdrant.example.ae
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: qdrant-ingress-tls
      hosts: [qdrant.example.ae]

metrics:
  serviceMonitor:
    enabled: true
    namespace: monitoring
    labels:
      release: kube-prometheus-stack

Install:

helm upgrade --install qdrant qdrant/qdrant \
  --namespace vectordb \
  --values values.prod.yaml \
  --version 1.12.4 \
  --wait --timeout 10m

Verify cluster formation:

kubectl exec -n vectordb qdrant-0 -- \
  curl -s -H "api-key: $QDRANT_KEY" http://localhost:6333/cluster | jq

You should see three entries under result.peers, the same result.raft_info.term on every pod, and pending_operations at or near zero.

Collection design: shards and replicas

This is where most teams leave performance and availability on the table. Create collections explicitly, not with defaults:

curl -X PUT "https://qdrant.example.ae/collections/documents" \
  -H "api-key: $QDRANT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vectors": {
      "size": 768,
      "distance": "Cosine",
      "on_disk": false
    },
    "shard_number": 6,
    "replication_factor": 2,
    "write_consistency_factor": 1,
    "hnsw_config": {
      "m": 16,
      "ef_construct": 100,
      "full_scan_threshold": 10000
    },
    "quantization_config": {
      "scalar": {
        "type": "int8",
        "quantile": 0.99,
        "always_ram": true
      }
    },
    "optimizers_config": {
      "indexing_threshold": 20000,
      "memmap_threshold": 50000
    }
  }'

Shard-count guidance:

  • Start with shard_number = peer_count × 2. Six shards on a three-peer cluster gives good rebalance headroom.
  • replication_factor = 2 for normal production. Use 3 if your SLA allows zero query failures during a rolling upgrade.
  • write_consistency_factor = 1 is the common choice. Set to 2 for stronger consistency at the cost of write throughput.
  • Pin on_disk: false for hot collections (RAM-resident vectors). Set true only for cold archives queried rarely.
  • Always enable quantization_config with always_ram: true above 50M vectors.
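
Filtered search is only fast when the payload fields you filter on are indexed - that is the payload_index_overhead term in the sizing formula earlier. A sketch for a hypothetical tenant_id keyword field on the collection created above:

curl -X PUT "https://qdrant.example.ae/collections/documents/index" \
  -H "api-key: $QDRANT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "field_name": "tenant_id",
    "field_schema": "keyword"
  }'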

Snapshots and backup

Qdrant’s snapshot API writes a .snapshot file per shard to local disk. Production pattern: CronJob calls the snapshot endpoint, then uploads the file to S3.

apiVersion: batch/v1
kind: CronJob
metadata:
  name: qdrant-snapshot
  namespace: vectordb
spec:
  schedule: "0 2 * * *"          # 02:00 daily
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 7
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: qdrant-snapshot
          containers:
            - name: snapshot
              # NOTE: the script below needs curl and jq in addition to the AWS CLI;
              # bake them into a small custom image if this base image doesn't ship them
              image: amazon/aws-cli:2.17.0
              env:
                - name: QDRANT_URL
                  value: "http://qdrant.vectordb.svc.cluster.local:6333"
                - name: QDRANT_API_KEY
                  valueFrom:
                    secretKeyRef:
                      name: qdrant-api-keys
                      key: api-key
                - name: S3_BUCKET
                  value: "qdrant-snapshots-me-central-1"
              command:
                - /bin/sh
                - -c
                - |
                  set -euo pipefail
                  for COLLECTION in documents chunks embeddings; do
                    SNAP=$(curl -sf -X POST \
                      -H "api-key: $QDRANT_API_KEY" \
                      "$QDRANT_URL/collections/$COLLECTION/snapshots" \
                      | jq -r '.result.name')
                    echo "Created snapshot $SNAP for $COLLECTION"
                    curl -sf -H "api-key: $QDRANT_API_KEY" \
                      "$QDRANT_URL/collections/$COLLECTION/snapshots/$SNAP" \
                      -o /tmp/$SNAP
                    aws s3 cp /tmp/$SNAP \
                      "s3://$S3_BUCKET/$(date +%F)/$COLLECTION/$SNAP" \
                      --sse AES256
                    curl -sf -X DELETE -H "api-key: $QDRANT_API_KEY" \
                      "$QDRANT_URL/collections/$COLLECTION/snapshots/$SNAP"
                    rm -f /tmp/$SNAP
                  done

Add a lifecycle rule on the S3 bucket: 30 days in Standard, 90 in Standard-IA, delete after 365.
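
One way to express that lifecycle with the AWS CLI - transition to Standard-IA after 30 days and expire at 365; add a further transition (for example to Glacier at day 120) if you want the 90-day IA window spelled out explicitly:

aws s3api put-bucket-lifecycle-configuration \
  --bucket qdrant-snapshots-me-central-1 \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "qdrant-snapshot-retention",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
      "Expiration": {"Days": 365}
    }]
  }'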

Restore drill. Every quarter, run:

# On a scratch cluster, restore the latest snapshot and run a known query.
# Qdrant pulls the snapshot from an HTTP(S) URL or local file path, so presign the S3 object first:
SNAP_URL=$(aws s3 presign "s3://qdrant-snapshots-me-central-1/2026-04-21/documents/..." --expires-in 3600)
curl -X PUT "https://qdrant-dr.example.ae/collections/documents/snapshots/recover" \
  -H "api-key: $QDRANT_KEY" \
  -H "Content-Type: application/json" \
  -d "{\"location\": \"$SNAP_URL\"}"

If the restore doesn’t complete within your RTO budget, raise snapshot frequency or shard into smaller collections.

Network isolation

Qdrant exposes three sensitive ports:

  • 6333 (HTTP) - client traffic. Authenticated by API key, but rate-limit anyway.
  • 6334 (gRPC) - client traffic. Same treatment.
  • 6335 (P2P/Raft) - inter-peer consensus. Never expose outside the namespace.

Default-deny NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: qdrant-default-deny
  namespace: vectordb
spec:
  podSelector: {}
  policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: qdrant-cluster-internal
  namespace: vectordb
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: qdrant
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: qdrant
      ports:
        - protocol: TCP
          port: 6335
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ingress-nginx
      ports:
        - protocol: TCP
          port: 6333
        - protocol: TCP
          port: 6334
  egress:
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: qdrant
      ports:
        - protocol: TCP
          port: 6335
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53

Tighten the ingress allow rule to only specific client namespaces once you know them.
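
For example, if the RAG service runs in a hypothetical rag-app namespace, add a from-block like this fragment alongside (or in place of) the ingress-nginx rule for ports 6333/6334:

    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: rag-app
          podSelector:
            matchLabels:
              app.kubernetes.io/name: rag-api   # hypothetical client label
      ports:
        - protocol: TCP
          port: 6333
        - protocol: TCP
          port: 6334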

Observability

Qdrant exposes Prometheus metrics at /metrics. Dashboards to build on day one:

  • Collection health - qdrant_collections_total, qdrant_collection_segments_total, qdrant_collection_vectors_total
  • Query latency - qdrant_rest_responses_duration_seconds bucketed. Alert p99 > 100ms.
  • Indexing progress - qdrant_collection_indexing_operations_total divided by total points. Stuck indexing is a common failure mode.
  • Cluster consensus - qdrant_cluster_peers and qdrant_cluster_pending_operations. Pending operations piling up means a peer is struggling.
  • Memory pressure - container_memory_working_set_bytes / pod_memory_limit. Above 90% means you’re swapping HNSW, which murders latency.
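
A PrometheusRule sketch for that memory-pressure alert, relying on standard cAdvisor and kube-state-metrics series (the qdrant container name and vectordb namespace follow this guide's install; adjust to your labels):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: qdrant-memory-pressure
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: qdrant
      rules:
        - alert: QdrantMemoryPressure
          expr: |
            max by (pod) (container_memory_working_set_bytes{namespace="vectordb", container="qdrant"})
              /
            max by (pod) (kube_pod_container_resource_limits{namespace="vectordb", container="qdrant", resource="memory"})
              > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Qdrant pod {{ $labels.pod }} is above 90% of its memory limit"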

Ship query traces to your existing OTel collector from the application side - instrument your RAG service's Qdrant calls so retrieval latency shows up next to generation spans.

Sizing tiers

Tier   | Vectors  | Peers                                      | RAM/peer | Storage/peer | Est. monthly cost (AED, EKS me-central-1)
Small  | <10M     | 3 × r6i.xlarge                             | 32 GB    | 500 GB gp3   | ~10,000
Medium | 10M-100M | 3 × r6i.2xlarge + int8 quant               | 64 GB    | 1 TB gp3     | ~28,000
Large  | 100M-1B  | 6 × r6i.4xlarge + product quant            | 128 GB   | 2 TB gp3     | ~90,000
XL     | 1B+      | 9+ × r6i.8xlarge + product quant + on_disk | 256 GB   | 4 TB io2     | ~280,000

Add 20% for snapshot storage and DR cluster.

GCC data sovereignty checklist

For UAE clients:

  • Qdrant cluster, snapshot bucket, and any caching/CDN layer in the same in-region cloud zone
  • TLS mandatory on 6333/6334, mTLS between peers on 6335
  • API key rotation automated (external-secrets + Secrets Manager / Key Vault)
  • Snapshot bucket encrypted with customer-managed KMS key
  • Audit log from ingress controller shipped to client SIEM
  • No Qdrant Cloud and no phone-home telemetry - set telemetry_disabled: true in the config (snippet after this list)
  • RBAC: only the RAG application’s ServiceAccount can hit the Qdrant service; no kubectl port-forward exceptions in prod namespace
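
A minimal way to enforce the telemetry item in the Helm values used earlier - assuming the chart maps config: into Qdrant's config file as it does for the other keys in this guide (the env-var form QDRANT__TELEMETRY_DISABLED=true is equivalent):

# values.prod.yaml (addition)
config:
  telemetry_disabled: true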

Common failure modes we’ve debugged

  • Query latency spikes every N minutes - segment optimization is competing with search for CPU. Lower max_optimization_runtime_threads (for example to 1, as in the values file above) or schedule heavy re-indexing during off-peak hours.
  • One peer falls out of sync after a node drain - minAvailable: 2 was missing on the PDB, so two peers drained simultaneously. Re-add the PDB.
  • OOM kills during bulk upsert - the default indexing_threshold: 20000 is too aggressive at scale. Set it to 0 to disable indexing during the bulk load (or raise it to 100,000), then restore it with an update-collection call afterwards - see the sketch after this list.
  • Snapshots succeed but restores fail with “invalid snapshot” - the snapshot was downloaded or re-uploaded by a client that transcodes binary data. Stick to binary-safe tools such as curl -o and aws s3 cp, and verify the file sizes match.
  • Recall drops mysteriously after enabling quantization - the quantile: 0.99 setting trims outliers too aggressively for some distributions. Try 1.0 and measure.
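
The bulk-upsert fix, sketched out: disable indexing before the load, then restore the threshold afterwards so the optimizer builds the HNSW graph in one pass.

# Before the bulk upsert: stop building HNSW while points stream in
curl -X PATCH "https://qdrant.example.ae/collections/documents" \
  -H "api-key: $QDRANT_KEY" -H "Content-Type: application/json" \
  -d '{"optimizers_config": {"indexing_threshold": 0}}'

# ... run the bulk load ...

# After the load: restore the threshold and let indexing run in the background
curl -X PATCH "https://qdrant.example.ae/collections/documents" \
  -H "api-key: $QDRANT_KEY" -H "Content-Type: application/json" \
  -d '{"optimizers_config": {"indexing_threshold": 20000}}'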

What this connects to

Qdrant is the retrieval half of a RAG stack. The full production pattern:

  • Embedding pipeline - vLLM or text-embeddings-inference serving the embedding model on GPU nodes
  • LLM gateway - LiteLLM fronting the generation model
  • Observability - Langfuse capturing retrieval + generation traces (see our Langfuse on K8s guide)
  • Orchestration - LangGraph, LlamaIndex, or custom RAG service

Our upcoming pillar post stitches these together into a full reference architecture.

Getting help

We deploy and operate Qdrant in production for GCC AI teams running customer-facing RAG - including regulated fintech and government workloads on sovereign cloud. If you want a capacity plan, a topology review, or a migration from Pinecone or Qdrant Cloud, our AI/ML Infrastructure on K8s engagement is the entry point. Typical time to production: 2-4 weeks depending on data volume.

Frequently Asked Questions

How much memory does Qdrant need in production?

Qdrant holds the HNSW index in memory for every collection that is queried frequently. Budget roughly 1.5 KB of RAM per vector for a 768-dimension embedding at default HNSW parameters (m=16, ef_construct=100). A 10M-vector collection therefore needs ~15 GB of RAM just for the index, before payload storage or query working memory. At 100M vectors, use scalar or product quantization to compress the index by 4-32x.

Should I use the Qdrant Operator or the Helm chart?

Use the Qdrant Operator for production. The Helm chart is fine for single-node or static clusters, but the operator handles rolling upgrades with consensus awareness, automatic snapshot scheduling, shard rebalancing, and CRD-driven collection management. The chart requires manual coordination when scaling nodes or recovering from failures. For clusters under five nodes that rarely change, the chart is acceptable; above that, the operator pays for itself within a quarter.

How does Qdrant handle high availability?

Qdrant uses Raft consensus across cluster peers plus per-collection shard replication. Set replication_factor: 2 or 3 on production collections so each shard lives on multiple nodes. Combined with pod anti-affinity across K8s nodes and a minimum of 3 Qdrant peers, a single-node failure causes no downtime or data loss. Replication factor 1 is a common misconfiguration that looks like HA because there are multiple pods but actually has no fault tolerance.

How do I back up Qdrant running on Kubernetes?

Qdrant provides a native snapshot API that writes per-shard snapshots to local disk, which you then replicate to S3. The Qdrant Operator automates this via a QdrantSnapshot CRD with a cron schedule. For deployments without the operator, run a CronJob that calls POST /collections/{name}/snapshots and uploads the output to S3 with server-side encryption. Restore is per-collection, so coordinate multi-collection restores manually. Test the restore path quarterly - several teams we've worked with discovered their backup job had been silently failing for months.

What storage class should I use for Qdrant on EKS, AKS, and GKE?

Use fast NVMe-class SSDs: gp3 (EBS) or io2 on EKS, Premium_LRS or UltraSSD_LRS on AKS, and pd-ssd on GKE. Avoid standard HDD classes or network-attached cold storage - Qdrant performs lots of small random reads during HNSW construction and filtered queries. For extreme latency budgets, use local NVMe with a backup-and-replace strategy instead of network storage, but only if you have replication factor 2+ so you can tolerate node loss.

Can I run Qdrant in a UAE data-sovereign environment?

Yes. Qdrant is open-source and has no external dependencies beyond object storage for snapshots. Deploy it into an in-region Kubernetes cluster (Azure UAE North, AWS Middle East Bahrain or UAE, or a sovereign cloud like Core42) with the snapshot bucket in the same region. Qdrant Cloud hosts data in EU/US/APAC regions and is not suitable for workloads covered by NESA or CBUAE data residency requirements.

Get Started for Free

We would be happy to speak with you and arrange a free consultation with our Kubernetes Expert in Dubai, UAE. 30-minute call, actionable results in days.

Talk to an Expert