Deploy Qdrant on Kubernetes: Production HA Guide (2026)
Run Qdrant vector database in production on Kubernetes: HA cluster topology, sharding and replication, memory sizing for HNSW, snapshots to S3, API key security, NetworkPolicy, and GCC data-sovereign deployment patterns.
Qdrant overtook Weaviate, Milvus, and Pinecone on new-project adoption in 2025 because it nails the trade-off RAG teams care about: sub-50ms filtered vector search at 100M+ vectors, with a deployment story simple enough that one engineer can operate it. Running Qdrant on Kubernetes in production is straightforward if you respect one rule: it’s a stateful, memory-bound workload, not a stateless web service.
This guide covers the topology we deploy for clients - HA cluster, per-collection sharding, snapshots to S3, and the security controls GCC audits look for.
When Qdrant is the right vector DB choice
The vector DB market has converged on three serious options: Qdrant, Milvus, and pgvector. Quick orientation:
- Qdrant - best general-purpose choice. Rust-native, excellent filtered search, simplest ops. Go-to for RAG applications under 1B vectors.
- Milvus - higher ceiling at extreme scale (10B+ vectors) but operationally heavier. Uses etcd, Pulsar/Kafka, MinIO - a full distributed system in a box.
- pgvector - correct if you already operate Postgres, don’t need sub-50ms latency, and are under 10M vectors.
If you’re here, you’ve picked Qdrant. The rest of this post assumes you know why.
Architecture refresher
Client (your RAG app)
│
HTTP/gRPC (TLS + API key)
│
▼
┌────────────────┐
│ Ingress │ (cert-manager, ingress-nginx)
└───────┬────────┘
│
┌───────────────┼───────────────┐
▼ ▼ ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│ qdrant-0 │◄──►│ qdrant-1 │◄──►│ qdrant-2 │ Raft consensus on :6335
│ peer │ │ peer │ │ peer │
│ shard A1 │ │ shard A2 │ │ shard B1 │ Collection shards distributed
│ shard B2 │ │ shard B1 │ │ shard A1 │ across peers with
└─────┬────┘ └─────┬────┘ └─────┬────┘ replication_factor=2
│ │ │
▼ ▼ ▼
PVC gp3 PVC gp3 PVC gp3 One PVC per peer
│ │ │
└───────────────┼───────────────┘
▼
┌──────────┐
│ S3 │ Snapshots (backup only)
└──────────┘
Invariants:
- Peers are symmetric. Every Qdrant pod runs the same binary and participates in Raft. There is no leader/follower at the cluster level.
- Shards are per-collection. Collection A can have 4 shards, Collection B can have 12 shards, distributed independently.
- Replication is per-collection. A collection’s replication_factor determines how many peers hold a copy of each shard (the shard-placement check below shows this on a live cluster).
- Storage is local. Each peer has its own PVC. Data is not shared via network filesystems.
- Consensus is on port 6335. Client traffic is on 6333 (HTTP) and 6334 (gRPC). Firewall them independently.
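You can see these invariants on a running cluster. A minimal check, assuming the documents collection from the collection-design section below already exists and $QDRANT_KEY holds the API key - the per-collection cluster endpoint reports which shards each peer holds and any in-flight shard transfers:
# Inspect shard placement for one collection (jq runs on your workstation)
kubectl exec -n vectordb qdrant-0 -- \
  curl -s -H "api-key: $QDRANT_KEY" \
  http://localhost:6333/collections/documents/cluster | jq '.result'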
Prerequisites
kubectl version --client # 1.28+
helm version # 3.14+
Cluster add-ons:
- cert-manager for TLS
- ingress-nginx (or a gRPC-capable gateway)
- external-secrets-operator to sync API keys from your secrets backend
- prometheus-operator for ServiceMonitor scraping
- A fast SSD StorageClass (gp3, pd-ssd, Premium_LRS) - this is non-negotiable
Sizing: the one thing that matters
Before you write a line of YAML, compute your RAM budget. Qdrant’s HNSW index lives in memory for any collection that is queried frequently. The rough formula for a float32 vector collection:
RAM per vector ≈ (vector_dim × 4 bytes) + (HNSW graph overhead ≈ m × 12 bytes) + payload_index_overhead
Practical heuristics we use for sizing quotes:
| Collection size (vectors) | Dimension | Default HNSW | Quantization | RAM needed per replica |
|---|---|---|---|---|
| 1M | 768 | m=16 | none | ~3 GB |
| 10M | 768 | m=16 | none | ~15 GB |
| 100M | 768 | m=16 | scalar int8 | ~15 GB (with quantization) |
| 100M | 768 | m=16 | none | ~120 GB (split across shards) |
| 1B | 1536 | m=32 | product | ~60 GB per shard, 10 shards minimum |
Above 50M vectors, always turn on quantization. Scalar int8 gives ~4x compression with roughly 1-2% recall loss. Product quantization gives up to 32x compression for larger accuracy trade-offs. Measure on your real data, not the defaults.
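To turn the formula into a number you can put in a capacity plan, here is a rough shell calculation. The per-point payload-index overhead is a placeholder assumption - measure it on your real payloads - and the result is a floor, not a budget:
# Rough RAM estimate per replica from the formula above (float32, no quantization)
VECTORS=1000000       # 1M vectors
DIM=768               # embedding dimension
M=16                  # HNSW m parameter
PAYLOAD_BYTES=64      # assumed per-point payload index overhead - measure yours

BYTES_PER_VECTOR=$(( DIM * 4 + M * 12 + PAYLOAD_BYTES ))
TOTAL_GIB=$(( VECTORS * BYTES_PER_VECTOR / 1024 / 1024 / 1024 ))
echo "~${TOTAL_GIB} GiB RAM per replica, before query working memory"
# Prints ~3 GiB for this input, in line with the 1M row in the table above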
Helm install: production values
Add the repo:
helm repo add qdrant https://qdrant.to/helm
helm repo update
kubectl create namespace vectordb
kubectl label namespace vectordb \
pod-security.kubernetes.io/enforce=restricted
Production values.yaml:
# values.prod.yaml
image:
tag: "v1.12.4" # pin exact version
replicaCount: 3
resources:
requests:
cpu: "4"
memory: "32Gi"
limits:
memory: "32Gi" # hard limit == request for predictable scheduling
persistence:
size: 500Gi
storageClassName: gp3
accessModes: [ReadWriteOnce]
podDisruptionBudget:
enabled: true
minAvailable: 2 # cluster of 3 tolerates 1 disruption
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app.kubernetes.io/name: qdrant
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values: [qdrant]
topologyKey: kubernetes.io/hostname
config:
service:
api_key: null # set via env from secret
read_only_api_key: null
tls:
cert: /qdrant/tls/tls.crt
key: /qdrant/tls/tls.key
ca_cert: /qdrant/tls/ca.crt
cluster:
enabled: true
p2p:
port: 6335
consensus:
tick_period_ms: 100
storage:
# write-ahead log: segment size and number of pre-created segments
wal:
wal_capacity_mb: 64
wal_segments_ahead: 0
performance:
max_search_threads: 0 # 0 = use all available
max_optimization_runtime_threads: 2
optimizers:
default_segment_number: 0 # auto
indexing_threshold: 20000
env:
- name: QDRANT__SERVICE__API_KEY
valueFrom:
secretKeyRef:
name: qdrant-api-keys
key: api-key
- name: QDRANT__SERVICE__READ_ONLY_API_KEY
valueFrom:
secretKeyRef:
name: qdrant-api-keys
key: read-only-api-key
service:
type: ClusterIP
# separate p2p service handled by chart
ingress:
enabled: true
className: nginx
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
nginx.ingress.kubernetes.io/proxy-body-size: "100m"
nginx.ingress.kubernetes.io/grpc-backend: "true"
hosts:
- host: qdrant.example.ae
paths:
- path: /
pathType: Prefix
tls:
- secretName: qdrant-ingress-tls
hosts: [qdrant.example.ae]
metrics:
serviceMonitor:
enabled: true
namespace: monitoring
labels:
release: kube-prometheus-stack
Install:
helm upgrade --install qdrant qdrant/qdrant \
--namespace vectordb \
--values values.prod.yaml \
--version 1.12.4 \
--wait --timeout 10m
Verify cluster formation:
kubectl exec -n vectordb qdrant-0 -- \
curl -s -H "api-key: $QDRANT_KEY" http://localhost:6333/cluster | jq
You should see three peers, all in state: "Active", with the same raft_info.term.
Collection design: shards and replicas
This is where most teams leave performance and availability on the table. Create collections explicitly, not with defaults:
curl -X PUT "https://qdrant.example.ae/collections/documents" \
-H "api-key: $QDRANT_KEY" \
-H "Content-Type: application/json" \
-d '{
"vectors": {
"size": 768,
"distance": "Cosine",
"on_disk": false
},
"shard_number": 6,
"replication_factor": 2,
"write_consistency_factor": 1,
"hnsw_config": {
"m": 16,
"ef_construct": 100,
"full_scan_threshold": 10000
},
"quantization_config": {
"scalar": {
"type": "int8",
"quantile": 0.99,
"always_ram": true
}
},
"optimizers_config": {
"indexing_threshold": 20000,
"memmap_threshold": 50000
}
}'
Shard-count guidance:
- Start with shard_number = peer_count × 2. Six shards on a three-peer cluster give good rebalance headroom.
- replication_factor = 2 for normal production. Use 3 if your SLA cannot tolerate any query failures during a rolling upgrade.
- write_consistency_factor = 1 is the common choice. Set to 2 for stronger consistency at the cost of write throughput.
- Pin on_disk: false for hot collections (RAM-resident vectors). Set true only for cold archives queried rarely.
- Always enable quantization_config with always_ram: true above 50M vectors (see the query sketch after this list for how quantization interacts with search-time rescoring).
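A query sketch against the collection above, showing how the pieces interact at search time: the filter rides on the payload index, hnsw_ef trades latency for recall, and the quantization block rescoring with original vectors claws back most of the recall lost to int8. The tenant_id field and the vector values are placeholders - adapt them to your schema:
# The vector is truncated for readability - it must be 768 floats to match the collection
curl -s -X POST "https://qdrant.example.ae/collections/documents/points/search" \
  -H "api-key: $QDRANT_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "vector": [0.12, -0.03, 0.57],
    "limit": 10,
    "filter": {
      "must": [
        { "key": "tenant_id", "match": { "value": "acme" } }
      ]
    },
    "params": {
      "hnsw_ef": 128,
      "quantization": { "rescore": true, "oversampling": 2.0 }
    },
    "with_payload": true
  }'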
Snapshots and backup
Qdrant’s snapshot API writes a .snapshot file to the local disk of the peer that handles the request; in a multi-node cluster that snapshot covers only the shards held on that peer, so snapshot each peer (for example via the per-pod DNS names of the headless Service) rather than relying on a single load-balanced call. Production pattern: a CronJob calls the snapshot endpoint, then uploads the file to S3.
apiVersion: batch/v1
kind: CronJob
metadata:
name: qdrant-snapshot
namespace: vectordb
spec:
schedule: "0 2 * * *" # 02:00 daily
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 7
jobTemplate:
spec:
template:
spec:
restartPolicy: OnFailure
serviceAccountName: qdrant-snapshot
containers:
- name: snapshot
image: amazon/aws-cli:2.17.0
env:
- name: QDRANT_URL
value: "http://qdrant.vectordb.svc.cluster.local:6333"
- name: QDRANT_API_KEY
valueFrom:
secretKeyRef:
name: qdrant-api-keys
key: api-key
- name: S3_BUCKET
value: "qdrant-snapshots-me-central-1"
command:
- /bin/sh
- -c
- |
set -euo pipefail
for COLLECTION in documents chunks embeddings; do
SNAP=$(curl -sf -X POST \
-H "api-key: $QDRANT_API_KEY" \
"$QDRANT_URL/collections/$COLLECTION/snapshots" \
| jq -r '.result.name')
echo "Created snapshot $SNAP for $COLLECTION"
curl -sf -H "api-key: $QDRANT_API_KEY" \
"$QDRANT_URL/collections/$COLLECTION/snapshots/$SNAP" \
-o /tmp/$SNAP
aws s3 cp /tmp/$SNAP \
"s3://$S3_BUCKET/$(date +%F)/$COLLECTION/$SNAP" \
--sse AES256
curl -sf -X DELETE -H "api-key: $QDRANT_API_KEY" \
"$QDRANT_URL/collections/$COLLECTION/snapshots/$SNAP"
rm -f /tmp/$SNAP
done
Add a lifecycle rule on the S3 bucket: 30 days in Standard, 90 in Standard-IA, delete after 365.
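One way to express that retention policy with the AWS CLI - a sketch assuming the bucket name from the CronJob above; it transitions snapshots to Standard-IA at 30 days and deletes them at 365 (add a Glacier tier if you want an intermediate step):
cat > /tmp/qdrant-snapshot-lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "qdrant-snapshot-retention",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
EOF
aws s3api put-bucket-lifecycle-configuration \
  --bucket qdrant-snapshots-me-central-1 \
  --lifecycle-configuration file:///tmp/qdrant-snapshot-lifecycle.json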
Restore drill. Every quarter, run:
# On a scratch cluster, restore the latest snapshot and run a known query
curl -X PUT "https://qdrant-dr.example.ae/collections/documents/snapshots/recover" \
-H "api-key: $QDRANT_KEY" \
-H "Content-Type: application/json" \
-d '{"location": "s3://qdrant-snapshots-me-central-1/2026-04-21/documents/..."}'
If the restore doesn’t complete within your RTO budget, raise snapshot frequency or shard into smaller collections.
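A quick pass/fail check for the drill - compare the restored point count against the count recorded when the snapshot was taken ($EXPECTED_COUNT is a placeholder you populate from your snapshot job's logs), then run a handful of known queries and eyeball the results:
RESTORED=$(curl -sf -H "api-key: $QDRANT_KEY" \
  "https://qdrant-dr.example.ae/collections/documents" \
  | jq -r '.result.points_count')
echo "restored points: $RESTORED (expected: $EXPECTED_COUNT)"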
Network isolation
Qdrant exposes three sensitive ports:
- 6333 (HTTP) - client traffic. Authenticated by API key, but rate-limit anyway.
- 6334 (gRPC) - client traffic. Same treatment.
- 6335 (P2P/Raft) - inter-peer consensus. Never expose outside the namespace.
Default-deny NetworkPolicy:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: qdrant-default-deny
namespace: vectordb
spec:
podSelector: {}
policyTypes: [Ingress, Egress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: qdrant-cluster-internal
namespace: vectordb
spec:
podSelector:
matchLabels:
app.kubernetes.io/name: qdrant
ingress:
- from:
- podSelector:
matchLabels:
app.kubernetes.io/name: qdrant
ports:
- protocol: TCP
port: 6335
- from:
- namespaceSelector:
matchLabels:
kubernetes.io/metadata.name: ingress-nginx
ports:
- protocol: TCP
port: 6333
- protocol: TCP
port: 6334
egress:
- to:
- podSelector:
matchLabels:
app.kubernetes.io/name: qdrant
ports:
- protocol: TCP
port: 6335
- to:
- namespaceSelector: {}
podSelector:
matchLabels:
k8s-app: kube-dns
ports:
- protocol: UDP
port: 53
Tighten the ingress allow rule to only specific client namespaces once you know them.
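Because NetworkPolicy allow rules are additive, the cleanest way to do that is one extra policy per client namespace rather than editing the rules above. A sketch - the rag-app namespace is a placeholder for wherever your RAG service runs:
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: qdrant-allow-rag-app
  namespace: vectordb
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: qdrant
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: rag-app
      ports:
        - protocol: TCP
          port: 6333
        - protocol: TCP
          port: 6334
EOF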
Observability
Qdrant exposes Prometheus metrics at /metrics. Dashboards to build on day one:
- Collection health - qdrant_collections_total, qdrant_collection_segments_total, qdrant_collection_vectors_total
- Query latency - qdrant_rest_responses_duration_seconds, bucketed. Alert when p99 exceeds 100ms (example alert below).
- Indexing progress - qdrant_collection_indexing_operations_total divided by total points. Stuck indexing is a common failure mode.
- Cluster consensus - qdrant_cluster_peers and qdrant_cluster_pending_operations. Pending operations piling up means a peer is struggling.
- Memory pressure - container_memory_working_set_bytes against the pod memory limit. Above 90% means you’re swapping HNSW, which murders latency.
Ship query traces to your existing OTel collector - Qdrant v1.10+ supports OTel.
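A starting-point alert for the p99 target above, as a PrometheusRule picked up by the kube-prometheus-stack release label used earlier. The metric name comes from the list above - confirm it against your Qdrant version's /metrics output before relying on it:
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: qdrant-latency
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  groups:
    - name: qdrant
      rules:
        - alert: QdrantSearchP99High
          expr: |
            histogram_quantile(0.99,
              sum(rate(qdrant_rest_responses_duration_seconds_bucket[5m])) by (le)
            ) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Qdrant REST p99 latency above 100ms for 10 minutes"
EOF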
Sizing tiers
| Tier | Vectors | Peers | RAM/peer | Storage/peer | Est. monthly cost (AED, EKS me-central-1) |
|---|---|---|---|---|---|
| Small | <10M | 3 × r6i.xlarge | 32 GB | 500 GB gp3 | ~10,000 |
| Medium | 10M-100M | 3 × r6i.2xlarge + int8 quant | 64 GB | 1 TB gp3 | ~28,000 |
| Large | 100M-1B | 6 × r6i.4xlarge + product quant | 128 GB | 2 TB gp3 | ~90,000 |
| XL | 1B+ | 9+ × r6i.8xlarge + product quant + on_disk | 256 GB | 4 TB io2 | ~280,000 |
Add 20% for snapshot storage and DR cluster.
GCC data sovereignty checklist
For UAE clients:
- Qdrant cluster, snapshot bucket, and any caching/CDN layer in the same in-region cloud zone
- TLS mandatory on 6333/6334, mTLS between peers on 6335
- API key rotation automated (external-secrets + Secrets Manager / Key Vault)
- Snapshot bucket encrypted with customer-managed KMS key
- Audit log from ingress controller shipped to client SIEM
- No Qdrant Cloud and no outbound telemetry - set telemetry_disabled: true in the config
- RBAC: only the RAG application’s ServiceAccount can reach the Qdrant service; no kubectl port-forward exceptions in the prod namespace
Common failure modes we’ve debugged
- Query latency spikes every N minutes - segment optimization is blocking search threads. Lower max_optimization_runtime_threads to 1 or schedule heavy re-indexing off-hours.
- One peer falls out of sync after a node drain - minAvailable: 2 was missing on the PDB, so two peers drained simultaneously. Re-add the PDB.
- OOM kills during bulk upsert - the default indexing_threshold: 20000 is too aggressive at scale. Raise it to 100,000, or set it to 0 to disable indexing during the bulk load and re-enable it afterwards with a collection update (see the sketch after this list).
- Snapshots succeed but restores fail with “invalid snapshot” - the snapshot was downloaded with a client that mangles binary data. Download with curl -o or aws s3 cp; avoid tools that re-encode the response body.
- Recall drops mysteriously after enabling quantization - the quantile: 0.99 default trims outliers too aggressively for some distributions. Try 1.0 and measure.
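The bulk-load fix from that list, spelled out against the collection update endpoint. Collection name and threshold values are the ones used earlier in this guide - adjust to yours:
# Disable indexing before the bulk upsert
curl -sf -X PATCH "https://qdrant.example.ae/collections/documents" \
  -H "api-key: $QDRANT_KEY" -H "Content-Type: application/json" \
  -d '{"optimizers_config": {"indexing_threshold": 0}}'
# ...run the bulk upsert...
# Re-enable indexing and let the optimizer build HNSW in the background
curl -sf -X PATCH "https://qdrant.example.ae/collections/documents" \
  -H "api-key: $QDRANT_KEY" -H "Content-Type: application/json" \
  -d '{"optimizers_config": {"indexing_threshold": 20000}}'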
What this connects to
Qdrant is the retrieval half of a RAG stack. The full production pattern:
- Embedding pipeline - vLLM or text-embeddings-inference serving the embedding model on GPU nodes
- LLM gateway - LiteLLM fronting the generation model
- Observability - Langfuse capturing retrieval + generation traces (see our Langfuse on K8s guide)
- Orchestration - LangGraph, LlamaIndex, or custom RAG service
Our upcoming pillar post stitches these together into a full reference architecture.
Getting help
We deploy and operate Qdrant in production for GCC AI teams running customer-facing RAG - including regulated fintech and government workloads on sovereign cloud. If you want a capacity plan, a topology review, or a migration from Pinecone or Qdrant Cloud, our AI/ML Infrastructure on K8s engagement is the entry point. Typical time to production: 2-4 weeks depending on data volume.
Frequently Asked Questions
How much memory does Qdrant need in production?
Qdrant holds the HNSW index in memory for every collection that is queried frequently. Budget roughly 1.5 KB of RAM per vector for a 768-dimension embedding at default HNSW parameters (m=16, ef_construct=100). A 10M-vector collection therefore needs ~15 GB of RAM just for the index, before payload storage or query working memory. At 100M vectors, use scalar or product quantization to compress the index by 4-32x.
Should I use the Qdrant Operator or the Helm chart?
Use the Qdrant Operator for production. The Helm chart is fine for single-node or static clusters, but the operator handles rolling upgrades with consensus awareness, automatic snapshot scheduling, shard rebalancing, and CRD-driven collection management. The chart requires manual coordination when scaling nodes or recovering from failures. For clusters under five nodes that rarely change, the chart is acceptable; above that, the operator pays for itself within a quarter.
How does Qdrant handle high availability?
Qdrant uses Raft consensus across cluster peers plus per-collection shard replication. Set replication_factor: 2 or 3 on production collections so each shard lives on multiple nodes. Combined with pod anti-affinity across K8s nodes and a minimum of 3 Qdrant peers, a single-node failure causes no downtime or data loss. Replication factor 1 is a common misconfiguration that looks like HA because there are multiple pods but actually has no fault tolerance.
How do I back up Qdrant running on Kubernetes?
Qdrant provides a native snapshot API that writes per-shard snapshots to local disk, which you then replicate to S3. The Qdrant Operator automates this via a QdrantSnapshot CRD with a cron schedule. For deployments without the operator, run a CronJob that calls POST /collections/{name}/snapshots and uploads the output to S3 with server-side encryption. Restore is per-collection, so coordinate multi-collection restores manually. Test the restore path quarterly - several teams we've worked with discovered their backup job had been silently failing for months.
What storage class should I use for Qdrant on EKS, AKS, and GKE?
Use fast NVMe-class SSDs: gp3 (EBS) or io2 on EKS, Premium_LRS or UltraSSD_LRS on AKS, and pd-ssd on GKE. Avoid standard HDD classes or network-attached cold storage - Qdrant performs lots of small random reads during HNSW construction and filtered queries. For extreme latency budgets, use local NVMe with a backup-and-replace strategy instead of network storage, but only if you run replication factor 2+ so you can tolerate node loss.
Can I run Qdrant in a UAE data-sovereign environment?
Yes. Qdrant is open-source and has no external dependencies beyond object storage for snapshots. Deploy it into an in-region Kubernetes cluster (Azure UAE North, AWS Middle East Bahrain or UAE, or a sovereign cloud like Core42) with the snapshot bucket in the same region. Qdrant Cloud hosts data in EU/US/APAC regions and is not suitable for workloads covered by NESA or CBUAE data residency requirements.
Get Started for Free
We would be happy to speak with you and arrange a free consultation with our Kubernetes Expert in Dubai, UAE. 30-minute call, actionable results in days.
Talk to an Expert