Advanced

Vector Database Optimization at Scale:
ANN Indexes, Quantization & Production Tuning

๐Ÿ—ƒ๏ธ Vector Databasesโฑ 18 min read๐Ÿ—“ May 2026

As your vector database grows from thousands to millions or billions of vectors, naive approaches break down. This guide covers the indexing algorithms, quantization techniques, and operational practices that make production vector search fast, accurate, and cost-effective at scale.

The Core Challenge: ANN vs Exact Search

Exact nearest neighbor search compares the query vector against every stored vector, which takes O(n) time per query. With 1 billion vectors of 1536 dimensions, that is computationally infeasible in real time. Approximate Nearest Neighbor (ANN) search trades a small loss in accuracy for an orders-of-magnitude speedup.
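
To put rough numbers on that claim, the sketch below counts the work one brute-force query performs. `brute_force_cost` is a hypothetical helper, and it assumes float32 storage and one multiply-add per dimension per stored vector.

```python
def brute_force_cost(num_vectors: int, dim: int, bytes_per_value: int = 4):
    """Return (multiply_adds, bytes_scanned) for a single exact query."""
    multiply_adds = num_vectors * dim                 # one dot product per stored vector
    bytes_scanned = multiply_adds * bytes_per_value   # every stored value is read once
    return multiply_adds, bytes_scanned

ops, scanned = brute_force_cost(1_000_000_000, 1536)
print(f"{ops:.3e} multiply-adds, {scanned / 1e12:.2f} TB scanned per query")
```

For the 1-billion-vector case this comes to about 1.5 trillion multiply-adds and roughly 6 TB of memory reads for every query, before any sorting of the scores.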

ANN Index Algorithms

HNSW (Hierarchical Navigable Small World)

HNSW is the dominant ANN algorithm in most vector databases. It builds a hierarchical graph in which each node is linked to nearby neighbors. Search descends from the coarse top layer to the fine bottom layer.

# HNSW parameters (using hnswlib directly)
import hnswlib
import numpy as np

dim = 1536
num_elements = 1_000_000

# Create index. Key parameters:
# M: number of connections per node (higher = better recall, more memory)
# ef_construction: search breadth during indexing (higher = better index, slower build)
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(
    max_elements=num_elements,
    ef_construction=200,   # Build quality. 200 is a good starting point.
    M=16                   # Connections per node. 16-64 typical range.
)

# Add elements (`embeddings` is assumed to be a float32 array of shape (num_elements, dim))
index.add_items(embeddings, ids=np.arange(num_elements), num_threads=8)

# Query: ef controls search accuracy (higher = better recall, slower)
index.set_ef(100)  # Must be >= k
labels, distances = index.knn_query(query_vector, k=10)

# Save/load index
index.save_index("my_index.bin")
loaded = hnswlib.Index(space='cosine', dim=dim)
loaded.load_index("my_index.bin", max_elements=num_elements)

HNSW Parameter Tuning Guide
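
Fixed settings rarely transfer between datasets, so one option is to sweep `ef` against your own queries and pick the smallest value that meets a recall target. The sketch below is library-agnostic. `search_fn(query, k, ef)` is a hypothetical callable you would wrap around your index (for hnswlib, a function that calls `set_ef` and then `knn_query`), and `true_neighbors` is assumed to come from an exact search.

```python
from typing import Callable, Dict, Optional, Sequence

def recall_at_k(predicted: Sequence[int], truth: Sequence[int], k: int) -> float:
    """Fraction of the true top-k ids that appear in the predicted top-k."""
    return len(set(predicted[:k]) & set(truth[:k])) / k

def sweep_ef(search_fn: Callable, queries: Sequence,
             true_neighbors: Sequence[Sequence[int]],
             ef_values: Sequence[int], k: int = 10) -> Dict[int, float]:
    """Return mean recall@k for each candidate ef value."""
    results = {}
    for ef in ef_values:
        recalls = [recall_at_k(search_fn(q, k, ef), truth, k)
                   for q, truth in zip(queries, true_neighbors)]
        results[ef] = sum(recalls) / len(recalls)
    return results

def smallest_ef_meeting(results: Dict[int, float], target: float) -> Optional[int]:
    """Smallest ef whose mean recall reaches the target, or None if none does."""
    passing = [ef for ef, recall in sorted(results.items()) if recall >= target]
    return passing[0] if passing else None
```

The same loop works for `M` and `ef_construction`, except that each candidate value requires rebuilding the index, so those sweeps are much slower.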

IVF (Inverted File Index)

IVF clusters vectors into nlist cells around k-means centroids, then at query time searches only the nprobe cells nearest the query. It uses less memory than HNSW but typically reaches lower recall at the same query time.

# FAISS with IVF index
import faiss

dim = 1536
nlist = 1000    # Number of clusters (sqrt(n) is a good starting point)
nprobe = 50     # Clusters to search at query time (higher = better recall, slower)

# Train on a representative sample
quantizer = faiss.IndexFlatIP(dim)   # Exact search for centroid finding
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(training_vectors)         # Must train before adding!
index.add(all_vectors)

# Query
index.nprobe = nprobe
distances, indices = index.search(query_vectors, k=10)
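
To see what nlist and nprobe do mechanically, here is a toy inverted-file search in pure Python. It is a sketch of the idea and not FAISS's implementation: the centroids are fixed by hand instead of learned with k-means, it uses L2 distance where the FAISS example uses inner product, and all function names are illustrative.

```python
import math
from collections import defaultdict

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector id to the inverted list of its nearest centroid."""
    lists = defaultdict(list)
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [vec_id for c in probe for vec_id in lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]

vectors = [(0.0, 0.0), (0.0, 1.0), (6.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
centroids = [(0.0, 0.5), (10.0, 0.5)]   # nlist = 2, fixed by hand for the demo
lists = build_ivf(vectors, centroids)

query = (4.5, 0.0)
print(ivf_search(query, vectors, centroids, lists, k=1, nprobe=1))  # misses id 2
print(ivf_search(query, vectors, centroids, lists, k=1, nprobe=2))  # finds id 2
```

The query sits near a cell boundary. Its true nearest neighbor (id 2) was assigned to the second cell, so probing one cell returns id 0 and probing both returns id 2. Raising nprobe recovers recall in exactly this way, at the cost of scanning more vectors.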

Quantization: Reducing Memory 4–32×

Full float32 vectors use 4 bytes × dimension, or about 6 KB per 1536-dim vector. At 100M vectors, that is roughly 614 GB before any index overhead. Quantization compresses vectors with minimal accuracy loss.
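
The arithmetic is easy to rerun for your own corpus. The helper below is a hypothetical sketch that counts raw vector storage only, in decimal gigabytes, and ignores graph links, centroids, and metadata.

```python
def index_size_gb(num_vectors: int, bytes_per_vector: float) -> float:
    """Raw vector storage in decimal gigabytes, excluding index overhead."""
    return num_vectors * bytes_per_vector / 1e9

dim = 1536
n = 100_000_000
float32_bytes = dim * 4    # 6144 bytes per uncompressed vector

baseline = index_size_gb(n, float32_bytes)
print(f"float32:         {baseline:.1f} GB")
for factor in (4, 32):     # the two ends of the 4–32× range
    print(f"{factor:>2}x compression: {baseline / factor:.1f} GB")
```

At the two ends of the range, the same 100M vectors shrink to about 154 GB and 19 GB respectively.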

Scalar Quantization (SQ): 4× compression

# float32 → int8 quantization (4× memory reduction)
import faiss

dim = 1536
sq_index = faiss.IndexScalarQuantizer(
    dim,
    faiss.ScalarQuantizer.QT_8bit,  # 8-bit = 4× reduction
    faiss.METRIC_INNER_PRODUCT
)
sq_index.train(training_vectors)
sq_index.add(all_vectors)
# Query is identical to regular index
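
To make the idea behind 8-bit scalar quantization concrete, here is a toy version in pure Python. It is a sketch of the general technique, not FAISS's internal code: it assumes a per-dimension min/max range learned from training vectors, and the function names are illustrative.

```python
def train_sq(vectors):
    """Learn a (min, max) range for every dimension from training vectors."""
    dims = range(len(vectors[0]))
    return [(min(v[d] for v in vectors), max(v[d] for v in vectors)) for d in dims]

def encode_sq(vec, ranges):
    """Map each float to an integer code in 0..255 (one byte per dimension)."""
    codes = []
    for x, (lo, hi) in zip(vec, ranges):
        scale = (hi - lo) or 1.0            # avoid dividing by zero on constant dims
        clipped = min(max(x, lo), hi)       # values outside the trained range are clamped
        codes.append(round((clipped - lo) / scale * 255))
    return bytes(codes)

def decode_sq(codes, ranges):
    """Reconstruct approximate floats from the 8-bit codes."""
    return [lo + c / 255 * (hi - lo) for c, (lo, hi) in zip(codes, ranges)]

training = [[0.0, -1.0], [1.0, 1.0], [0.5, 0.0]]
ranges = train_sq(training)
codes = encode_sq([0.25, 0.5], ranges)
print(len(codes), decode_sq(codes, ranges))
```

Each dimension costs one byte instead of four, and for in-range values the reconstruction error is at most half a quantization step, (max - min) / 510. This is also why the training sample needs to be representative: values outside the learned range get clamped.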

Product Quantization (PQ): 8–32× compression

# Product Quantization: highest compression
dim = 1536
M = 96      # Number of sub-quantizers (dim must be divisible by M)
nbits = 8   # Bits per sub-quantizer (8 bits = 256 centroids)

pq_index = faiss.IndexPQ(dim, M, nbits)
pq_index.train(training_vectors)
pq_index.add(all_vectors)

# Or combine HNSW + PQ for best of both worlds (train it before adding vectors, like IndexPQ):
hnsw_pq = faiss.IndexHNSWPQ(dim, M, 32, nbits)  # M PQ segments, 32 HNSW neighbors
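
The compression ratio follows directly from the number of sub-quantizers and the bits per code. The helpers below are a hypothetical sketch of that arithmetic. They count only the per-vector code and ignore the shared codebooks, which are small by comparison at this scale.

```python
def pq_code_bytes(num_subquantizers: int, nbits: int) -> float:
    """Bytes needed to store one PQ code."""
    return num_subquantizers * nbits / 8

def pq_compression_ratio(dim: int, num_subquantizers: int, nbits: int) -> float:
    """How many times smaller a PQ code is than the float32 vector."""
    return (dim * 4) / pq_code_bytes(num_subquantizers, nbits)

for m in (768, 192, 96):
    print(f"M={m}: {pq_code_bytes(m, 8):.0f} bytes/vector, "
          f"{pq_compression_ratio(1536, m, 8):.0f}x smaller")
```

For 1536-dim vectors with 8-bit codes, M=768 gives 8×, M=192 gives 32×, and the M=96 setting used above gives 64× (96 bytes per vector). Fewer sub-quantizers means more compression but coarser distance estimates, so measure recall before settling on a value.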

Matryoshka Embeddings (MRL)

Modern embedding models (OpenAI text-embedding-3, Voyage AI) support Matryoshka Representation Learning: you can truncate the embedding to a shorter dimension while preserving most quality. This lets you choose the precision-vs-cost tradeoff at query time.

# Using truncated OpenAI embeddings (MRL)
from openai import OpenAI
client = OpenAI()

# Full 3072-dim embedding: $0.13/1M tokens
full_response = client.embeddings.create(
    input="Sample text",
    model="text-embedding-3-large"
)
full_embedding = full_response.data[0].embedding  # 3072 dims by default for this model

# Truncated 256-dim embedding: same cost, 12× less storage, ~95% quality
# Just take the first 256 dimensions and re-normalize
# (the API's `dimensions` parameter can also return a shortened vector directly)
import numpy as np
truncated = np.array(full_embedding[:256])
truncated = truncated / np.linalg.norm(truncated)  # Renormalize
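
How much quality survives truncation depends on your data, so it is worth measuring rather than assuming. The sketch below compares exact top-k results at full dimension with results at a truncated dimension. It assumes unit-normalized input vectors, and the function names are illustrative.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` values and rescale to unit length."""
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def top_k_ids(query, vectors, k):
    """Exact top-k by dot product (equals cosine similarity for unit vectors)."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: -scores[i])[:k]

def truncation_overlap(query, vectors, dims, k):
    """Share of the full-dimension top-k that is still in the truncated top-k."""
    full = set(top_k_ids(query, vectors, k))
    short_vectors = [truncate_and_normalize(v, dims) for v in vectors]
    short_query = truncate_and_normalize(query, dims)
    short = set(top_k_ids(short_query, short_vectors, k))
    return len(full & short) / k
```

Averaging `truncation_overlap` over a few hundred real queries at several candidate dimensions gives a direct view of the storage-versus-quality curve for your corpus.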

Incremental Indexing & Updates

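# Problem: HNSW doesn't support deletion well. Use soft deletes.
import hnswlib

class VectorStore:
    def __init__(self, dim=1536, max_elements=1_000_000):
        self.index = hnswlib.Index(space='cosine', dim=dim)
        # init_index must be called before any add_items call
        self.index.init_index(max_elements=max_elements, ef_construction=200, M=16)
        self.deleted_ids = set()
        self.metadata = {}

    def add(self, vectors, ids, metadata):
        self.index.add_items(vectors, ids)
        for id_, meta in zip(ids, metadata):
            self.metadata[int(id_)] = meta

    def delete(self, id_):
        """Soft delete: mark as deleted, filter at query time"""
        self.deleted_ids.add(int(id_))

    def search(self, query_vector, k=10) -> list:
        # Fetch more than needed to account for deleted items, capped at 100
        # candidates and at the number of stored vectors (knn_query raises if
        # it cannot return k results)
        k_fetch = min(k + len(self.deleted_ids), max(k, 100),
                      self.index.get_current_count())
        if k_fetch == 0:
            return []
        self.index.set_ef(max(k_fetch, 100))  # ef must be >= k
        labels, distances = self.index.knn_query(query_vector, k=k_fetch)

        results = []
        for label, distance in zip(labels[0], distances[0]):
            label = int(label)  # numpy integer -> plain int for dict/set lookups
            if label not in self.deleted_ids:
                results.append({
                    "id": label,
                    "score": 1 - float(distance),  # cosine distance -> similarity
                    "metadata": self.metadata[label]
                })
            if len(results) == k:
                break
        return results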

    def rebuild_index(self):
        """Periodic full rebuild to clean up deleted items"""
        active_ids = [id_ for id_ in self.metadata if id_ not in self.deleted_ids]
        # Re-index only active vectors
        ...
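
When to trigger the rebuild is a policy decision. One simple option is to rebuild once soft-deleted entries exceed a fixed share of the index. In the sketch below, the function name and the 20% default are assumptions for illustration, not recommendations from any library.

```python
def should_rebuild(total_ids: int, deleted_ids: int,
                   max_deleted_fraction: float = 0.2) -> bool:
    """True once soft-deleted entries exceed the allowed share of the index."""
    if total_ids == 0:
        return False
    return deleted_ids / total_ids > max_deleted_fraction

print(should_rebuild(total_ids=1_000_000, deleted_ids=250_000))
print(should_rebuild(total_ids=1_000_000, deleted_ids=50_000))
```

With the `VectorStore` class above, the two arguments would be `len(store.metadata)` and `len(store.deleted_ids)`. A lower threshold keeps the over-fetch in `search` small at the price of more frequent rebuilds.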

Benchmarking Your Vector DB

# Measure recall@k and latency
import time
import numpy as np

def benchmark_ann_index(index, true_neighbors, query_vectors, k=10):
    """Measure recall and latency of ANN search."""
    latencies = []
    recalls = []

    for i, query in enumerate(query_vectors):
        start = time.perf_counter()
        predicted, _ = index.knn_query(query.reshape(1, -1), k=k)
        latency = (time.perf_counter() - start) * 1000  # ms
        latencies.append(latency)

        # Recall@k = fraction of true top-k found in predicted top-k
        true_top_k = set(true_neighbors[i][:k])
        predicted_top_k = set(predicted[0])
        recall = len(true_top_k & predicted_top_k) / k
        recalls.append(recall)

    print(f"Mean latency: {np.mean(latencies):.2f}ms (p99: {np.percentile(latencies, 99):.2f}ms)")
    print(f"Mean recall@{k}: {np.mean(recalls):.3f}")
    print(f"Throughput (single-threaded, sequential queries): {1000/np.mean(latencies):.0f} QPS")
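
The benchmark needs `true_neighbors`, which has to come from an exact search over the same vectors. A minimal brute-force sketch is shown below. It assumes unit-normalized vectors, so dot product ranks the same as cosine similarity, and it is only practical on a sample of queries because every query scans the full collection. The function names are illustrative.

```python
def exact_top_k(query, vectors, k):
    """Ids of the k stored vectors with the highest dot product to the query."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    return sorted(range(len(vectors)), key=lambda i: -scores[i])[:k]

def build_ground_truth(queries, vectors, k):
    """One exact top-k id list per query, indexed in the same order as the queries."""
    return [exact_top_k(q, vectors, k) for q in queries]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
queries = [[1.0, 0.0], [0.0, 1.0]]
print(build_ground_truth(queries, vectors, k=2))
```

The ids returned here are positions in `vectors`, so they line up with the ANN results only if the index was built with the same ids. Use the same distance metric for ground truth as for the index, or the recall figure will be misleading.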

Production Operational Practices

The optimization priority order: First, fix your chunking and embedding quality (biggest impact). Second, tune ef/M parameters (recall vs latency tradeoff). Third, apply quantization (cost reduction). Only then optimize operational infrastructure. Most teams optimize in the wrong order.