Infrastructure

Vector Database

Vietnamese: Cơ sở dữ liệu vector

A database optimized to store and query high-dimensional vectors (embeddings), enabling fast semantic similarity search at scale.

A vector database is specialized storage built to handle the kind of queries embeddings require: "find the 10 vectors most similar to this query vector, from among 50 million." Traditional relational databases are not designed for this: a SQL WHERE clause matches exact values and ranges, not semantic similarity.

**How they work**

Vector databases use Approximate Nearest Neighbor (ANN) algorithms — like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) — to index vectors so similarity search runs in milliseconds even over millions of entries, without scanning every vector.
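To make the IVF idea concrete, here is a minimal sketch in NumPy. It is a toy, not how any particular database implements its index: the class name `ToyIVFIndex` and the parameters `n_cells`, `n_iters`, and `n_probe` are assumptions chosen for illustration. Vectors are grouped into cells around centroids, and a query only scores the vectors in the few cells whose centroids are closest to it.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class ToyIVFIndex:
    """Toy inverted-file index: cluster vectors, then search only nearby cells."""

    def __init__(self, vectors, n_cells=8, n_iters=10, seed=0):
        rng = np.random.default_rng(seed)
        self.vectors = normalize(np.asarray(vectors, dtype=np.float64))
        # Start the centroids from randomly chosen data points.
        start = rng.choice(len(self.vectors), size=n_cells, replace=False)
        self.centroids = self.vectors[start].copy()
        # A few rounds of k-means using cosine similarity.
        for _ in range(n_iters):
            assign = np.argmax(self.vectors @ self.centroids.T, axis=1)
            for c in range(n_cells):
                members = self.vectors[assign == c]
                if len(members):
                    self.centroids[c] = normalize(members.mean(axis=0))
        # Inverted lists: for each cell, the ids of the vectors assigned to it.
        assign = np.argmax(self.vectors @ self.centroids.T, axis=1)
        self.cells = [np.flatnonzero(assign == c) for c in range(n_cells)]

    def search(self, query, k=10, n_probe=2):
        q = normalize(np.asarray(query, dtype=np.float64))
        # Probe only the n_probe cells whose centroids are closest to the query.
        probe = np.argsort(-(self.centroids @ q))[:n_probe]
        candidates = np.concatenate([self.cells[c] for c in probe])
        scores = self.vectors[candidates] @ q
        order = np.argsort(-scores)[:k]
        return candidates[order], scores[order]

# Usage on random data (illustrative only).
rng = np.random.default_rng(42)
data = rng.normal(size=(2000, 32))
index = ToyIVFIndex(data, n_cells=16)
ids, scores = index.search(data[0], k=5, n_probe=4)
```

Raising `n_probe` scans more cells, which costs more time but misses fewer true neighbors; probing every cell degenerates into an exact scan.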

**Popular options**

- *Pinecone*: Fully managed, easy to start with, good for production.
- *Weaviate*: Open-source, supports hybrid search (vector + keyword).
- *Qdrant*: Open-source, Rust-based, high performance.
- *pgvector*: Postgres extension that adds vector search to your existing Postgres DB.
- *Chroma*: Lightweight, good for local prototyping.
- *Supabase Vector*: pgvector wrapped in Supabase's developer experience.

**Metadata filtering**

Most vector databases support filtering by metadata alongside the vector search, e.g., "find the 10 most similar documents, but only where category='finance' and date > 2024-01-01." Combining filters with similarity in this way is critical for most real applications. Note that this is distinct from the vector + keyword hybrid search some engines offer.
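The filter-then-rank idea can be sketched in plain NumPy. This is an illustration under stated assumptions, not any database's API: the function `filtered_search`, the `predicate` callback, and the metadata field names `category` and `date` are all hypothetical. The sketch applies the metadata filter first and ranks only the surviving vectors by cosine similarity.

```python
import numpy as np
from datetime import date

def filtered_search(query, vectors, metadata, predicate, k=10):
    """Return (index, score) pairs for the k most similar vectors
    whose metadata dict satisfies predicate."""
    # Pre-filter: keep only the ids whose metadata passes the predicate.
    allowed = np.array(
        [i for i, m in enumerate(metadata) if predicate(m)], dtype=int
    )
    if allowed.size == 0:
        return []
    q = query / np.linalg.norm(query)
    cand = vectors[allowed]
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    scores = cand @ q  # cosine similarity against the filtered subset only
    top = np.argsort(-scores)[:k]
    return [(int(allowed[i]), float(scores[i])) for i in top]

# Toy corpus: four 2-D "embeddings" with hypothetical metadata fields.
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]])
metadata = [
    {"category": "finance", "date": date(2024, 3, 1)},
    {"category": "sports", "date": date(2024, 5, 1)},
    {"category": "finance", "date": date(2024, 2, 1)},
    {"category": "finance", "date": date(2023, 6, 1)},
]

hits = filtered_search(
    np.array([1.0, 0.0]),
    vectors,
    metadata,
    lambda m: m["category"] == "finance" and m["date"] > date(2024, 1, 1),
    k=10,
)
```

Real systems typically push the filter into the index itself rather than scanning metadata in Python, but the semantics are the same: only matching items are eligible to be returned.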

**When to use one**

Use one for any RAG pipeline, semantic search feature, recommendation engine, or duplicate detection system at non-trivial scale. Below roughly 10,000 vectors, a simple in-memory array with cosine similarity in NumPy, or a Postgres table with pgvector, works fine.
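For the small-scale case, an exact brute-force search is only a few lines of NumPy. The function name `top_k_cosine` is an assumption for illustration, and the sketch assumes a non-empty array with no zero-length vectors.

```python
import numpy as np

def top_k_cosine(query, vectors, k=10):
    """Exact top-k cosine similarity by scanning every vector."""
    vectors = np.asarray(vectors, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    sims = (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    k = min(k, len(vectors))
    # argpartition selects the k best without sorting everything;
    # only those k are then sorted by descending similarity.
    top = np.argpartition(-sims, k - 1)[:k]
    top = top[np.argsort(-sims[top])]
    return top, sims[top]

# Usage on a tiny hand-made corpus.
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
ids, sims = top_k_cosine(np.array([1.0, 0.0]), corpus, k=2)
```

Because every vector is scored, the results are exact; the cost grows linearly with corpus size, which is what ANN indexes are designed to avoid.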

**Pitfalls**

ANN algorithms trade accuracy for speed: results are approximate, so some true nearest neighbors can be missed. Index tuning matters: wrong parameters lead to slow queries or poor recall (the fraction of true nearest neighbors that the index actually returns). Data freshness is a concern since re-indexing large corpora takes time.
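One common way to quantify the accuracy side of this trade-off is recall@k: compare the ids an ANN index returns with the ids from an exact search on the same query. The helper below is a minimal sketch; the name `recall_at_k` and the example id lists are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k results that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact & set(approx_ids)) / len(exact)

# Hypothetical results for one query with k = 5.
exact_top5 = [3, 7, 1, 9, 4]
approx_top5 = [3, 7, 9, 2, 5]
recall = recall_at_k(approx_top5, exact_top5)  # 3 of the 5 true neighbors found
```

Averaging this value over a sample of representative queries, while varying index parameters, gives a rough picture of how much accuracy a given speed-up costs.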