Embeddings
Dense numerical vectors that represent the semantic meaning of text, enabling similarity comparisons between pieces of content.
An embedding is a list of floating-point numbers (a vector) that encodes the semantic meaning of a piece of text. Two texts with similar meaning will have vectors that are close together in vector space; unrelated texts will be far apart.
**How they're generated**
Embedding models (like OpenAI's text-embedding-3 or Sentence Transformers) take text as input and output a fixed-length vector — typically 768 to 3,072 numbers. They're trained to push semantically similar texts close together.
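A real embedding model is a trained neural network, but the fixed-length-output contract it follows can be illustrated with a toy stand-in. The sketch below uses the hashing trick to map any text to a vector of constant length; it captures no semantics, and the names `embed` and `DIM` are illustrative, not from any real library.

```python
import hashlib
import math

DIM = 8  # toy size; real models use hundreds to thousands of dimensions

def embed(text: str) -> list[float]:
    """Toy bag-of-words hashing embedder: NOT semantic, but it shows the
    contract every embedding model follows -- any input text, however long,
    maps to a vector of the same fixed length."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        # Hash each token to one of DIM buckets and count occurrences.
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % DIM] += 1.0
    # Normalize to unit length so the vector is ready for cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

print(len(embed("a short sentence")))                            # always DIM
print(len(embed("a much longer sentence with many more words")))  # still DIM
```

A trained model differs in that nearby vectors reflect meaning, not shared hash buckets, but the input/output shape is the same.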
**What "distance" means**
Cosine similarity is the most common measure: a score of 1.0 means identical direction (very similar meaning), 0 means orthogonal (unrelated), and −1 means opposite directions. Dot product and Euclidean distance are also used.
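Cosine similarity is just the dot product of two vectors divided by the product of their magnitudes. A minimal pure-Python version (the function name is illustrative):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot product over
    the product of their magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite)
```

In production you would typically use a vectorized implementation (e.g. NumPy) over pre-normalized vectors, where cosine similarity reduces to a plain dot product.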
**Core use cases**
*Semantic search*: Convert a query and a corpus of documents to embeddings, then rank documents by similarity to the query. Unlike keyword search, this matches meaning — "car" will match "automobile."
*RAG retrieval*: The first step in most RAG pipelines — embed your chunks, store them in a vector database, embed the query, and return the nearest neighbors.
*Clustering and classification*: Group similar content automatically or use embeddings as features for downstream ML models.
*Duplicate detection and recommendations*: Find near-duplicate content or recommend similar articles.
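The retrieval pattern shared by these use cases (embed, store, rank by similarity) can be sketched with a tiny in-memory store. The `VectorStore` class is illustrative, and the vectors are hand-made toys standing in for real embedding-model output; a production system would use a vector database and an actual model.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

class VectorStore:
    """Minimal in-memory vector store: holds (text, vector) pairs and
    returns the top-k texts nearest to a query vector."""
    def __init__(self):
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.items.append((text, vector))

    def search(self, query_vector: list[float], k: int = 2) -> list[str]:
        # Rank all stored items by cosine similarity to the query.
        ranked = sorted(self.items, key=lambda it: cosine(query_vector, it[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

# Hand-made toy vectors standing in for real embedding-model output.
store = VectorStore()
store.add("the car broke down",        [0.90, 0.10, 0.0])
store.add("my automobile won't start", [0.85, 0.15, 0.0])
store.add("recipe for apple pie",      [0.00, 0.10, 0.9])

query = [0.88, 0.12, 0.0]  # pretend this is the embedding of "vehicle trouble"
print(store.search(query, k=2))  # the two car-related texts rank above the recipe
```

The same ranking step powers all four use cases: semantic search returns the ranked texts, RAG feeds them to the model as context, and duplicate detection simply applies a high similarity threshold to the top hit.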
**Pitfalls**
Embeddings capture semantics but not factual correctness. Two contradictory sentences can have similar embeddings if they discuss the same topic. Embeddings are also language-agnostic only to a degree — cross-lingual models vary in quality. And embedding quality depends heavily on the model used; switching models requires re-embedding your entire corpus.