Techniques

Retrieval-Augmented Generation (RAG)

A technique that grounds LLM responses in retrieved external documents, reducing hallucination and enabling up-to-date answers.

Retrieval-Augmented Generation (RAG) is an architecture that adds a retrieval step before the LLM generates a response. Instead of relying solely on knowledge baked into model weights during training, the system searches an external knowledge base, fetches the most relevant passages, and injects them into the prompt as context.

**The two-step flow**

1. *Retrieval*: The user's query is converted to an embedding and compared against a vector database of pre-indexed documents. The top-k most semantically similar chunks are returned.
2. *Generation*: Those chunks are inserted into the prompt alongside the query. The LLM answers based on the provided context rather than (or in addition to) its training knowledge.
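The two steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `embed` function below is a toy bag-of-words stand-in for a real embedding model, and the document list stands in for a vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. A real system would
    # call an embedding model and store vectors in a vector database.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Step 1: rank all chunks by similarity to the query, keep top-k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Step 2: inject the retrieved chunks into the prompt as context.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The refund window is 30 days from purchase.",
    "Our office is closed on public holidays.",
    "Refunds are issued to the original payment method.",
]
query = "How do refunds work?"
prompt = build_prompt(query, retrieve(query, docs))
# `prompt` would now be sent to the LLM for generation.
```

In practice the ranking would be done by an approximate nearest-neighbor index rather than a linear scan, but the shape of the flow — embed, rank, inject, generate — is the same.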

**Why RAG exists**

LLMs have fixed knowledge cutoffs and can confidently hallucinate facts. RAG addresses both: the knowledge base is under your control and can be kept current, and the model can cite the retrieved sources, which reviewers can verify.

**When to use it**

- Internal Q&A over company docs, Notion wikis, or codebases.
- Customer support bots that need accurate product information.
- Any domain where recency and factual accuracy matter more than creativity.

**Pitfalls**

RAG quality is limited by retrieval quality. If the wrong chunks are fetched, the LLM generates wrong answers. Chunking strategy, embedding model choice, and reranking all affect results significantly. Long documents can exceed the context window after retrieval. And RAG adds latency — two round trips instead of one.
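Chunking is one of the pitfalls above that is easy to show concretely. A common baseline is a fixed-size sliding window with overlap, so a sentence split at a chunk boundary still appears whole in at least one chunk. The sizes here are illustrative defaults, not recommendations:

```python
def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    # Fixed-size sliding window. Each chunk repeats the last `overlap`
    # characters of the previous one, reducing boundary losses at the
    # cost of some index bloat.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Real systems often chunk on sentence or section boundaries instead of raw character counts, and tune size against both retrieval precision and the context-window budget mentioned above.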