Fundamentals

Context Window

The maximum number of tokens an LLM can process in a single request, covering both input and output.

The context window is the total number of tokens — input plus output — that an LLM can "see" and process in one request. Everything outside this window is invisible to the model.

**Why it matters**

If your context window is 128K tokens and you paste in a 200K-token document, the model simply cannot process it all. You must chunk the document, summarize it, or use retrieval techniques to work around the limit.
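Chunking can be sketched in a few lines. This is a minimal illustration using a rough ~4-characters-per-token heuristic; a real tokenizer (such as `tiktoken` for OpenAI models) gives exact counts, and production chunkers usually split on sentence or paragraph boundaries rather than raw character offsets.

```python
def chunk_text(text, max_tokens=1000, chars_per_token=4):
    """Split text into chunks that fit a token budget.

    Uses a rough ~4-characters-per-token heuristic; swap in a real
    tokenizer for exact counts.
    """
    max_chars = max_tokens * chars_per_token
    chunks = []
    while text:
        chunks.append(text[:max_chars])
        text = text[max_chars:]
    return chunks

doc = "word " * 10_000                    # ~50,000 characters
chunks = chunk_text(doc, max_tokens=1000)
print(len(chunks))                        # 13 chunks of <= 4,000 characters each
```

Each chunk can then be processed in its own request, with the per-chunk results summarized or merged afterward.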

**Current scale**

Context windows have grown dramatically. GPT-3-era models had 2K–4K tokens (~1,500–3,000 words). GPT-4 Turbo and Claude 3.5 Sonnet offer 128K–200K tokens. Google Gemini 1.5 Pro reached 1M tokens. Even at 200K tokens, that's roughly a 400-page book.
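The book-length estimate follows from two common rules of thumb, both assumptions here: about 0.75 English words per token, and about 375 words per printed page.

```python
tokens = 200_000
words = tokens * 0.75        # ~0.75 words per token (rule of thumb)
words_per_page = 375         # typical paperback page (rough estimate)
pages = words / words_per_page
print(round(pages))          # 400
```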

**Practical limits below the hard limit**

Models degrade in quality as the context fills up — a phenomenon called "lost in the middle," where information at the start and end of long contexts is retrieved more reliably than information in the middle. Staying well under the limit improves accuracy.
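One practical response is to place the most important content where retrieval is most reliable. The helper below is an illustrative sketch (the function and its arguments are hypothetical, not a library API): instructions go at the start, bulk reference material in the middle, and the question restated at the end.

```python
def build_prompt(instructions, documents, question):
    """Assemble a long prompt so critical content sits at the
    start and end, where long-context retrieval is strongest."""
    parts = [instructions]                 # start: reliably attended
    parts.extend(documents)                # middle: bulk context
    parts.append(f"Question: {question}")  # end: reliably attended
    return "\n\n".join(parts)

prompt = build_prompt(
    "Answer using only the documents below.",
    ["Doc 1: ...", "Doc 2: ...", "Doc 3: ..."],
    "What does Doc 2 claim?",
)
```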

**Context vs memory**

Context window is not persistent memory. Each request starts fresh. If you want the model to remember information across sessions, you must re-inject it each time (system prompt, retrieved memories) or use tools that manage state externally.
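Re-injection looks like this in practice. The sketch below uses the common chat-completion message format (a list of role/content dicts); the function name and the memory strings are hypothetical, and a real system would fetch `memories` from a database or vector store.

```python
def build_messages(system_prompt, memories, user_message):
    """Each API request starts from a blank slate, so any state the
    model should 'remember' must be re-sent as part of the request."""
    memory_block = "\n".join(f"- {m}" for m in memories)
    return [
        {"role": "system",
         "content": f"{system_prompt}\n\nKnown facts about the user:\n{memory_block}"},
        {"role": "user", "content": user_message},
    ]

messages = build_messages(
    "You are a helpful assistant.",
    ["Prefers metric units", "Works in aerospace"],
    "How thick is the ozone layer?",
)
```

The "memory" lives entirely outside the model; the application decides what to retrieve and re-inject on every turn.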

**Costs scale with context**

API pricing is typically per token, so cost grows linearly with context length: a 100K-token context costs roughly 100× more in input tokens than a 1K-token context. In production, minimizing context reduces costs meaningfully.
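A back-of-the-envelope cost model makes the scaling concrete. The prices below are illustrative placeholders (dollars per million tokens), not any provider's actual rates; check the current rate card.

```python
def request_cost(input_tokens, output_tokens,
                 in_price_per_m=3.00, out_price_per_m=15.00):
    """Estimate request cost in USD. Prices are placeholder rates
    in dollars per million tokens, not real pricing."""
    return (input_tokens / 1e6) * in_price_per_m \
         + (output_tokens / 1e6) * out_price_per_m

small = request_cost(1_000, 500)     # $0.0105
large = request_cost(100_000, 500)   # $0.3075
print(f"{large / small:.0f}x")       # 29x
```

Output tokens usually cost several times more per token than input tokens, but long contexts are dominated by input cost, which is why context trimming pays off.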