10 Long context

core

A context window of 128K, 256K, or 1M tokens is useful but not free. It is expensive attention, not a free notebook, and not a replacement for retrieval.

Adapted from Ahmad Osman, "LLMs 101: A Practical Guide (2026)".

Long context sounds magical: 128K, 256K, or even 1M tokens in one prompt. It is useful, but it has real costs. More context means more KV cache memory, slower prompt processing, more attention work, harder evaluation, and more ways for irrelevant text to distract the model. Quality can also decay across distance: a model may handle the start and end of a long document well while missing a critical detail buried in the middle.
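To make the memory cost concrete, here is a rough back-of-the-envelope sketch of KV cache size for a standard decoder-only transformer. The model shape (32 layers, 8 key/value heads, head dimension 128, 16-bit values) is a hypothetical example rather than any specific model, and "128K" is taken to mean 128 × 1024 tokens. Real runtimes may use quantized or paged caches that change these numbers.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for one forward context."""
    # Factor of 2: one key vector and one value vector are stored
    # per token, per layer, per KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Hypothetical model shape, chosen only for illustration.
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for label, tokens in [("128K", 128 * 1024),
                      ("256K", 256 * 1024),
                      ("1M", 1024 * 1024)]:
    gib = kv_cache_bytes(tokens, **CONFIG) / 2**30
    print(f"{label:>4} tokens -> {gib:.0f} GiB of KV cache")
# 128K tokens -> 16 GiB of KV cache
# 256K tokens -> 32 GiB of KV cache
#   1M tokens -> 128 GiB of KV cache
```

Under these assumptions the cache grows linearly with context length, so a single 1M-token request needs eight times the cache memory of a 128K-token one, before counting the model weights.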

Long context is good for whole-document analysis, codebase slices, legal or technical review, transcript summarization, multi-file reasoning, and as a fallback when retrieval misses context. But it is a complement to retrieval, not a replacement. Use RAG for large corpora and long context for the final selected evidence.
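The division of labor can be sketched as a two-step pipeline: retrieval narrows a large corpus, and only the selected evidence is packed into the context budget. The example below is a minimal illustration, not a production retriever. The word-overlap score is a toy stand-in for a real ranking method, token counts are approximated by whitespace-separated words, and the names `overlap_score` and `select_evidence` are made up for this sketch.

```python
def overlap_score(query, text):
    """Toy relevance score: number of distinct query words found in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def select_evidence(query, chunks, token_budget):
    """Return the best-scoring chunks that fit within token_budget."""
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c["text"]),
                    reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if overlap_score(query, chunk["text"]) == 0:
            continue  # skip chunks with no apparent relevance
        cost = len(chunk["text"].split())  # crude token estimate
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


corpus = [
    {"id": "doc1#0", "text": "the kv cache grows linearly with context length"},
    {"id": "doc2#0", "text": "bananas are rich in potassium"},
    {"id": "doc3#0", "text": "long context increases kv cache memory and prompt latency"},
]
query = "kv cache memory at long context"

print([c["id"] for c in select_evidence(query, corpus, token_budget=12)])
# ['doc3#0']
print([c["id"] for c in select_evidence(query, corpus, token_budget=20)])
# ['doc3#0', 'doc1#0']
```

A larger budget admits more evidence, but the irrelevant chunk never enters the prompt at any budget, which is the point of retrieving before filling the window.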

A few habits make long context behave. Put critical instructions near the beginning and near the end. Use section headers and delimiters. Ask for citations tied to source chunks. Compress irrelevant history, and use summary memory instead of unbounded chat history. Think of long context as expensive attention, not a free notebook.
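As one possible illustration of these habits, the sketch below assembles a prompt with instructions at both ends, delimited and labeled source chunks, and a citation request, and it compacts older chat turns into a single summary turn. The section markers, the `<source id="...">` delimiter, and the function names are arbitrary choices made for this example. The `summarize` argument is a caller-supplied function (in practice probably a model call), shown here with a trivial stand-in.

```python
def build_prompt(instructions, chunks, question):
    """Lay out a long prompt: instructions first, sources in the middle,
    then the question and a repeated instruction reminder at the end."""
    parts = ["### INSTRUCTIONS", instructions, ""]
    for chunk in chunks:
        # Each chunk gets an explicit delimiter and an id the model can cite.
        parts += [f'<source id="{chunk["id"]}">', chunk["text"], "</source>", ""]
    parts += [
        "### QUESTION", question, "",
        "### REMINDER", instructions,
        "Cite the source id in square brackets after each claim, for example [doc1#0].",
    ]
    return "\n".join(parts)


def compact_history(turns, summarize, keep_last=4):
    """Keep the last keep_last turns verbatim; fold older turns into one summary."""
    if len(turns) <= keep_last:
        return list(turns)
    split = len(turns) - keep_last
    older, recent = turns[:split], turns[split:]
    return [f"Summary of earlier conversation: {summarize(older)}"] + list(recent)


prompt = build_prompt(
    "Answer only from the sources.",
    [{"id": "doc1#0", "text": "The KV cache grows with context length."}],
    "Why does long context cost memory?",
)
print(prompt)

turns = [f"turn {i}" for i in range(6)]
# Stand-in summarizer: a real one would condense the text of the older turns.
print(compact_history(turns, lambda old: f"{len(old)} earlier turns", keep_last=4))
```

Repeating the instructions costs a few tokens but guards against the middle-of-context decay described earlier, and the fixed `keep_last` bound keeps chat history from growing without limit.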

A supported context length is also not the same as being fast, cheap, or accurate at that length; the same caution applies to the context window itself. The next module, on retrieval, is the other half of this story: when the corpus is larger than any context window, retrieval is what selects the evidence the model actually sees.