How LLMs work
The model-side foundation. Start with the loop: tokens in, probabilities out, one next token at a time. Once that clicks, attention, the KV cache, chat templates, decoding, long context, RAG, tool use, and fine-tuning all fall out of the same mechanics. Each module follows the same Read / Probe / Compare / Why-Open / Synthesize structure as the rest of the course.
Adapted, in this site's neutral voice, from Ahmad Osman's "LLMs 101: A Practical Guide (2026)". The hardware and serving half is the Self-host track, which adapts the same author's "Self-hosted LLMs / Local AI" series.
- 01 The inference loop. Tokens in, probabilities out, one next token at a time. Every other decision follows from this loop.
- 02 Tokens and tokenizers. Models read integer token IDs, not words. Token counts, not word counts, set context limits and memory.
- 03 The Transformer. Most chat models are decoder-only Transformers: embeddings, attention, feed-forward blocks, stacked and projected to logits.
- 04 Attention. Attention decides which earlier tokens matter. The variant chosen (MHA, MQA, GQA, MLA) sets the KV-cache bill.
- 05 The KV cache. The model's working memory. It stores the keys and values of earlier tokens so they are not recomputed at every step, which keeps generation usable, and it is the hidden memory bill that grows with every token.
- 06 Prefill and decode. Two regimes with different costs. Prefill processes the prompt (time to first token); decode generates one token at a time (streaming speed).
- 07 Decoding controls. After logits, nothing is written yet. Decoding turns scores into one token, and the knobs change voice, determinism, and risk.
- 08 Model packages and chat templates. Weights are not the whole model. Config, tokenizer, chat template, and generation defaults travel together, and the template is the part most often broken.
- 09 Model types. Base, instruct, chat, reasoning, tool-tuned. Starting with the wrong type is a common reason a capable model feels useless.
- 10 Long context. A context window of 128K, 256K, or 1M tokens is useful but not free. It is expensive attention, not a free notebook, and not a replacement for retrieval.
- 11 RAG: retrieval-augmented generation. Retrieve relevant chunks and give only those to the model. Most bad RAG is bad retrieval and chunking, not a bad model.
- 12 Tool use and agents. Tools make a model useful and change the safety model. A chatbot that hallucinates is annoying; an agent with shell access is dangerous.
- 13 Fine-tuning. LoRA and QLoRA change behavior cheaply, but fine-tuning is the last lever, not the first. Try the template, prompt, model choice, and RAG first.
- 14 Multimodal models. Images, audio, and video become tokens too. The non-text input is a memory cost and a new way to get the template wrong.
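As a preview of module 01, the loop the track starts from (tokens in, probabilities out, one next token at a time) can be sketched in a few lines. This is a toy illustration, not a real model: the vocabulary `VOCAB`, the function `toy_logits` (a hypothetical stand-in for a Transformer forward pass), and the greedy selection rule are all assumptions made for the example.

```python
import math

# Hypothetical five-entry vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ["<eos>", "tokens", "in", "probabilities", "out"]


def toy_logits(token_ids):
    """Stand-in for a model forward pass: return one score per vocabulary entry.

    This toy simply favors the token that follows the last one in VOCAB order.
    """
    next_id = (token_ids[-1] + 1) % len(VOCAB)
    return [4.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=8, eos_id=0):
    """Run the loop: score, pick one token, append it, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids))                          # probabilities out
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        ids.append(next_id)                                       # one token at a time
        if next_id == eos_id:
            break
    return ids


output_ids = generate([1])
print(" ".join(VOCAB[i] for i in output_ids))
```

Greedy selection is only one way to turn probabilities into a token; sampling with temperature or top-p would replace the `max(...)` line, which is the territory of module 07.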