How LLMs work

The model-side foundation. Start with the loop: tokens in, probabilities out, one next token at a time. Once that clicks, attention, the KV cache, chat templates, decoding, long context, RAG, tool use, and fine-tuning all fall out of the same mechanics. Each module follows the same Read / Probe / Compare / Why-Open / Synthesize structure as the rest of the course.
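The loop described above can be sketched in a few lines. This is a toy illustration, not a real model: `VOCAB`, `toy_logits`, and `generate` are hypothetical names invented for this example, and `toy_logits` stands in for the Transformer forward pass by simply favoring the next vocabulary entry. Only the shape of the loop carries over to real systems: score every token, turn scores into probabilities, pick one, append it, repeat.

```python
import math

# Hypothetical five-entry vocabulary; real tokenizers have tens of thousands of entries.
VOCAB = ["<eos>", "the", "cat", "sat", "down"]


def toy_logits(token_ids):
    """Stand-in for a model forward pass: one score per vocabulary entry.

    This toy version just favors the entry after the last token seen.
    """
    favored = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == favored else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=8, eos_id=0):
    """Tokens in, probabilities out, one next token at a time."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids))
        # Greedy decoding: take the most probable token.
        next_id = max(range(len(probs)), key=probs.__getitem__)
        if next_id == eos_id:
            break
        ids.append(next_id)
    return ids


tokens = generate([1])
print(" ".join(VOCAB[i] for i in tokens))  # the cat sat down
```

Greedy selection is used here only because it is the simplest choice; swapping that one line for sampling is where the decoding controls in module 07 come in.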

Adapted, in this site's neutral voice, from Ahmad Osman's "LLMs 101: A Practical Guide (2026)". The hardware and serving half is the Self-host track, which adapts the same author's "Self-hosted LLMs / Local AI" series.

01 The inference loop. Tokens in, probabilities out, one next token at a time. Every other decision follows from this loop.
02 Tokens and tokenizers. Models read integer token IDs, not words. Token counts, not word counts, set context limits and memory.
03 The Transformer. Most chat models are decoder-only Transformers: embeddings, attention, and feed-forward blocks, stacked and projected to logits.
04 Attention. Attention decides which earlier tokens matter. The variant chosen (MHA, MQA, GQA, MLA) sets the KV-cache bill.
05 The KV cache. The model's working memory. It keeps generation fast by avoiding recomputation over earlier tokens, and it is the hidden memory bill that grows with every token.
06 Prefill and decode. Two regimes with different costs. Prefill processes the prompt (time to first token); decode generates one token at a time (streaming speed).
07 Decoding controls. After logits, nothing is written yet. Decoding turns scores into one token, and the knobs change voice, determinism, and risk.
08 Model packages and chat templates. Weights are not the whole model. Config, tokenizer, chat template, and generation defaults travel together, and the template is the part most often broken.
09 Model types. Base, instruct, chat, reasoning, tool-tuned. Starting with the wrong type is a common reason a capable model feels useless.
10 Long context. A context window of 128K, 256K, or 1M tokens is useful but not free. It is expensive attention, not a free notebook, and not a replacement for retrieval.
11 RAG: retrieval-augmented generation. Retrieve relevant chunks and give only those to the model. Most bad RAG is bad retrieval and chunking, not a bad model.
12 Tool use and agents. Tools make a model useful and change the safety model. A chatbot that hallucinates is annoying; an agent with shell access is dangerous.
13 Fine-tuning. LoRA and QLoRA change behavior cheaply, but fine-tuning is the last lever, not the first. Try template, prompt, model, and RAG first.
14 Multimodal models. Images, audio, and video become tokens too. The non-text input is a memory cost and a new way to get the template wrong.
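
As a rough illustration of the "KV-cache bill" mentioned in modules 04 and 05, the sketch below estimates cache size from the usual count of stored values: one key and one value vector per layer, per KV head, per token. The model shape (32 layers, head dimension 128, 16-bit values, 8,192 tokens) is a made-up example rather than any particular model, and `kv_cache_bytes` is a hypothetical helper. MLA is left out because it caches a compressed latent instead of per-head keys and values, so this formula does not apply to it directly.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch=1):
    """Estimate KV-cache size in bytes.

    The leading 2 counts keys and values; bytes_per_value=2 assumes
    16-bit storage (an assumption for this example).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * seq_len * batch


# Hypothetical model with 32 query heads; only the number of KV heads changes.
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(seq_len=8192, n_layers=32,
                         n_kv_heads=kv_heads, head_dim=128) / 2**30
    print(f"{name}: {gib:.3f} GiB")
```

Under these assumptions the cache grows linearly with sequence length and batch size, and sharing KV heads across query heads shrinks it by the same factor as the reduction in KV heads.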