Self-host track
Self-host the stack
A practical companion to the stack walk. Each module follows the same Read / Probe / Compare / Why-Open / Synthesize structure as the main course, applied to one question: "how do I actually run this?"
Inspired by the @TheAhmadOsman "Self-hosted LLMs / Local AI" series: GPU Memory Math, Memory Bandwidth, and LLM Inference Engines.
- 01 GPU memory math: VRAM ≈ parameters × (bits ÷ 8). The one formula that explains every model-fits-or-doesn't question.
- 02 Memory bandwidth: Capacity decides what fits. Bandwidth decides how fast it runs. They are not the same.
- 03 Quantization formats: GGUF, GPTQ, AWQ, NF4, EXL2, EXL3, FP8, FP4, MLX, ONNX. None are interchangeable.
- 04 Inference engines: The traffic cop, memory manager, scheduler, and API surface that turns hardware into served tokens.
- 05 Hardware strategy: Pick a hardware strategy and workload shape first; the engine follows.
- 06 Production serving: Prefill, decode, batching, scheduling, parallelism. The system around the model.
- 07 Benchmarking and operations: Bad benchmark: 180 tok/s. Good benchmark: TTFT, TPOT, p95, cost per million tokens, at your workload shape.
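
As a taste of modules 01 and 02, here is a minimal Python sketch of the memory formula and a rough bandwidth ceiling. The function names and the 1000 GB/s bandwidth figure are illustrative assumptions, not taken from the modules. The first function covers weights only (no KV cache, activations, or runtime overhead). The second assumes single-stream decode is memory-bound and reads every weight once per token, so treat its result as an upper bound, not a prediction.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (10^9 bytes): parameters x (bits / 8).

    Weights only; KV cache, activations, and runtime overhead are not included.
    """
    total_bytes = params_billions * 1e9 * (bits_per_weight / 8)
    return total_bytes / 1e9


def decode_ceiling_tok_s(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream decode speed in tokens per second.

    Assumption: decode is memory-bound and each token requires reading
    all weights once, so speed <= bandwidth / weight size.
    """
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    hypothetical_bandwidth_gb_s = 1000.0  # illustrative figure, not a real GPU spec
    for params_b, bits in [(8, 16), (8, 4), (70, 4)]:
        size_gb = weight_memory_gb(params_b, bits)
        ceiling = decode_ceiling_tok_s(size_gb, hypothetical_bandwidth_gb_s)
        print(f"{params_b}B at {bits}-bit: ~{size_gb:.1f} GB weights, "
              f"at most ~{ceiling:.0f} tok/s")
```

The same 8B model shrinks from roughly 16 GB at 16-bit to roughly 4 GB at 4-bit, and under these assumptions its decode ceiling rises by the same factor. That is capacity and bandwidth acting as two separate constraints.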