Self-host track

Self-host the stack

A practical companion to the stack walk. Each module follows the same Read / Probe / Compare / Why-Open / Synthesize structure as the main course, applied to one question: "how do I actually run this?"

Inspired by the @TheAhmadOsman "Self-hosted LLMs / Local AI" series: GPU Memory Math, Memory Bandwidth, and LLM Inference Engines.

Core path · one sitting

01 GPU memory math: VRAM for the weights, in bytes ≈ parameters × (bits per weight ÷ 8). The one formula behind every model-fits-or-doesn't question.
02 Memory bandwidth: Capacity decides what fits. Bandwidth decides how fast it runs. They are not the same.
03 Quantization formats: GGUF, GPTQ, AWQ, NF4, EXL2, EXL3, FP8, FP4, MLX, ONNX. None are interchangeable.
04 Inference engines: The traffic cop, memory manager, scheduler, and API surface that turn hardware into served tokens.
05 Hardware strategy: Pick a hardware strategy and workload shape first; the engine follows.
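
As a quick preview of module 01, the memory formula can be sketched in a few lines of Python. This is a minimal sketch, not a sizing tool: the function name `weight_vram_gb` and the 7-billion-parameter example are illustrative assumptions, 1 GB is taken as 10^9 bytes, and the estimate covers model weights only (KV cache, activations, and runtime overhead come on top).

```python
def weight_vram_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes).

    Implements VRAM ≈ parameters × (bits ÷ 8). Hypothetical helper for
    illustration; it ignores KV cache, activations, and runtime overhead.
    """
    total_bytes = params_billion * 1e9 * (bits_per_weight / 8)
    return total_bytes / 1e9


# Illustrative example: a 7B-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: ~{weight_vram_gb(7, bits):.1f} GB")
```

Halving the bits per weight halves the weight footprint, which is why the same model can miss a given card at 16-bit and fit comfortably at 4-bit.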

Optional deep-dives · when you have more time