The Open-Source AI Stack

Self-host track

Self-host the stack

A practical companion to the stack walk. Each module follows the same Read / Probe / Compare / Why-Open / Synthesize structure as the main course, applied to the practical "how do I actually run this?" topics.

Inspired by the @TheAhmadOsman "Self-hosted LLMs / Local AI" series: GPU Memory Math, Memory Bandwidth, and LLM Inference Engines.

  1. GPU memory math. VRAM ≈ parameters × (bits ÷ 8). The one formula that explains every model-fits-or-doesn't question.
  2. Memory bandwidth. Capacity decides what fits. Bandwidth decides how fast it runs. They are not the same.
  3. Quantization formats. GGUF, GPTQ, AWQ, NF4, EXL2, EXL3, FP8, FP4, MLX, ONNX. None are interchangeable.
  4. Inference engines. The traffic cop, memory manager, scheduler, and API surface that turns hardware into served tokens.
  5. Hardware strategy. Pick a hardware strategy and workload shape first; the engine follows.
  6. Production serving. Prefill, decode, batching, scheduling, parallelism. The system around the model.
  7. Benchmarking and operations. Bad benchmark: 180 tok/s. Good benchmark: TTFT, TPOT, p95, cost per million tokens, at your workload shape.
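
As a quick illustration of the formula in the first module, here is a minimal sketch that turns VRAM ≈ parameters × (bits ÷ 8) into numbers. The function name `weight_vram_gb` and the 7-billion-parameter model are assumptions chosen for the example, not something the track prescribes. The estimate covers the weights only; KV cache, activations, and runtime overhead come on top, so treat the result as a lower bound.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes).

    Hypothetical helper for illustration: applies
    VRAM ≈ parameters × (bits ÷ 8) and ignores KV cache,
    activations, and framework overhead.
    """
    bytes_per_weight = bits_per_weight / 8
    total_bytes = params_billions * 1e9 * bytes_per_weight
    return total_bytes / 1e9


if __name__ == "__main__":
    # Assumed example: a 7B-parameter model at three common precisions.
    for bits in (16, 8, 4):
        print(f"7B @ {bits}-bit ≈ {weight_vram_gb(7, bits):.1f} GB")
```

For the assumed 7B model this gives roughly 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit, which is why the same model can miss a 12 GB card at full precision and fit comfortably once quantized.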