Glossary

alignment

The training-and-evaluation work of shaping a model's behavior to match human intent, refuse harmful requests, and answer honestly, distinct from raw capability training.

Training also: Safety and Guardrails also: Governance aka alignment tuning

The cluster of techniques that turn a raw pretrained next-token predictor into a model that follows instructions, refuses unsafe requests, and produces answers in human-preferred styles. It is post-trainingtrainingEverything that happens after pretraining ends: supervised fine-tuning, preference optimization, red-teaming, distillation, and safety work that turns a base into a shippable assistant. Open full entry work: the base model already has the capability; alignment adds the behavioral shaping layer.

The component techniques are familiar: supervised fine-tuningtrainingContinued training of a pretrained base model on a smaller, task-specific dataset to specialize its behavior without retraining from scratch. Open full entry on demonstration data, preference learning via RLHF or DPO, red-teaming to surface failure modes, and constitutional approaches that train the model to critique and revise its own outputs against written principles.

The term carries political weight: at frontierweightsThe current capability envelope of AI, defined by the most capable models in deployment at any given time; an evolving label rather than a fixed threshold. Open full entry labs it covers everything from “don’t help build bioweapons” to “answer in a helpful tone.” Critics observe that “alignment” elides real disagreements about what behaviors should be enforced, by whom, and against whose objections. Open-source projects increasingly publish their alignment datasets and training scripts so the choices are auditable.

Sources

Mentioned in

Back to glossary