Glossary

RLHF

A post-training pipeline that uses human preference rankings to train a reward model, then optimizes a base model against that reward via reinforcement learning.

Category: Training. Also: Safety and Guardrails, Governance. Also known as: reinforcement learning from human feedback.

A three-stage recipe. Stage 1: supervised fine-tuning (SFT) on human-written demonstrations. Stage 2: collect pairwise rankings of model outputs from human annotators and train a reward model to predict which response a human would prefer. Stage 3: use a reinforcement learning algorithm (historically PPO) to optimize the SFT model against the reward model, with a KL penalty to keep the policy near the SFT initialization.
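The two quantities at the heart of Stages 2 and 3 can be sketched in a few lines. This is a minimal illustration, not any lab's actual implementation: it assumes the reward model is trained with the common Bradley-Terry pairwise loss, and that the KL penalty is applied per sample as a log-probability ratio scaled by a coefficient `beta`. The function names and the default `beta` are hypothetical.

```python
import math


def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Stage 2 (assumed Bradley-Terry form): -log sigmoid(r_chosen - r_rejected).

    r_chosen and r_rejected are the reward model's scalar scores for the
    response the annotator preferred and the one they rejected.
    """
    margin = r_chosen - r_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_sft: float,
    beta: float = 0.1,  # hypothetical default; tuned per run in practice
) -> float:
    """Stage 3 (assumed form): reward minus beta * (log pi(y|x) - log pi_sft(y|x)).

    The log-ratio is a single-sample estimate of the KL divergence between the
    current policy and the SFT initialization.
    """
    return reward - beta * (logp_policy - logp_sft)
```

When the reward model scores both responses equally, the pairwise loss is log 2 (about 0.693), and it shrinks as the preferred response pulls ahead. In the second function, a policy that assigns a sampled response a higher log-probability than the SFT model did has its reward reduced in proportion to `beta`.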

This was the canonical alignment recipe behind InstructGPT, ChatGPT, and the first wave of Claude and Llama-Chat models. The reward model is what the rest of the pipeline depends on: if it is miscalibrated, RL amplifies the miscalibration, producing “reward hacking” outputs that score high under the reward model but fail in deployment.

DPO and its variants replaced RLHF as the open-source default during 2024 because they skip the reward model and the RL loop entirely, using the pairwise rankings directly as a contrastive training signal. RLHF is still common at frontier labs but rare in open-source fine-tuning stacks.
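To make the contrast concrete, the DPO loss on a single preference pair can be sketched as follows. This is an illustrative sketch, assuming the inputs are summed sequence log-probabilities under the policy being trained and under a frozen reference model (typically the SFT model). The function and parameter names, and the default `beta`, are hypothetical.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # hypothetical default; controls drift from the reference
) -> float:
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    No reward model and no RL rollout: the preference pair itself supplies
    the training signal.
    """
    chosen_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_ratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_ratio - rejected_ratio)
    # softplus(-logits) equals -log(sigmoid(logits)); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

The loss starts at log 2 when the policy equals the reference and falls as the policy raises the preferred response's probability relative to the rejected one. The expression has the same shape as a pairwise reward-model loss, with the scaled log-ratio standing in for the reward, which is why no separate reward model is needed.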

