Glossary
DPO
A preference-tuning method that optimizes a model on pairwise human rankings directly, bypassing the reward-model and reinforcement-learning steps of RLHF.
A reformulation of RLHF in which the implicit reward is the log-ratio
between the trained policy and a frozen reference policy, scaled by a
coefficient β. The training
objective is a simple classification loss over pairs of preferred and
dispreferred responses, which makes it stable and cheap compared to RL.
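The classification loss described above can be sketched for a single pair as follows. This is a minimal illustration, not any library's implementation: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions, and the inputs are assumed to be summed per-sequence log-probabilities already computed by the policy and the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probs.

    All names and the beta default are illustrative assumptions.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # Binary classification loss: -log(sigmoid(margin)),
    # written in a form that avoids overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# A policy identical to the reference gives a zero margin.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# A policy that has shifted probability toward the chosen response.
improved = dpo_loss(-4.0, -9.0, -5.0, -7.0)
```

When the policy equals the reference the margin is zero and the loss is log 2; the loss falls as the policy raises the chosen response's log-probability relative to the rejected one, measured against the reference.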
Practically: feed in a dataset of (prompt, chosen, rejected) triples
collected from human or AI ranking, and run gradient descent. No reward
model to train and no PPO loop; the one knob left is β, which plays the
role of RLHF's KL coefficient. The reference policy is usually the
SFT-initialized model itself.
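As a rough sketch of how a ranking becomes training data, the snippet below expands one best-first ranking of responses into (prompt, chosen, rejected) triples. The helper name `ranking_to_triples` is hypothetical, and the dictionary keys simply mirror the triple named above; real pipelines may filter ties or near-duplicates, which this sketch does not attempt.

```python
from itertools import combinations


def ranking_to_triples(prompt, ranked_responses):
    """Expand a best-first ranking into pairwise preference triples.

    Hypothetical helper: every higher-ranked response is paired as
    'chosen' against every lower-ranked response as 'rejected'.
    """
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        # combinations() preserves input order, so `better` always
        # appears earlier (ranks higher) than `worse`.
        for better, worse in combinations(ranked_responses, 2)
    ]


triples = ranking_to_triples(
    "Explain DPO in one sentence.",
    ["answer A (best)", "answer B", "answer C (worst)"],
)
```

A ranking of k responses yields k(k-1)/2 triples, so three ranked answers produce three training pairs.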
DPO became the default open-source preference-tuning method in 2024
because it works, fits on small hardware, and the Hugging Face TRL
library makes it a one-liner. Variants include KTO (Kahneman-Tversky
Optimization, needs only binary good/bad signals, no pairs) and ORPO
(Odds Ratio Preference Optimization, combines SFT and preference tuning
in one pass, with no reference model).