Glossary

DPO

A preference-tuning method that optimizes a model on pairwise human rankings directly, bypassing the reward-model and reinforcement-learning steps of RLHF.

Category: Training (also: Safety and Guardrails). Also known as: direct preference optimization.

A reformulation of RLHF in which the implicit reward is the log-ratio between the trained policy and a frozen reference policy, scaled by a coefficient β. The training objective is a simple binary classification loss over pairs of preferred and dispreferred responses, which makes training more stable and cheaper than RL.
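Written out, the objective looks roughly as follows. The notation is an assumption chosen for illustration, not taken from this entry: x is the prompt, y_w the preferred response, y_l the dispreferred one, D the preference dataset, σ the logistic function, and β the coefficient that controls how strongly the policy is tied to the reference.

```latex
% Implicit reward (defined up to a term that depends only on the prompt x)
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% DPO loss: logistic classification on the reward margin between the two responses
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}
    \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```

The loss falls as the policy raises the probability of y_w relative to y_l by more than the reference does, which is why no separately trained reward model is needed.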

Practically: feed in a dataset of (prompt, chosen, rejected) triples collected from human or AI ranking, and run gradient descent. There is no reward model to train and no PPO loop; the KL constraint is folded into the loss through a single hyperparameter, β. The reference policy is usually the SFT-initialized model itself.
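The recipe can be sketched on a toy problem. The example below is a minimal illustration, not a real training setup: the "policy" is a hypothetical table of logits, one per (prompt, response) pair, instead of a language model, and the triples, `BETA`, and the learning rate are invented for the demonstration. For a tabular softmax policy the normalizer cancels in the chosen-versus-rejected log-ratio, so the gradient can be written by hand.

```python
import math

# Hypothetical toy data: (prompt, chosen, rejected) triples.
triples = [
    ("How do I reverse a list in Python?", "Use xs[::-1].", "You can't."),
    ("What is 2 + 2?", "4", "5"),
]

BETA = 0.1  # assumed value; controls how strongly the policy is tied to the reference


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_ratio(logits, prompt, chosen, rejected):
    # log pi(chosen | prompt) - log pi(rejected | prompt) for a tabular
    # softmax policy; the softmax normalizer cancels, leaving a logit difference.
    return logits[(prompt, chosen)] - logits[(prompt, rejected)]


def train(policy, reference, data, steps=50, lr=1.0, beta=BETA):
    """Plain gradient descent on the DPO loss; returns the mean loss per step."""
    history = []
    for _ in range(steps):
        total = 0.0
        for p, c, r in data:
            z = beta * (log_ratio(policy, p, c, r) - log_ratio(reference, p, c, r))
            total += -math.log(sigmoid(z))   # DPO loss for this pair
            g = beta * (1.0 - sigmoid(z))    # magnitude of dLoss/dlogit
            policy[(p, c)] += lr * g         # push the chosen response up
            policy[(p, r)] -= lr * g         # push the rejected response down
        history.append(total / len(data))
    return history


# The reference is a frozen copy of the initial policy (standing in for the SFT model).
reference = {}
for p, c, r in triples:
    reference[(p, c)] = 0.0
    reference[(p, r)] = 0.0
policy = dict(reference)

history = train(policy, reference, triples)
print(f"first-step loss {history[0]:.4f}, last-step loss {history[-1]:.4f}")
```

The first-step loss equals log 2 because the policy starts identical to the reference, so every pair is a coin flip. In a real run the logits would come from summed token log-probabilities of the model and its frozen copy, and an optimizer would replace the hand-written update.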

DPO became the default open-source preference-tuning method in 2024 because it works, fits on small hardware, and is a one-liner with the Hugging Face TRL library. Variants include KTO (Kahneman-Tversky Optimization, which needs only binary good/bad signals rather than pairs) and ORPO (Odds Ratio Preference Optimization, which combines SFT and preference tuning in one pass, without a reference model).

