Glossary

constitutional AI

Anthropic's alignment technique in which a model is trained to critique and revise its own outputs against a written list of principles (the "constitution"), reducing the need for human ranking labels.

Safety and Guardrails; also: Training; aka: CAI

A training recipe with two stages. In the supervised learning (SL) stage, the model generates a response, critiques it against the constitution (for example, “would a thoughtful reader find this response unhelpful?”), and revises the response based on that critique; the revised version becomes the supervised fine-tuning (SFT) target. In the RL stage, the model ranks pairs of responses against the constitution to produce a preference dataset, which then trains a reward model used in the same reinforcement learning loop as RLHF, with AI-generated preference labels standing in for human rankings.
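The two stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `Model` is a hypothetical stand-in for any text-in, text-out model call, the prompt templates and the two sample principles are invented for this example (they are not Anthropic's actual prompts or constitution text), and `toy_model` is a deterministic stub so the sketch runs without a real model.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a language model call: prompt text in, text out.
Model = Callable[[str], str]

# Illustrative principles only; not quoted from any published constitution.
CONSTITUTION: List[str] = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that assist with dangerous or illegal activity.",
]


def sl_stage_example(model: Model, prompt: str, principle: str) -> Tuple[str, str]:
    """SL stage: draft, self-critique, revise. Returns one (prompt, SFT target) pair."""
    draft = model(prompt)
    critique = model(
        f"Prompt: {prompt}\nResponse: {draft}\n"
        f"Critique the response against this principle: {principle}"
    )
    revision = model(
        f"Prompt: {prompt}\nResponse: {draft}\nCritique: {critique}\n"
        "Rewrite the response to address the critique."
    )
    return prompt, revision


def rl_stage_label(model: Model, prompt: str, a: str, b: str, principle: str) -> Dict[str, str]:
    """RL stage: the model picks the better of two responses, yielding one preference record."""
    verdict = model(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {a}\n(B) {b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    a_wins = verdict.strip().upper().startswith("A")
    return {
        "prompt": prompt,
        "chosen": a if a_wins else b,
        "rejected": b if a_wins else a,
    }


def toy_model(text: str) -> str:
    """Deterministic stub (an assumption for this sketch) standing in for a real model."""
    if "Answer A or B" in text:
        return "B"
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the draft ignores the principle"
    return "draft answer"


sft_pair = sl_stage_example(toy_model, "How do I pick a lock?", CONSTITUTION[1])
preference = rl_stage_label(
    toy_model, "How do I pick a lock?", "draft answer", "revised answer", CONSTITUTION[1]
)
```

In a real pipeline the `sft_pair` records would be collected into a fine-tuning set and the `preference` records into the dataset that trains the reward model; the sketch stops short of both training steps.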

The constitution is a small set of plain-English principles (Claude’s public constitution lists several dozen). The key shift is that the behavioral specification moves from labor-intensive, per-example human ranking to a relatively concise written document that the team can edit and audit.

Beyond Anthropic, related AI-feedback techniques (RLAIF, principle-guided synthetic preference data) show up in open post-training recipes, including AI2’s Tulu line and Hugging Face’s Zephyr family, though those don’t always adopt the exact “constitution” framing. The open question for governance is whether a publishable constitution is itself a meaningful accountability mechanism, or whether the harder work lies in the (often unpublished) red-teaming and refinement that determine what the constitution actually achieves.


Mentioned in

layer Safety and Guardrails
