Safety and Guardrails

What it is

Overview

This layer covers the tools and techniques for constraining what a model or agent is allowed to do. It is distinct from the trust-and-identity layer, which is about verifying what happened; this layer is about preventing certain outcomes in the first place.

Five things to keep in mind as you read:

Safety applies at three points in the pipeline. Training time, inference time, agent runtime. Different tools at each.
Training-time safety shapes the base model. Constitutional AI, RLHF, DPO, RLAIF. The model “learns” what not to do.
Inference-time safety filters at the API boundary. Llama Guard, NeMo Guardrails, Granite Guardian.
Agent-runtime safety is the newest area. Sandboxing and sandbox-escape evals, capability tokens, action-policy enforcement.
External evaluators are the load-bearing accountability layer. AISI, METR, Apollo Research. Without them, safety claims are uncheckable.

The rest of this page walks through the three pipeline points, then the evaluator counterweight, and ends at the verifiability question.

Training-time safety

What the lab does to shape the base model’s behavior before release.

The major techniques:

RLHF (Reinforcement Learning from Human Feedback): the foundational technique. Human raters rank model outputs, a reward model learns from the rankings, and the base model is fine-tuned against the reward model. Christiano et al. 2017 (Deep RL from Human Preferences) is the canonical reference; OpenAI’s InstructGPT (2022) was the first widely deployed application to language models.
Constitutional AI (Anthropic): uses an explicit “constitution” (a list of principles) plus model-generated critique and revision instead of pure human feedback. Reduces the human labeling burden (Constitutional AI paper).
DPO (Direct Preference Optimization, Rafailov et al. 2023): skips the explicit reward model and optimizes the policy directly against preference pairs. Much cheaper than RLHF because there is no separate reward model to train and no reinforcement-learning loop to run (DPO paper). The default in the open fine-tuning community in 2025-2026.
RLAIF (RL from AI Feedback): the preference labeler is another model instead of a human. Used in combination with Constitutional AI.
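To make the "no reward model" point about DPO concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model. The function and parameter names are illustrative, not taken from any particular library, and a real training run would compute this over batches with an autodiff framework.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response.
    beta controls how strongly the policy is held near the reference.
    """
    # Implicit rewards: how much more the policy favors each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

When the policy is identical to the reference, the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the rejected one. The preference data enters the loss directly, which is why no separate reward model appears anywhere in the sketch.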

The open ecosystem has all of these techniques implemented in the standard libraries (Hugging Face’s TRL, the Axolotl recipes). The closed labs have more sophisticated internal variants (Anthropic’s HHH framework, OpenAI’s deliberative alignment) that don’t get fully published. The gap between published techniques and internal techniques is hard to estimate; the labs assert it’s substantial.

Inference-time safety (guard models)

Filters at the API boundary. The pattern: before sending a prompt to a model, classify it with a small guard model; if it is classified as a violation, refuse. The same check runs on outputs before they are returned to the user.
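The pattern can be sketched as a thin wrapper around the model call. Everything here is a stand-in: `guarded_generate`, `Verdict`, and `toy_classify` are hypothetical names, and the keyword check only takes the place of a real guard model, which would be a classifier call.

```python
from dataclasses import dataclass
from typing import Callable, Optional

REFUSAL = "Sorry, I can't help with that request."


@dataclass
class Verdict:
    safe: bool
    category: str = ""  # harm-taxonomy label when unsafe


# classify(prompt, response) -> Verdict; response is None for the input check.
Classifier = Callable[[str, Optional[str]], Verdict]


def guarded_generate(prompt: str,
                     generate: Callable[[str], str],
                     classify: Classifier) -> str:
    """Run the guard on the prompt, call the model, then guard the output."""
    # Input filter: refuse before the main model ever sees the prompt.
    if not classify(prompt, None).safe:
        return REFUSAL
    response = generate(prompt)
    # Output filter: the response is judged in the context of its prompt.
    if not classify(prompt, response).safe:
        return REFUSAL
    return response


def toy_classify(prompt: str, response: Optional[str] = None) -> Verdict:
    """Keyword stand-in for a guard model (assumption, for illustration only)."""
    text = (response if response is not None else prompt).lower()
    if "build a bomb" in text:
        return Verdict(safe=False, category="violent_harm")
    return Verdict(safe=True)
```

For example, `guarded_generate("hello", lambda p: "hi there", toy_classify)` passes both checks and returns the model's reply. Swapping `toy_classify` for a call to an actual guard model leaves the control flow unchanged.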

The major open guards:

Llama Guard (Meta): currently at Llama Guard 4. Trained to classify prompts and responses against a harm taxonomy. Released under Meta’s Llama 4 Community License, not a standard open-source license (Llama Guard 4 on HuggingFace).
NeMo Guardrails (NVIDIA): a programmable safety framework that lets you define guardrails as code in the Colang DSL; broader than just a guard model (NeMo Guardrails repository).
Granite Guardian (IBM): IBM Research’s open guard model family, comparable in scope to Llama Guard (Granite Guardian release).
ShieldGemma (Google): the Gemma-family guard, used in the Google ecosystem (ShieldGemma model card).

The closed equivalents are inside the API surface of the major labs (Anthropic’s classifier, OpenAI’s moderation API). The open guards are at rough parity for the categories that have public training data (CSAM, violent extremism, self-harm); the closed labs are typically ahead on categories that require careful definitions (CBRN, manipulation, deception).

Agent-runtime safety

The newest area. As agents start executing arbitrary code, browsing the web, and taking actions on user accounts, the relevant safety question shifts from “did the model output something bad?” to “did the agent do something the user didn’t intend?”.

The techniques:

Sandboxing: agents run in isolated containers or VMs that limit what they can affect. Standard for code-execution agents (OpenHands runs in a Docker sandbox; Goose can run in one).
Sandbox-escape evals: explicit testing of whether an agent will try to break out of its sandbox under adversarial conditions. METR has published the most systematic public work here (METR sandbox-escape evaluations).
Capability tokens: the agent must present a token authorizing each action. Lets the user or platform constrain the agent’s reach in fine-grained ways. AAIF (Layer 13) is the protocol-level expression of this pattern.
Action policies: declarative rules about what the agent can and cannot do (e.g., “can read the filesystem; cannot reach the network outside an allowlist; cannot run shell commands matching X”).
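The action-policy example in the last item can be sketched as a small default-deny checker. This is an illustrative sketch only: `ActionPolicy`, `is_allowed`, and the action dictionary shape are hypothetical and do not come from any of the products named above.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from urllib.parse import urlparse


@dataclass(frozen=True)
class ActionPolicy:
    allow_fs_read: bool = True
    allow_fs_write: bool = False
    network_allowlist: frozenset = frozenset()  # permitted hostnames
    denied_shell_patterns: tuple = ()           # glob patterns, e.g. "rm *"


def is_allowed(policy: ActionPolicy, action: dict) -> bool:
    """Return True only if the policy explicitly permits the action."""
    kind = action.get("kind")
    if kind == "fs_read":
        return policy.allow_fs_read
    if kind == "fs_write":
        return policy.allow_fs_write
    if kind == "network":
        host = urlparse(action["url"]).hostname
        return host in policy.network_allowlist
    if kind == "shell":
        command = action["command"]
        return not any(fnmatchcase(command, pattern)
                       for pattern in policy.denied_shell_patterns)
    # Unknown action kinds are denied by default.
    return False
```

The runtime would call `is_allowed` before executing each tool call and refuse anything that returns False. Pattern-matching on shell strings is easy to evade, so a check like this is better treated as one layer on top of sandboxing than as a replacement for it.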

The open ecosystem here is largely the agent-product community (Goose, OpenHands, Aider) implementing sandboxing as part of their products. Generic agent-runtime-safety frameworks (comparable to Llama Guard for outputs) don’t really exist yet.
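In the absence of a general framework, the capability-token technique described above can still be illustrated in a few lines. The sketch below uses an HMAC-signed token that names one action on one resource and carries an expiry. The token format and the `issue_token` and `authorize` functions are assumptions made for illustration; they are not the AAIF protocol or any product's actual scheme.

```python
import base64
import hashlib
import hmac
import json
import time


def issue_token(secret: bytes, action: str, resource: str,
                ttl_seconds: float, now: float = None) -> str:
    """Platform side: sign a grant for one action on one resource."""
    now = time.time() if now is None else now
    claims = {"action": action, "resource": resource, "exp": now + ttl_seconds}
    payload = json.dumps(claims, sort_keys=True).encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + signature


def authorize(secret: bytes, token: str, action: str, resource: str,
              now: float = None) -> bool:
    """Runtime side: verify the signature, scope, and expiry before acting."""
    now = time.time() if now is None else now
    try:
        payload_b64, signature = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64.encode())
    except ValueError:
        return False  # malformed token
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking signature bytes.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    claims = json.loads(payload)
    return (claims["action"] == action
            and claims["resource"] == resource
            and now < claims["exp"])
```

The agent holds only the token, never the secret, so it cannot widen its own grant: changing the action or resource in the payload invalidates the signature. A production design would also need revocation and key rotation, which this sketch leaves out.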

The evaluator counterweight

The accountability layer. Three organizations do most of the externally credible safety evaluation of frontier models; a fourth entry, an industry-consortium benchmark, rounds out the list.

AISI (UK AI Safety Institute, renamed the AI Security Institute in 2025): runs pre-release evaluations on frontier models from Anthropic, OpenAI, and Google DeepMind under voluntary agreements. The best-funded public safety-evaluation org (AISI blog).
METR (Model Evaluation and Threat Research): focused on autonomy and agentic capabilities, specifically the question “is this model capable of taking dangerous actions over long horizons?” (METR research).
Apollo Research: focused on deception, scheming, and in-context reasoning failures. The most visible public work on “does the model deceive its evaluators?” (Apollo research).
MLCommons AILuminate: an industry-consortium safety benchmark, less rigorous than AISI / METR but standardized (AILuminate).

These are the closest the field has to independent verification of safety claims. Their methods are public; the specific results from specific model-lab pairs are usually under NDA. The arrangement is uncomfortable but is currently the only external scrutiny that exists at all.

What’s open and what isn’t

Open training-time techniques: RLHF, DPO, Constitutional AI are all published and have open implementations. The closed-lab variants are more sophisticated and unpublished.
Open inference-time guards: Llama Guard, NeMo Guardrails, Granite Guardian, ShieldGemma. Cover the standard harm taxonomy categories.
Closed inference-time guards: the classifier stacks inside Anthropic’s, OpenAI’s, and Google’s API surfaces.
Open agent-runtime safety: ad-hoc, mostly per-product sandboxing. No general framework yet.
Open evaluator work: AISI, METR, Apollo publish methods and aggregate results. Specific lab-pair results typically not.
Closed lab internal safety stacks: the most substantial safety work in the field, almost entirely unpublished by design.

The reverse-lock-in risk is that closed-lab safety claims cannot be independently verified. “Our model is safe because our internal evals say so” is a claim the public has no way to check. The AISI / METR / Apollo work is the partial answer; it depends on lab cooperation.

The editorial tension

The argument for closed safety stacks is principled: detailed safety techniques (how to red-team for CBRN uplift, how to detect deception, what specifically the model knows about bioweapons) are themselves dangerous information, and publishing them helps adversaries as much as defenders. The labs that take this position are not unreasonable; the calculation about disclosure has different weights for safety-relevant information than for capability-relevant information.

The argument for open safety stacks is verification. A safety claim that no external party can check is just marketing. The external evaluators (AISI / METR / Apollo) provide only partial verification: they depend on labs voluntarily participating and don’t get to publish the most damaging findings.

The strategic question is whether the partial-verification arrangement scales as frontier capabilities grow, or breaks down at some point and leaves the field needing a more confrontational disclosure regime. The 2026 trajectory is the partial one; whether it survives the first major capability surprise that external evaluators didn’t predict is the open question.
