Glossary

Llama Guard

Meta's open content-moderation model line, designed to classify prompts and responses against a configurable taxonomy of harms, deployable as an input/output filter.

Safety and Guardrails. See also: Weights

A small open model trained to classify text against a taxonomy of unsafe categories (violence, hate, sexual content, weapons, criminal planning, etc.). The intended deployment pattern is filtering both user prompts and model responses: a Llama Guard call runs before and after each generation, blocking or flagging prompts and outputs that violate the configured policy.
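The input/output filtering pattern can be sketched as follows. This is a minimal illustration, not Meta's reference implementation: `Verdict`, `guarded_generate`, `keyword_classifier`, and the refusal text are hypothetical names, and the keyword check is only a stand-in for a real Llama Guard call.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    """Result of one classification call (hypothetical shape)."""
    safe: bool
    category: Optional[str] = None


# A classifier takes (role, text) and returns a Verdict.
# In a real deployment this would call the moderation model.
Classifier = Callable[[str, str], Verdict]


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    classify: Classifier,
    refusal: str = "Sorry, I can't help with that.",
) -> Tuple[str, Verdict]:
    """Check the prompt, generate, then check the response."""
    pre = classify("user", prompt)
    if not pre.safe:
        # Blocked before the main model is ever called.
        return refusal, pre

    response = generate(prompt)

    post = classify("assistant", response)
    if not post.safe:
        # The generated text is discarded and replaced.
        return refusal, post

    return response, post


def keyword_classifier(role: str, text: str) -> Verdict:
    """Toy stand-in for the moderation model (assumption, not the real thing)."""
    if "build a weapon" in text.lower():
        return Verdict(safe=False, category="weapons")
    return Verdict(safe=True)


def echo_model(prompt: str) -> str:
    """Toy stand-in for the main generation model."""
    return f"You asked: {prompt}"
```

A team that prefers flagging to blocking could return the original response together with the verdict and leave the decision to the caller.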

The taxonomy is editable, and the model handles instruction-style policy definitions reasonably well, so a team can configure “acceptable” per their own use case rather than accepting Meta’s defaults. Llama Guard 3 (July 2024) expanded language coverage to eight languages; the separately released Llama Guard 3 Vision (November 2024) added image-input classification.
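An editable taxonomy might be handled as in the sketch below, which renders a team-defined category list into a policy block and parses the classifier's reply. The `Category` type, the delimiter strings, and the category codes are illustrative assumptions. The parser assumes the reply is either `safe`, or `unsafe` followed by a line of comma-separated category codes; check the model card of the version you deploy for the exact prompt and output format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Category:
    """One entry in a team-defined taxonomy (hypothetical shape)."""
    code: str
    name: str
    description: str


def render_policy(categories: List[Category]) -> str:
    """Render the taxonomy as an instruction-style policy block.

    The delimiters here are an assumed layout, not an official template.
    """
    lines = ["<BEGIN UNSAFE CONTENT CATEGORIES>"]
    for cat in categories:
        lines.append(f"{cat.code}: {cat.name}.")
        lines.append(cat.description)
    lines.append("<END UNSAFE CONTENT CATEGORIES>")
    return "\n".join(lines)


def parse_verdict(reply: str) -> Tuple[bool, List[str]]:
    """Parse a reply into (is_safe, violated_codes).

    Assumes 'safe', or 'unsafe' followed by a line of comma-separated codes.
    """
    lines = [ln.strip() for ln in reply.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty classifier reply")
    label = lines[0].lower()
    if label == "safe":
        return True, []
    if label == "unsafe":
        codes: List[str] = []
        if len(lines) > 1:
            codes = [c.strip() for c in lines[1].split(",") if c.strip()]
        return False, codes
    raise ValueError(f"unrecognized classifier reply: {reply!r}")


# Example: a custom policy with two team-specific categories.
custom_policy = [
    Category("C1", "Medical dosing advice",
             "Content that recommends specific drug dosages."),
    Category("C2", "Competitor disparagement",
             "Content that makes negative claims about named competitors."),
]
```

Keeping the taxonomy as data rather than a hard-coded prompt string makes it easier to review policy changes separately from code changes.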

Sources

Llama Guard documentation
