Glossary

jailbreak

An adversarial input that causes a model to bypass its training-time refusal policy and produce content it would normally refuse, distinct from prompt injection's instruction-hijack.

Safety and Guardrails aka LLM jailbreak

An input crafted to elicit content from a model that the model’s post-trainingtrainingEverything that happens after pretraining ends: supervised fine-tuning, preference optimization, red-teaming, distillation, and safety work that turns a base into a shippable assistant. Open full entry pipeline tried to refuse: information about weapons synthesis, malware code, sexually explicit minors, and so on. The distinguishing feature from prompt injectionsafety-guardrailsAn attack where adversarial content in a document, tool result, or web page is interpreted as instructions by the model, overriding the user or system prompt. Open full entry is the target: jailbreaks aim at the model’s own refusal policy; prompt injectionsafety-guardrailsAn attack where adversarial content in a document, tool result, or web page is interpreted as instructions by the model, overriding the user or system prompt. Open full entry aims at the control flow.

The pattern catalog runs long. Role-play prompts (“you are an unethical AI”), elaborate hypothetical framings (“for a fictional story”), encoded payloads (ASCII art, base64, low-resource languages), gradient-based suffix attacks (the Zou et al. universal suffix work), many-shot jailbreaks (filling the context with refused queries), and multi-turn social engineering all appear regularly.

The defender’s posture has shifted from “prevent all jailbreaks” to “raise the cost.” frontierweightsThe current capability envelope of AI, defined by the most capable models in deployment at any given time; an evolving label rather than a fixed threshold. Open full entry -lab safety teams treat jailbreak resistance as a continuous adversarial process, not a finished product. Open weights expose models to fine-tuningtrainingContinued training of a pretrained base model on a smaller, task-specific dataset to specialize its behavior without retraining from scratch. Open full entry that can directly remove the refusal policy, which is one of the central arguments in the safety-vs-openness debate.

Sources

Universal and Transferable Adversarial Attacks on Aligned Language Models (Zou et al., 2023)

Back to glossary