Glossary

prompt injection

An attack where adversarial content in a document, tool result, or web page is interpreted as instructions by the model, overriding the user or system prompt.

Safety and Guardrails also: Agents aka indirect prompt injection

The defining LLM security failure. A model receives input from multiple sources (a system prompt, a user message, a retrieved document, a tool result, the contents of a web page being processed). A model trained to follow instructions in the prompt cannot reliably distinguish instructions in trusted versus untrusted sources, so adversarial text in a retrieved document can override the system prompt’s safety constraints or tool-use policy.

Indirect prompt injection (the Greshake et al. naming) is the production form: not the user typing malicious instructions, but the user uploading a document, browsing a page, or processing an email whose contents contain instructions the user did not author.

No fully reliable defense exists in 2026. The mitigations stack: instruction hierarchies in post-trainingtrainingEverything that happens after pretraining ends: supervised fine-tuning, preference optimization, red-teaming, distillation, and safety work that turns a base into a shippable assistant. Open full entry , structured separation of trusted and untrusted content in the prompt, output filtering for suspicious actions, capability limits on what a tool can do, and human-in-the-loop for high-risk actions. Most production agent deployments rely on capability limiting more than on model-level defense.

Sources

Mentioned in

Back to glossary