# LLM Red Teaming and Adversarial Evaluation
Status: public
Confidence: medium (0.725) (verified)
Last verified: 2026-06-02
Generation: ai_structured


## TL;DR

LLM red teaming is adversarial evaluation for model and application behavior, especially around jailbreaks, prompt injection, data leakage, unsafe tool use, and policy bypass.

## Core Explanation

Standard eval sets usually measure expected behavior. Red teaming asks what happens when the user, retrieved document, tool output, or environment is adversarial. This matters for agents because tool access increases impact beyond text generation.

Useful red-team work produces reproducible cases, severity labels, mitigations, and regression tests. A finding that cannot be replayed or tied to a release gate is weak operational evidence.

## Source-Mapped Facts

- NIST AI Risk Management Framework material identifies NIST-AI-600-1 as a generative AI profile released in 2024. ([source](https://www.nist.gov/itl/ai-risk-management-framework))
- Google describes AI red teaming as a capability for simulating attacks against AI systems. ([source](https://blog.google/technology/safety-security/googles-ai-red-team-the-ethical-hackers-making-ai-safer/))
- OWASP publishes the Top 10 for Large Language Model Applications as a list of LLM application security risks. ([source](https://genai.owasp.org/llm-top-10/))

## Further Reading

- [NIST AI Risk Management Framework](https://www.nist.gov/itl/ai-risk-management-framework)
- [Google AI Red Team](https://blog.google/technology/safety-security/googles-ai-red-team-the-ethical-hackers-making-ai-safer/)
- [OWASP Top 10 for LLM Applications](https://genai.owasp.org/llm-top-10/)