AI Security · · 1 min read

AI Red Teaming: How We Attack AI Systems Before Someone Else Does

AI red teaming is structured adversarial testing of an AI system: we try to make your model leak data, ignore its guardrails, and abuse its tools, then hand you reproducible findings and fixes.

Vadim L Vadim L
cyber72 · operator
AI Red Teaming: How We Attack AI Systems Before Someone Else Does
On this page 15
  1. How is AI red teaming different from a normal pentest?
  2. What we actually test
  3. Prompt injection
  4. Jailbreaks and guardrail bypass
  5. Data extraction and leakage
  6. Model abuse and excessive agency
  7. Bias and safety failures
  8. Our methodology, start to finish
  9. 1. Scoping
  10. 2. Threat modeling with MITRE ATLAS
  11. 3. Manual and automated adversarial testing
  12. 4. Reporting with reproducible findings
  13. How does AI red teaming map to compliance?
  14. When should you red-team an AI system?
  15. FAQ

AI red teaming is structured adversarial testing of an AI system, where a dedicated team deliberately tries to make the model misbehave: leak training data or secrets, ignore its own guardrails, follow malicious instructions hidden in content it reads, or abuse the tools it can call. The goal is the same as any red team engagement. Find the ways the system fails under attack before a real adversary does, then prove each finding is real and hand the owner a fix. The difference is the target. Instead of a network or a web app, you’re attacking a probabilistic model whose behavior shifts with the input and can’t be fully predicted from the code.

This is the work we do most often now, and it’s the most natural fit for an offensive team. The skill that matters in AI red teaming is the same one that matters in pen testing: thinking like an attacker who doesn’t care about your intended use case. The model was trained to be helpful. We get paid to make that helpfulness work against you.

How is AI red teaming different from a normal pentest?

A traditional pentest targets deterministic systems. The same request returns the same response, vulnerabilities map to known classes like SQL injection or broken access control, and a finding either reproduces or it doesn’t. AI red teaming breaks most of those assumptions.

The target is non-deterministic. Send the same jailbreak twice and the model might refuse once and comply the next time. A finding isn’t a clean true-or-false; it’s a success rate. The attack surface is the language itself, so the input space is effectively infinite and you can’t enumerate it the way you’d enumerate parameters in an API. And the most damaging issues often aren’t memory-safety bugs but behavioral ones: the model does exactly what an attacker asked, just not what its owner wanted.

So the methodology adapts. We still scope, threat-model, attack, and report. But we measure findings probabilistically, lean heavily on creative manual testing alongside automation, and treat the system around the model, like retrieval pipelines, tools, and downstream actions, as part of the target rather than background. The traditional discipline still matters, which is why AI red teaming sits inside our broader penetration testing and vulnerability management practice rather than off in its own silo.

A red team probes the system from every angle at once, hunting for the single point where it breaks.
A red team probes the system from every angle at once, hunting for the single point where it breaks.

What we actually test

Every engagement is scoped to the system in front of us, but the categories below cover most of what we go after. They line up closely with the OWASP Top 10 for LLM Applications, which is a good shared vocabulary for talking about these risks with your engineers.

Prompt injection

The signature attack against language models, and OWASP’s LLM01. Direct injection is the user telling the model to ignore its instructions. The nastier variant is indirect: malicious instructions planted in content the model later reads, such as a web page, a PDF, an email, a support ticket. The user never sees it, but the model treats it as input and acts on it. For any system that pulls in outside content, indirect injection is the first thing we reach for. We go deep on detecting it in production under prompt injection monitoring.

Jailbreaks and guardrail bypass

Getting the model to produce output its safety layer is supposed to block. Roleplay framings, hypothetical wrappers, encoded payloads, language switching, splitting a request across several turns so no single message looks bad. We’re not collecting party tricks. We’re testing whether the guardrail is a real boundary or a thin filter that holds against the obvious phrasing and folds against a determined one. Most bolt-on filters fall into the second group.

Data extraction and leakage

Pulling out things the model should never reveal: its system prompt, secrets pasted into context, fragments of training data, or other users’ data sitting in a shared retrieval store. In RAG-backed apps this is often the highest-impact finding, because the model has live access to a knowledge base and the right phrasing convinces it to return records that the asking user was never authorized to see. The fixes here are mostly architectural, which is why we treat them as part of securing LLM applications rather than something you can prompt your way out of.

Model abuse and excessive agency

When the model can call tools, query databases, hit APIs, or trigger actions, the prompt stops being just text and becomes a way to make the system do things. OWASP labels this excessive agency, and it’s where a clever prompt turns into a real-world consequence: a refund issued, a record deleted, mail sent. We test whether a low-privilege user, or hidden text in some document, can drive the agent into actions it should never take. This overlaps heavily with how we approach controlling AI agents.

Bias and safety failures

Depending on scope, we probe for outputs that are discriminatory, harmful, or wildly off-policy for the deployment, and for the conditions that reliably trigger them. For a regulated client this is often where compliance pressure is heaviest, since a model that produces unfair or unsafe output in front of customers is both a brand problem and, increasingly, a legal one.

Our methodology, start to finish

Creativity matters in this work, but it sits on top of a repeatable process. Skip the structure and you get a pile of cool jailbreaks with no way to measure coverage or track whether anything got fixed.

1. Scoping

We start by mapping the system. What’s the model, what can it read, what tools and data can it reach, who talks to it, and what’s the worst realistic outcome if it misbehaves? That last question drives everything. Red-teaming a marketing chatbot with no tools is a different job from red-teaming an agent wired into your billing system, and the scope reflects that.

2. Threat modeling with MITRE ATLAS

We build the threat model on MITRE ATLAS, a knowledge base of real adversary tactics and techniques against AI systems, modeled on the ATT&CK framework most security teams already know. ATLAS gives us a shared map of how these systems get attacked in the wild, from reconnaissance through model access, evasion, and exfiltration, so coverage is driven by documented attacker behavior rather than whatever we happen to think of on the day.

3. Manual and automated adversarial testing

Then we attack, on two tracks at once. Automated tooling fires large batches of known attack patterns and probes for regressions, which is how you measure success rates and cover ground at scale. Manual testing is where the real findings come from: a human chaining context, reading how the model reacts, and adapting in ways no scripted suite will. Automation tells you how often a known attack lands. A skilled operator finds the attack nobody scripted yet. You need both, and the manual side is where experience separates a real red team from a tool run.

4. Reporting with reproducible findings

A finding you can’t reproduce is an anecdote. Because models are non-deterministic, we report every issue with the exact input, the conditions, an observed success rate across repeated attempts, the impact, and a concrete fix. Your engineers should be able to paste in our payload and watch it work, then confirm it’s dead after they patch. That’s the bar. Vague writeups help nobody.

json
{
  "finding_id": "AIRT-2025-014",
  "title": "Indirect prompt injection via retrieved document leads to tool call",
  "category": "Prompt Injection (OWASP LLM01)",
  "atlas_technique": "AML.T0051 - LLM Prompt Injection",
  "severity": "High",
  "target": "support-assistant (RAG + email tool)",
  "reproduction": {
    "steps": [
      "Upload a document containing hidden instructions to the knowledge base",
      "As a normal user, ask a question that retrieves that document",
      "Observe the assistant follow the embedded instruction"
    ],
    "payload_ref": "appendix/airt-014-payload.txt",
    "success_rate": "7/10 attempts"
  },
  "impact": "Untrusted content drives an authenticated email action the user never requested.",
  "remediation": [
    "Treat retrieved content as untrusted; never as instructions",
    "Require human approval for the email tool (see Excessive Agency, LLM06)",
    "Scope the tool to allow-listed recipient domains only"
  ]
}

How does AI red teaming map to compliance?

Most clients come to us for security, but the same work answers a fast-growing list of compliance questions, so it’s worth knowing how findings line up with the frameworks.

The NIST AI Risk Management Framework is built around four functions: Govern, Map, Measure, and Manage. Red teaming feeds the Measure function directly. It’s how you actually test for the risks you mapped, instead of asserting on paper that the system is safe. NIST’s generative AI profile calls out adversarial testing by name, which makes a red team engagement a clean way to show you’ve measured real risk.

The EU AI Act adds legal weight. For high-risk and general-purpose AI systems it expects adversarial testing and ongoing risk management, and providers of the most capable general-purpose models are expected to perform and document adversarial testing as part of their obligations. A red team report with reproducible findings, severities, and fixes is exactly the kind of evidence that maps onto those requirements. We don’t sell compliance theater, but the artifacts this work produces tend to be what auditors and regulators want to see.

AML.T0051The MITRE ATLAS technique ID for LLM prompt injection, the entry point behind a large share of the high-severity findings we reportMITRE ATLAS

When should you red-team an AI system?

Earlier than most teams do. The usual pattern is shipping an AI feature, watching it touch real users and real data, and only then asking whether it’s safe. By that point an attacker can ask the same question for free, in production.

Red-team before launch for anything customer-facing or anything wired to tools and sensitive data. Re-test after meaningful changes, because a new system prompt, an added tool, or a swapped base model can quietly reopen something you’d closed. And treat it as continuous for systems that keep evolving. The model and its guardrails drift, attacker techniques move fast, and a clean report from six months ago says little about today. When something does slip through in production, that’s where our incident response and threat intelligence team picks it up. This all sits inside the wider program we describe under AI security.

The model was trained to be helpful. Our job is to find every way that helpfulness can be turned against the people who deployed it, and to prove it before someone with worse intentions does.

Cyber72 offensive security team

If you’re putting an AI system in front of customers or hooking it up to anything that matters, get someone who attacks these systems to test it first. That’s the work we do. See how we run offensive testing, or talk to our team about scoping an AI red team engagement before launch rather than after the incident.

FAQ

What is AI red teaming?

AI red teaming is structured adversarial testing of an AI system. A dedicated team deliberately attacks the model to make it leak data, bypass its guardrails, follow malicious instructions, or abuse the tools it can call, then documents each issue with reproducible steps and a fix. NIST describes it as structured testing to find flaws and vulnerabilities in an AI system, usually in a controlled setting and alongside the developers.

How is AI red teaming different from penetration testing?

A pentest targets deterministic systems where the same input gives the same output and vulnerabilities fall into known classes. AI red teaming targets a probabilistic model, so findings are measured as success rates rather than pass or fail, the input space is open-ended natural language, and the worst issues are often behavioral, the model doing exactly what an attacker asked. The core attacker mindset is the same, but the methodology and the metrics differ.

What does an AI red team test for?

Common categories include prompt injection (direct and indirect), jailbreaks and guardrail bypass, data and training-data extraction, model and tool abuse from excessive agency, and bias or safety failures. These line up with the OWASP Top 10 for LLM Applications. Scope also covers the system around the model, including retrieval pipelines, tools, and any action the model can trigger.

Does AI red teaming help with the EU AI Act and NIST AI RMF?

Yes. It maps directly to the Measure function of the NIST AI Risk Management Framework, which expects you to test the risks you've identified, and NIST's generative AI guidance names adversarial testing explicitly. The EU AI Act expects adversarial testing and risk management for high-risk and general-purpose AI systems. A red team report with reproducible findings, severities, and remediation is the kind of evidence those frameworks ask for.

When should we run an AI red team engagement?

Before launching anything customer-facing or anything connected to tools and sensitive data, again after meaningful changes like a new system prompt, an added tool, or a swapped model, and continuously for systems that keep evolving. Models and their guardrails drift and attacker techniques move quickly, so a single point-in-time test goes stale faster than a traditional pentest.

Vadim L
Vadim L
cyber72 · operator

Vadim L — founder and full-stack engineer with 25+ years building platforms, security products, and AI tooling. His background spans system administration, analysis, and cybersecurity, working hands-on from architecture through delivery. He writes about offensive security and what actually holds up in production.

./retainer --start

Offensive security, on call.

Start an engagement  24/7 IR hotline 60-min responder on retainer · MTTA < 90s