Out on the GenAI Wild West: Part I - Red Team Redemption

Out on the GenAI Wild West
Enterprise AI adoption is outpacing security at breakneck speed. Organizations are rapidly deploying large language models (LLMs) across customer service, code generation, and business intelligence—driven by competitive pressures that often prioritize speed-to-market over security posture.
While some AI security challenges mirror traditional software security—such as supply chain/data provenance, input/output filtering, and access controls—LLMs introduce new attack and defense patterns. Unlike conventional software systems, which exhibit predictable behaviors, LLMs are probabilistic systems that generate unpredictable responses and exhibit emergent behaviors that are difficult to anticipate. The business impact of these risks can be substantial: Google’s market value slid $100bn after Bard shared inaccurate information in a promotional video, illustrating how AI failures translate directly to financial and reputational consequences.
Traditional security tooling, frameworks, and red teaming approaches, designed for deterministic software, struggle to aptly secure AI systems where attacks exploit the capabilities that make LLMs valuable—natural language understanding, reasoning, and content generation. Attackers can manipulate conversational context, inject malicious prompts through seemingly innocuous natural language requests, and exploit training biases to extract sensitive information or bypass safeguards.
This reality has driven the security industry to adapt traditional red teaming practices for the AI era. Red teaming large language models (LLMs) is a systematic process of probing your GenAI application (or model by itself) with adversarial inputs to uncover vulnerabilities before attackers do (ok, yeah - we know what red teaming is… right?).
Well, something fundamentally new has emerged: AI itself is now used to red team AI—not just automating test execution, but inventing, mutating, and scaling new exploits in ways that traditional penetration testing tools cannot. Unlike legacy security scanners, which rely strictly on static signatures or rule-based fuzzing, LLM-powered red teamers can generate novel attack strategies, adapt to defenses in real time, and discover failure modes that would be impractical or impossible for humans to find manually.
Red Team Redemption
Early LLM red teaming was typically manual—security experts crafting prompts to elicit unsafe or unintended outputs. This approach, while creative, was slow and limited in scope and scale. Recent research shows that AI-driven red teaming is not just faster—it’s also exhaustive and can demonstrate a high degree of novelty, enabling exploration of attack spaces in ways humans struggle to match. Automated agents can run thousands of iterations, mutate prompts with subtle syntax changes, and systematically probe for weaknesses that would be tedious or impossible for manual testers to uncover.
- HARM (Holistic Automated Red Teaming) uses multi-turn LLMs to generate and refine attack prompts, leveraging risk taxonomies to expose vulnerabilities systematically. HARM demonstrated that automated agents can uncover more issues faster than human testers alone.
- AutoRedTeamer introduced a dual-LLM architecture where one model generates attacks and another learns from new vulnerabilities, enabling continuous adaptation. This approach achieved 20% higher attack success rates on HarmBench against Llama-3.1-70B while reducing computational costs by 46% compared to existing automated approaches.
- GOAT (Generative Offensive Agent Tester) framework achieved 97% attack success against Llama 3.1 and 88% against GPT-4 on standard benchmarks.
- MasterKey Jailbreaker demonstrated that even closed-source, heavily-guarded LLMs remain vulnerable to systematic, AI-generated attacks, with success rates over 20% across top chatbots.
- ADV-LLM fine-tuned adversarial LLMs for attack generation, transferring exploits to proprietary models with up to 99% success on GPT-3.5 and nearly 50% on GPT-4.
- TAP (Tree of Attacks with Pruning) showed that iterative, feedback-driven prompt mutation can reliably bypass safety guardrails, with more capable models often being easier to break.
Using AI to red team AI is not just a step up in efficiency—it’s a fundamental shift in capability and, frankly, an operational necessity. The same AI capabilities defenders use to uncover vulnerabilities are equally available to adversaries. Automated, adaptive AI agents can invent, mutate, and scale attacks in ways that legacy penetration testing exercises never could, exposing both known and unknown risks. This arms race drives rapid innovation in offence and defense for LLM security.
For regulated industries, this represents existential risk: compliance violations, regulatory fines, and reputational damage from a single successful exploit. The risk of data leakage or unauthorized behaviors from non-deterministic GenAI applications must be measured and mitigated with evidence-based rigour.
The question isn’t whether adversaries will target your GenAI deployments—it’s whether your security program can evolve as fast as they do.
Houston… We still have a problem.
CISOs face a fundamental challenge: how do you quantify risk when model providers control both the product and the evaluation metrics? Most security assessments rely on vendor-supplied benchmarks, internal testing methodologies, or static datasets that suffer from “dataset rot”—where models have already learned to defend against known attack patterns (e.g., from adversarial datasets of known exploits like prompt injections) through their training process.
This creates a critical blind spot. LLMs are inherently non-deterministic systems where identical inputs can produce different outputs based on temperature settings, model architecture, and computational paths on inference. This probabilistic behavior means security assessments must account for statistical variance: a prompt that fails 90% of the time may still succeed enough to represent significant risk. Additionally, models can exhibit “alignment drift”—where safety behaviors degrade under conversational pressure or as context windows expand, making single-turn evaluations insufficient for real-world risk assessment.
The opacity problem compounds when organizations deploy multiple models across different use cases. Without independent, standardized evaluation frameworks, it becomes impossible to compare security postures, validate vendor claims, or make evidence-based architecture decisions. Enterprise security programs need quantitative, reproducible metrics that reflect actual deployment scenarios—not optimized demo conditions.
Model providers like OpenAI address these challenges through their Preparedness Framework, which systematically evaluates foundational model capabilities across safety, security, and self-improvement domains. They employ rigorous safeguards, including internal/external red teaming and detailed safety evaluations documented in system cards. However, as these models (and their training datasets) remain closed source, independent verification of safety & security claims is nearly impossible. To learn more about this process, read our case study about working in OpenAI’s external red teaming network (RTN).
Organizations need independent, transparent evaluation frameworks to validate vendor security claims and make evidence-based deployment decisions. Open source tools like Promptfoo represent this alternative approach, providing standardized adversarial testing capabilities that any organization can configure, deploy, and audit.
However, rather than relying on existing assessments, we evaluated some of these foundational models ourselves. To do this, we created a containerized AI Red Team environment and conducted adversarial red team testing across 1,133 attack scenarios. By red teaming foundational models ourselves, we aim to explore how independent evaluation frameworks address these transparency gaps.
Experimental Design
Our test environment combines three open source tools in a containerized stack optimized for security research and rapid experimentation. The lightweight, modular architecture prioritizes flexibility for rapid iteration while providing the foundation for developing and evaluating next-generation LLM adversarial tests at scale.
By integrating both attack generation (Promptfoo) and defensive filtering (LLM Guard), our setup enables thorough evaluation of the broader AI security ecosystem—measuring baseline model behavior, and the real-world effectiveness of protection layers that organizations would actually deploy. As such, this approach helps to provide actionable intelligence on whether security tooling can meaningfully reduce attack success rates across different models and threat scenarios.
- Promptfoo - Automated evaluation platform executing two-phase testing:
- Baseline Phase: 381 single-turn attacks from Foundation Model security preset across four domains (Security & Access Control, Compliance & Legal, Trust & Safety, Brand Safety)
- Advanced Phase: 752 (381 single-turn attacks + 371 multi-turn GOAT attacks) simulating persistent attacker behavior (Research Paper)
- LLM Guard - Real-time input/output filtering framework with optimized configuration:
- Input scanners: Anonymize, Gibberish, PromptInjection, Secrets, Toxicity
- Output scanners: De-anonymise, Gibberish, MaliciousURLs, NoRefusal, Sensitive, Toxicity
- OpenRouter API - Unified access to 100+ foundation models for standardized testing
We selected models based on Promptfoo’s Foundation Model Security Reports, choosing two top-performing models with fundamentally different training philosophies and safeguards:
- OpenAI o3-mini: Highly-aligned (closed source) commercial model with security scores of 87%/86%/74%/55% across security categories (o3-mini Report)
- DeepSeek R1: High-performance open source model with lower (yet significant) baseline alignment scores of 75%/48%/56%/56% (DeepSeek R1 Report)
Our experimental design established baseline model behavior using the 381 single-turn scenarios, then escalated to 371 advanced multi-turn GOAT attacks (while again, re-running the initial 381 tests for result conformity) to simulate persistent adversarial pressure. This progression from static to adaptive testing across four critical domains (Security & Access Control, Compliance & Legal, Trust & Safety, and Brand Safety) reflects an ever-evolving threat landscape where the boundaries between security, safety, and alignment increasingly blur—making comprehensive evaluation essential for informed deployments in regulated environments.
Experimental Results
Each model was tested in two configurations—unprotected and protected by LLM Guard—revealing significant differences in inherent model security and the effectiveness of defensive tooling.
Multi-Turn Attacks Expose Alignment Degradation
Single-turn assessments fail to capture the persistence and adaptability of real adversaries. Our implementation of the GOAT (Goal-Oriented Adversarial Testing) strategy—which simulates persistent attacker behavior through multi-turn conversations—revealed critical vulnerabilities that traditional, single-turn assessments could easily miss.
DeepSeek R1 exhibited a notable shift in performance under multi-turn context expansion (10.2% → 32.0% ASR), while o3-mini maintained relatively consistent performance throughout. These findings suggest variable differences in alignment robustness when exposed to an extended conversational context. LLM Guard appeared to provide stronger protection for DeepSeek R1 under advanced testing (reducing ASR from 32.0% to 4.9%) compared to o3-mini’s more modest improvement—indicating that defensive tooling effectiveness may vary by model architecture.
Nonetheless, it’s important to interpret these results with caution due to the inherently non-deterministic nature of LLMs. The same prompt can produce different outputs across repeated trials, particularly at higher temperature settings, meaning that attack success rates represent statistical tendencies rather than absolute guarantees. Additionally, LLM Guard itself uses SpaCy models for its input/output filtering decisions, introducing another layer of non-determinism into the protection pipeline. These compounding probabilistic behaviors introduce variance that could affect the precision of comparative assessments, especially with smaller sample sizes.
What remains clear is that there’s unlikely to be a universal security solution in the GenAI landscape. Organizations would benefit from adaptive, continuous assessment approaches that refine protective mechanisms based on observed model-specific behavior patterns. Assessment outcomes should ideally be used to train and fine-tune defensive scanners and LLM-as-a-judge models to better detect potentially harmful inputs and outputs within specific deployment contexts.
Model Architecture Influences Defense Effectiveness
LLM Guard’s protection efficacy varied dramatically by model architecture, demonstrating that security solutions must be calibrated and fine-tuned for each deployment. The data reveals clear patterns in how different models respond to defensive filtering:
DeepSeek R1 exhibits substantially degraded defense efficacy (68%) under persistent adversarial evaluation compared to o3-mini (98-99%), suggesting fundamental disparities in how these foundation models’ parameter distributions and transformer configurations respond to multi-turn adversarial inputs. Second, the implementation of filtering mechanisms yields heterogeneous performance improvements: while DeepSeek R1 demonstrates dramatic response optimization (68% → 99% in Compliance), o3-mini exhibits more conservative enhancement metrics, indicating divergent optimal defense configuration requirements based on underlying model training objectives and tokenization methodologies. Third, Brand Safety resistance remains suboptimal across both model implementations, with post-filtering protective efficacy plateauing at 83-87% success rates—creating quantifiable reputational exposure vectors for customer-interfacing implementations.
Organizations can enhance defensive countermeasures through sophisticated training regimens and algorithmic refinements. Implementing fine-tuned evaluative models that function as discriminative classifiers can supplement traditional rule-based filtering mechanisms. These classifier models can undergo organization-specific transfer learning using adversarial examples harvested directly from red team assessment data, creating specialised detection capabilities calibrated to production deployment vulnerabilities. Implementing dynamic confidence thresholds optimized through empirical attack data analysis rather than static filtering parameters provides another enhancement vector. For instance, when red team assessments consistently compromise DeepSeek’s Trust & Safety boundaries at specific confidence intervals, automated threshold recalibration can be triggered based on statistical analysis of successful penetration patterns. Furthermore, integrating closed-loop training systems where successful attacks identified during adversarial testing automatically populate continuous defense training pipelines creates an evolving security posture with measurably improved resistance against emergent threat patterns.
For enterprises where brand equity constitutes significant business value, these defensive limitations present quantifiable risk exposure: Google’s $100 billion market capitalization reduction following Bard’s factual inaccuracies demonstrates the direct financial impact of foundation model failures. Implementing model-specific defensive configurations with continuous feedback mechanisms derived from adversarial testing represents a critical risk mitigation strategy for production environments.
Vulnerability Analysis
Evaluation across 1,133 test scenarios (381 basic + 752 advanced) using Promptfoo’s Foundation Model preset demonstrates distinct risk profiles between o3-mini and DeepSeek R1. These models were selected based on their top-5 rankings in Promptfoo’s Foundation Model Security Reports across four security categories (o3-mini: 87%/86%/74%/55%
, DeepSeek R1: 75%/48%/56%/56%
).
- DeepSeek R1 Advanced: Cybercrime (45%), Disinformation (55%), Hallucination (50%), Profanity (70%)
- o3-mini Advanced: Over-reliance paradox—filtering increases vulnerability (65% → 75%)
- Entity Impersonation achieved 100% basic success rates across both models
- Critical categories (WMD, Sexual Crime, Hate Speech) effectively filtered to 0-5%
LLM Guard effectiveness varies dramatically by model: 41-85% attack reduction for DeepSeek R1 versus 37-49% for o3-mini. This variance indicates that defensive tooling must be calibrated per model, with some architectures requiring specialised filtering approaches while others benefit from standard rule-based protection.
The GOAT Strategy: Persistence Trumps Protection
Goal-Oriented Adversarial Testing (GOAT) moves beyond single-prompt testing to persistent, multi-turn conversations. The GOAT methodology, introduced by Meta researchers in 2024, demonstrates how multi-turn persistence enhances attack effectiveness, achieving an Attack Success Rate (ASR@10) of 97% against Llama 3.1 and 88% against GPT-4-Turbo on the JailbreakBench dataset.
GOAT is as an autonomous red teaming agent leveraging a general-purpose LLM (like GPT-4o) to conduct sophisticated attacks. Rather than relying on manually crafted jailbreaks that suffer from dataset rot, GOAT employs an AI agent that reasons about attack strategies, adapts to defenses, and escalates through conversational turns.
The effectiveness stems from exploiting how LLMs process extended conversations. Models exhibit “alignment drift,” where safety behaviors degrade as conversational context expands. GOAT systematically exploits this through reconnaissance (establishing rapport and probing boundaries), exploitation (introducing prohibited elements using gathered context), and escalation (executing direct policy violations through persistent contextual pressure).
Conclusion
Here’s the uncomfortable truth: even the most sophisticated and “safeguarded” LLMs on the market today struggle under persistent adversarial pressure. With that said, single-turn security assessments are fundamentally inadequate for evaluating real-world adversarial risk. When attackers can exploit natural language processing through multi-turn conversational context, alignment drifts and success rates increase, exposing vulnerabilities that basic single-shot testing wouldn’t discover.
The technical implications of this alignment drift extend far beyond these laboratory conditions. Our research demonstrates that attack success rates can jump nearly 3x when adversaries shift from single-turn to multi-turn strategies. Our findings aren’t merely a statistical anomaly but a fundamental breakdown in how LLMs align with safety training and adhere to safeguards under extended conversational pressure. We’re witnessing the emergence of a new threat landscape where AI agents don’t just automate attacks—they reason about defensive patterns, adapt their strategies mid-conversation, and systematically erode model alignment on an increasingly larger scale.
With that said, let’s be clear about the limitations here. We tested two carefully selected models with a lightweight defensive configuration in a controlled environment. Real production deployments will face different attack vectors, employ more sophisticated defenses, and operate under mounting operational constraints. The 37-85% variance in protection efficacy across models tells us something crucial: there’s no universal security solution for AI. Every deployment needs its own threat model, defensive strategy, and continuous evaluation framework.
We’re seeing the emergence of an evolving security paradigm. Traditional cybersecurity assumes deterministic systems where inputs produce predictable outputs and failure modes can be catalogued and defended against. AI security, however, operates in a probabilistic world where the same input can yield different outputs, emergent behaviors arise from complex natural language interactions, and security must be measured in statistical confidence intervals rather than binary pass/fail states. This fundamental difference means applying traditional InfoSec approaches to AI systems often misses the most critical vulnerabilities.
Our research provides a small glimpse into how systematic AI red teaming can be performed. As model capabilities expand and attack sophistication increases, the security community needs independent evaluation frameworks that can keep pace. The closed-source nature of leading models means organizations can’t rely solely on vendor security claims—they need empirical data from adversarial testing tailored to their specific deployment scenarios. The alternative is flying blind in an increasingly hostile environment where the cost of failure is measured in regulatory fines, market capitalization, and public trust.
Stay tuned for Part II, where we’ll connect these research findings to practical implementation frameworks, security tools, and regulatory compliance guidelines.
Interested in learning more about how we can help you? Check out our AI Security services.
Related blogs

See it, Hack It, Sort It: How Open Source Software Protects Our AI Enablers

Open Source in Finance Forum New York 2024 Recap
