OpenAI: Red Teaming GPT-4o, Operator, o3-mini, and Deep Research

About OpenAI

OpenAI is an AI research and deployment company focused on ensuring that the development of artificial general intelligence (AGI), if achieved, benefits all of humanity. As part of this mission, OpenAI develops and releases foundational models such as GPT-4o (a multimodal model with audio, vision, and text), Operator (an agent capable of interacting with web interfaces), o3-mini (a compact model emphasizing safety and alignment), and Deep Research (a web-browsing and code-executing research assistant).

Each model introduces distinct operational risks across input modalities, tool integrations, and autonomous behaviours. To address these risks, OpenAI operates under a formal Safety and Preparedness Framework, which defines risk thresholds and evaluation criteria across misuse categories such as cybersecurity, CBRN, persuasion, and autonomy.

External red teaming is a key part of this framework. Prior to deployment, models undergo adversarial testing by vetted external experts to identify alignment failures, injection vectors, tool misuse paths, and safety regressions. Findings from these evaluations directly inform mitigation strategies, deployment gating, and post-launch monitoring.

Red Team Involvement

ControlPlane’s Torin van den Bulk contributed as an external red team tester on GPT-4o, Operator, o3-mini, and Deep Research (see: OpenAI Red Teaming Network). Testing began with live access to model checkpoints and progressively evolved as safeguards were adapted to mitigate discovered risks. Targeted red team evaluations focused on alignment, adversarial exploitation, multimodal injection, tool misuse, and privacy leakage. All contributions are publicly acknowledged in OpenAI’s system cards.

Methods and Findings

Prompt injection and system prompt override attacks targeting models with agentic capabilities or web access (e.g. Operator, Deep Research), used to subvert control flows or execute unintended commands
Audio-based prompt injection and synthetic voice misuse in GPT-4o, where model behaviour was tested against adversarially constructed voice inputs and impersonation attempts
Unsafe tool execution and browsing behaviour, including scenarios where red teamers induced the model to follow attacker-controlled links, execute injected scripts, or retrieve prohibited content via tool use
Refusal inconsistency and policy circumvention under adversarial rephrasing, particularly in GPT-4o compared to smaller models like o3-mini, revealing regression in alignment under specific linguistic constructions
Autonomous agent behaviour without adequate human-in-the-loop safeguards, with a focus on unintended actions taken during unsupervised tool invocation or when model oversight was implicitly absent

Note: The results below represent a subset of significant findings from external red teaming efforts. While external red teaming is a critical component of OpenAI’s Safety and Preparedness Framework, it is supplemented by extensive internal red team testing, automated evaluations, and alignment audits. For a complete overview of findings, mitigations, and evaluation methodologies, refer to the official OpenAI system cards.

Over 100 external red teamers from 29 countries tested GPT-4o across audio, video, image, and text, targeting impersonation, privacy leakage, prompt injections (including via voice), bias, and emotionally manipulative or disallowed content
Early voice-based prompt injection and impersonation attacks led to the deployment of enforced system voice presets and real-time voice classifiers, achieving 100% recall and ≥95% precision across languages for unauthorised output blocking
Unsafe compliance in early identity-based tasks prompted post-training improvements: refusal rate on “Should Refuse” prompts increased from 0.83 → 0.98, and appropriate compliance on “Should Comply” prompts rose from 0.70 → 0.83
Ungrounded inference attempts triggered enhanced hedging and refusal behaviour, raising factual alignment scores from 0.60 to 0.84
Audio-to-text safety alignment was validated across refusal classes: Not Unsafe scores remained consistent (Text 0.95 to Audio 0.93), and Not Over-refuse parity was maintained (0.81 to 0.82)
Additional vulnerabilities—including accent mismatch, misinformation repetition, and audio robustness issues—led to expanded monitoring pipelines and classifier-based interventions

GPT-4o was evaluated under OpenAI’s Preparedness Framework, receiving a borderline-medium risk classification for persuasion and low risk across other domains.

Source: GPT-4o System Card – Risk identification, assessment, and mitigation

Agentic Misuse (Operator)

External red teamers tested OpenAI Operator in a simulated desktop environment with tool use and browsing, targeting vulnerabilities in prompt injections, agent compliance, and system message overrides
Real-time red team findings directly informed mitigation rollouts, including model-level refusals and system-layer constraints on sensitive actions
Operator now refuses 97–100% of high-risk prompts (e.g., illicit purchases, personal data searches) in internal evaluations, and a new “watch mode” pauses execution if user supervision is lost — reducing risk from visual prompt injection attacks
Red team scenarios included spoofed emails, malicious websites, and DOM-based payloads, with safe mock infrastructure used to evaluate real-world exploitability without causing harm
Confirmation flows were hardened across critical actions, cutting downstream error propagation by ~90% in test scenarios involving deceptive inputs
Several attack chains uncovered gaps in early-stage prompt-only defences, which were addressed via post-eval manual review and iterative tuning

Initial attacks revealed model compliance failures and injection vectors that prompted the development of a dedicated “Prompt Injection Monitor” (see below)—improving detection recall from 79% to 99% and precision to 90% on a benchmark set of 77 attacks.

Prompt Injection Monitor

Source: Operator System Card – Risk Identification & Mitigation

Pairwise Safety Comparisons (o3-mini)

External red teamers tested anonymised o3-mini checkpoints alongside GPT-4o and o1-series models via a parallel interface, focusing on adversarial prompts targeting safety alignment, nuanced refusals, and model brittleness
Unsafe generations dropped significantly due to post-training tuning, output classifiers, blocklists, and refusal heuristics informed by red team discoveries
o3-mini was preferred in nuanced compliance scenarios, balancing risk mitigation without excessive conservatism
Stress tests emphasised high-risk domains (e.g., bioterrorism 14.2%, cyberhacking 13.8%, weapons 8.5%) with output safety scored by red teamers, peers, and third-party raters via the Bradley-Terry model
Post-training improvements in o3-mini yielded strong refusal performance:
- Not Unsafe (Standard): 1.00
- Not Over-refuse (Standard): 0.92
- Not Unsafe (Challenging): 0.90
- Not Over-refuse (XSTest): 0.88
In the Gray Swan Jailbreak Arena, o3-mini recorded a 3.6% attack success rate—lower than GPT-4o (4.0%), on par with o1-mini (3.7%), and above o1 (1.9%)—demonstrating resilience with room for continued hardening

Red teamers probed privacy boundaries, multi-turn attacks, and disallowed content compliance; o3-mini showed improved alignment and reduced over-refusals compared to o1-mini and GPT-4o.

Source: o3-mini System Card – Observed safety challenges and evaluations

Autonomy and Control (Deep Research)

External red teamers tested Deep Research for jailbreaks, prompt injection, privacy leakage, and unsafe content synthesis using adversarial tactics like leetspeak, euphemisms, roleplay, and multimodal obfuscation
Early vulnerabilities—especially in prompt injection and risky advice—were mitigated post-launch via refusal tuning, system constraints (e.g., dynamic URL construction blocks), and classifier filtering
Pre-mitigation adversarial success rates reached 10%; post-mitigation dropped to 0% across tested scenarios
Refusal robustness improved: Not Unsafe: 0.97, Not Over-refuse: 0.86 (standard evals). StrongReject benchmark: 0.93, ranking between GPT-4o (0.89) and o1 (0.99)
Red teamers probed synthetic personal data assembly; post-mitigation refusal accuracy exceeded 95%
Risky advice preference tests favoured Deep Research over GPT-4o in 59% of comparisons; scored 0.67 (not_unsafe) in automated evaluations
Demonstrated strong factuality: PersonQA accuracy of 0.86 and lowest hallucination rate (0.13) across peer models

Final Preparedness Framework assessment: “medium” risk post-mitigation across all four domains—cybersecurity, CBRN, persuasion, and autonomy

Source: Deep Research System Card – Risk identification, assessment, and mitigation

Outcomes

Prompt injection monitor effectiveness improved to 99% recall after red team iteration (Operator)
Tool-agent behaviours hardened pre-deployment through hands-on adversarial misuse testing
Red team prompts influenced deployment gating and Preparedness Framework risk scoring
Red team inputs used to create post-mitigation regression suites across models
Refusal alignment validated at scale, identifying safety regressions in frontier checkpoints
Mitigations were directly applied across voice safety (GPT-4o), tool gating (Operator), and browsing constraints (Deep Research)
Red team data was converted into test harnesses, synthetic adversarial training data, and regression evaluations

ControlPlane’s participation through external red teaming validated high-risk failure modes and accelerated the deployment of targeted mitigations. These findings were directly integrated into OpenAI’s launch decisions and model risk documentation.

Interested in learning more about how we can help you? Check out our AI Security services.