defense 2026

Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay

Hao Wang ¹, Yanting Wang ¹, Hao Li ¹, Rui Li ², Lei Sha ¹,³

¹ Beihang University

² Peking University

³ Zhongguancun Laboratory

0 citations · 57 references · arXiv

Published on arXiv

2601.10589

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SSP autonomously evolves robust defense capabilities that significantly outperform baselines trained on static adversarial datasets across extensive experiments.

Safety Self-Play (SSP)

Novel technique introduced

Large Language Models (LLMs) have achieved remarkable capabilities but remain vulnerable to adversarial "jailbreak" attacks designed to bypass safety guardrails. Current safety alignment methods depend heavily on static external red teaming, utilizing fixed defense prompts or pre-collected adversarial datasets. This leads to a rigid defense that overfits known patterns and fails to generalize to novel, sophisticated threats. To address this critical limitation, we propose empowering the model to be its own red teamer, capable of achieving autonomous and evolving adversarial attacks. Specifically, we introduce Safety Self-Play (SSP), a system that utilizes a single LLM to act concurrently as both the Attacker (generating jailbreaks) and the Defender (refusing harmful requests) within a unified Reinforcement Learning (RL) loop, dynamically evolving attack strategies to uncover vulnerabilities while simultaneously strengthening defense mechanisms. To ensure the Defender effectively addresses critical safety issues during the self-play, we introduce an advanced Reflective Experience Replay Mechanism, which uses an experience pool accumulated throughout the process. The mechanism employs an Upper Confidence Bound (UCB) sampling strategy to focus on failure cases with low rewards, helping the model learn from past hard mistakes while balancing exploration and exploitation. Extensive experiments demonstrate that our SSP approach autonomously evolves robust defense capabilities, significantly outperforming baselines trained on static adversarial datasets and establishing a new benchmark for proactive safety alignment.
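
The UCB-based replay described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the scoring formula (one minus the Defender reward, plus a standard UCB1-style exploration bonus), the `Experience` fields, and the names `ucb_score` and `sample_batch` are all assumptions chosen for illustration, with rewards assumed to lie in [0, 1].

```python
import math
from dataclasses import dataclass


@dataclass
class Experience:
    """One stored episode (hypothetical fields, not taken from the paper)."""
    prompt: str
    reward: float          # assumed Defender reward in [0, 1]; low means the defense failed
    times_sampled: int = 0


def ucb_score(exp: Experience, total_samples: int, c: float = 1.0) -> float:
    """Assumed UCB1-style score: favor low-reward and rarely replayed cases."""
    if exp.times_sampled == 0:
        return math.inf    # never-replayed cases are tried first
    exploit = 1.0 - exp.reward
    explore = c * math.sqrt(math.log(total_samples) / exp.times_sampled)
    return exploit + explore


def sample_batch(pool: list, k: int, c: float = 1.0) -> list:
    """Pick up to k distinct experiences by repeatedly taking the best UCB score."""
    chosen = []
    for _ in range(min(k, len(pool))):
        # +1 keeps log() defined when nothing has been sampled yet
        total = sum(e.times_sampled for e in pool) + 1
        remaining = [i for i in range(len(pool)) if i not in chosen]
        best = max(remaining, key=lambda i: ucb_score(pool[i], total, c))
        pool[best].times_sampled += 1
        chosen.append(best)
    return [pool[i] for i in chosen]
```

With equal replay counts, the lowest-reward case is selected first; as a case is replayed more often its exploration bonus shrinks, which lets other entries in the pool surface. The constant `c` controls that trade-off and would need tuning in any real setup.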

Key Contributions

Safety Self-Play (SSP): a single LLM acting concurrently as both attacker (jailbreak generator) and defender (refusal) within a unified RL training loop, enabling adversarial co-evolution without a fixed external attacker.
Reflective Experience Replay Mechanism with UCB sampling that prioritizes low-reward failure cases from an accumulated experience pool, balancing exploration of new attacks and exploitation of past hard mistakes.
Empirical demonstration that SSP significantly outperforms defenses trained on static adversarial datasets, generalizing better to novel jailbreak strategies.
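
The first contribution, one model playing both roles, can be sketched as a single data-collection round. This is a hedged outline under stated assumptions rather than the authors' training loop: `generate` and `judge` are hypothetical callables standing in for the shared LLM and a safety judge, the zero-sum reward split is an assumption, and the RL policy update itself is omitted.

```python
from typing import Callable, Dict, List


def self_play_round(
    generate: Callable[[str, str], str],   # hypothetical: (role, text) -> model output
    judge: Callable[[str, str], float],    # hypothetical: (attack, response) -> safety in [0, 1]
    seeds: List[str],
) -> List[Dict[str, object]]:
    """Collect one round of attacker/defender episodes from a single shared model."""
    episodes = []
    for seed in seeds:
        # The same model is called under two different roles.
        attack = generate("attacker", seed)
        response = generate("defender", attack)
        safety = judge(attack, response)
        episodes.append({
            "attack": attack,
            "response": response,
            "defender_reward": safety,
            # Assumed zero-sum split: the attacker gains when the defense fails.
            "attacker_reward": 1.0 - safety,
        })
    return episodes
```

Episodes collected this way would feed both the policy update and the experience pool; low `defender_reward` entries are the failure cases the replay mechanism is meant to revisit.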

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer, rl

Threat Tags

inference_time, black_box

Applications

llm safety alignment, jailbreak defense

Similar Papers

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization

Learning to Extract Context for Context-Aware LLM Inference

Safety Alignment of LMs via Non-cooperative Games

Constitutional Classifiers++: Efficient Production-Grade Defenses against Universal Jailbreaks

SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG

SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations

Knowing When Not to Answer: Lightweight KB-Aligned OOD Detection for Safe RAG