attack 2026

Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF

Yuan Fang ¹, Yiming Luo ¹, Aimin Zhou ¹,², Fei Tan ¹

¹ East China Normal University

² Shanghai Innovation Institute

0 citations

Published on arXiv: 2604.17769

Prompt Injection

OWASP LLM Top 10 — LLM01

Red-Team Agents

LLMs for Security — LS06

Benchmarks & Evaluation

LLMs for Security — LS10

Key Finding

Probability clamping improves semantic coherence by 15% while preserving high toxicity scores in generated adversarial data

R-CAI

Novel technique introduced

Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique-revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.
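The abstract does not say how probability clamping is defined, so the following is only a minimal sketch of one plausible reading: a judge model's probability is clipped to a fixed interval before it is turned into a reward, so the reward saturates and the policy gains nothing by pushing the judge score to an extreme. The function names, the bounds 0.05 and 0.95, and the log-odds reward are all assumptions made for illustration and are not taken from the paper.

```python
import math


def clamp_probability(p: float, low: float = 0.05, high: float = 0.95) -> float:
    """Clip a probability into [low, high].

    The bounds are hypothetical defaults, not values from the paper.
    """
    if not 0.0 < low < high < 1.0:
        raise ValueError("require 0 < low < high < 1")
    return min(max(p, low), high)


def clamped_log_odds_reward(p: float, low: float = 0.05, high: float = 0.95) -> float:
    """Map a judge probability to a bounded log-odds reward.

    Assumed reward shape: log(q / (1 - q)) with q the clamped probability.
    Because q never reaches 0 or 1, the reward stays finite, and it stops
    increasing once p exceeds `high`.
    """
    q = clamp_probability(p, low, high)
    return math.log(q / (1.0 - q))


if __name__ == "__main__":
    for p in (0.5, 0.9, 0.95, 0.999):
        print(p, clamped_log_odds_reward(p))
```

Under these assumptions, scores of 0.95 and 0.999 receive the same reward, which is the saturation effect the sketch is meant to show. Whether the paper clamps judge probabilities, policy token probabilities, or something else is not stated in the abstract.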

Key Contributions

Reverse Constitutional AI (R-CAI) framework that inverts harmlessness principles into a constitution of toxicity for systematic adversarial data generation
Probability clamping mechanism within RLAIF to prevent reward hacking and preserve semantic coherence while maintaining adversarial strength
Fully automated pipeline for scalable multi-dimensional toxic data synthesis across legal/ethical, social bias, behavioral consequence, and deception categories

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm, transformer

Threat Tags

black_box, inference_time, targeted

Applications

llm safety evaluation, red teaming, adversarial robustness testing

