
Generalization Limits of Reinforcement Learning Alignment

Haruhi Shida , Koo Imai , Keigo Kansa



Published on arXiv: 2604.02652

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Attack success rate increased from 14.3% with individual jailbreak methods to 71.4% when combining multiple attack techniques against OpenAI gpt-oss-20b

Compound Jailbreaks

Novel technique introduced


The safety of large language models (LLMs) relies on alignment techniques such as reinforcement learning from human feedback (RLHF). However, recent theoretical analyses suggest that reinforcement learning-based training does not acquire new capabilities but merely redistributes the utilization probabilities of existing ones. In this study, we propose "compound jailbreaks" targeting OpenAI gpt-oss-20b, which exploit the generalization failures of alignment. This approach combines multiple attack techniques, each individually defended against, to saturate the instruction hierarchy maintenance process. Our evaluation shows that the attack success rate (ASR) increased from 14.3% with individual methods to 71.4% with the combined approach. These results provide empirical evidence for the hypothesis that safety training does not generalize as broadly as model capabilities, highlighting the need for multifaceted safety evaluations using compound attack scenarios.


Key Contributions

  • Proposes 'compound jailbreaks' that combine multiple individually defended attack techniques to exploit the generalization limits of RLHF alignment
  • Empirically validates the hypothesis that safety training does not generalize as broadly as model capabilities
  • Demonstrates that compound attacks increase ASR from 14.3% (individual methods) to 71.4% on OpenAI gpt-oss-20b
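The evaluation protocol implied above — chain several prompt transformations into one compound attack, then measure the fraction of prompts that bypass safety filtering — can be sketched abstractly. This is a minimal illustration of how such an ASR harness might be structured, not the paper's actual code; all names (`compose`, `attack_success_rate`, the model and judge callables) are hypothetical, and no concrete jailbreak content is included.

```python
# Illustrative ASR evaluation harness (all names hypothetical, not the paper's code).
from typing import Callable, List

Transform = Callable[[str], str]

def compose(transforms: List[Transform]) -> Transform:
    """Chain several prompt transformations into one compound attack."""
    def combined(prompt: str) -> str:
        for t in transforms:
            prompt = t(prompt)
        return prompt
    return combined

def attack_success_rate(prompts: List[str],
                        transform: Transform,
                        model: Callable[[str], str],
                        judge: Callable[[str], bool]) -> float:
    """Fraction of transformed prompts whose model response the judge flags as a bypass."""
    successes = sum(judge(model(transform(p))) for p in prompts)
    return successes / len(prompts)
```

Under this framing, the paper's headline numbers correspond to running `attack_success_rate` once per individual transform (≈14.3% each) and once with `compose` over all of them (71.4%), holding the prompt set, model, and judge fixed.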

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted
Applications
llm safety, chatbot, dialogue systems