defense 2025

Rebellion: Noise-Robust Reasoning Training for Audio Reasoning Models

Tiansheng Huang 1, Virat Shejwalkar 2, Oscar Chang 2, Milad Nasr 3, Ling Liu 1

0 citations · arXiv


Published on arXiv · 2511.09682

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Rebellion protects Qwen2-Audio against advanced audio jailbreaks without degrading benign task performance, significantly outperforming standard reasoning training on the accuracy-safety trade-off.

Rebellion

Novel technique introduced


Instilling reasoning capabilities in large models (LMs) using reasoning training (RT) significantly improves LMs' performance. Thus Audio Reasoning Models (ARMs), i.e., audio LMs that can reason, are becoming increasingly popular. However, no work has studied the safety of ARMs against jailbreak attacks that aim to elicit harmful responses from target models. To this end, we first show that standard RT with appropriate safety reasoning data can protect ARMs from vanilla audio jailbreaks, but cannot protect them against our proposed simple yet effective jailbreaks. We show that this is because of the significant representation drift between vanilla and advanced jailbreaks, which forces the target ARMs to emit harmful responses. Based on this observation, we propose Rebellion, a robust RT that trains ARMs to be robust to the worst-case representation drift. All our results are on Qwen2-Audio; they demonstrate that Rebellion: 1) can protect against advanced audio jailbreaks without compromising performance on benign tasks, and 2) significantly improves the accuracy-safety trade-off over the standard RT method.


Key Contributions

  • First study of jailbreak safety in Audio Reasoning Models, revealing that standard reasoning training with safety data fails against advanced audio jailbreaks due to representation drift
  • Novel audio jailbreak attacks that exploit representation drift between vanilla and advanced jailbreak inputs to force harmful outputs
  • Rebellion: a robust reasoning training method that trains ARMs to resist worst-case representation drift, improving accuracy-safety trade-off over standard reasoning training on Qwen2-Audio
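The "worst-case representation drift" idea in the last contribution can be summarized as a min-max objective. This is our illustrative reading in generic notation, not a formula taken from the paper:

```latex
\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \;
\max_{\|\delta\| \le \epsilon} \;
\mathcal{L}\bigl(f_{\theta}(h(x) + \delta),\, y\bigr)
```

Here \(h(x)\) stands for the audio input's internal representation, \(\delta\) for a norm-bounded drift of that representation, and \(\mathcal{L}\) for the safety-reasoning training loss; the outer minimization trains the model to behave safely even under the most adversarial drift the inner maximization can find.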

🛡️ Threat Analysis

Input Manipulation Attack

The proposed advanced jailbreaks exploit "representation drift" through audio-domain manipulations (noise and perturbations) to bypass safety guardrails, analogous to adversarial visual inputs against VLMs. Rebellion counters this by training against the worst-case representation drift, functioning as adversarial robustness training in the audio domain.
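The adversarial-robustness-training analogy can be sketched as a projected-gradient min-max loop: an inner maximization finds a bounded representation drift that most increases the loss, and the outer loop trains on the drifted representations. The snippet below is a minimal toy sketch of that pattern on a linear "safety head" over fixed feature vectors; all names, shapes, and hyperparameters are our illustrative assumptions, not the paper's actual setup.

```python
import numpy as np

# Toy linear "safety head" over a fixed audio representation.
# Everything here is an illustrative assumption, not Rebellion's real architecture.

rng = np.random.default_rng(0)
DIM = 16

def safety_loss(w, h, y):
    # Logistic loss of a linear classifier on representation h (label y in {0, 1}).
    z = h @ w
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def worst_case_drift(w, h, y, eps=0.5, steps=5, lr=0.2):
    """Inner maximization: find a bounded drift delta of the representation
    that increases the safety loss, via projected gradient ascent."""
    delta = np.zeros_like(h)
    for _ in range(steps):
        # Analytic gradient of the logistic loss w.r.t. the representation.
        z = (h + delta) @ w
        p = 1.0 / (1.0 + np.exp(-z))
        grad = (p - y) * w
        delta += lr * grad
        # Project back onto the L2 ball of radius eps.
        n = np.linalg.norm(delta)
        if n > eps:
            delta *= eps / n
    return delta

# Outer minimization: SGD on worst-case-drifted representations.
w = rng.normal(size=DIM) * 0.1
H = rng.normal(size=(64, DIM))          # toy "audio representations"
y = (H[:, 0] > 0).astype(float)         # toy safe/unsafe labels

for epoch in range(50):
    for h_i, y_i in zip(H, y):
        delta = worst_case_drift(w, h_i, y_i)
        z = (h_i + delta) @ w
        p = 1.0 / (1.0 + np.exp(-z))
        w -= 0.1 * (p - y_i) * (h_i + delta)  # gradient step on the robust loss
```

The inner loop plays the role of the advanced jailbreak (the worst drift within an epsilon-ball), and the outer loop plays the role of Rebellion's robust RT; in the real method the perturbed object is the ARM's audio representation and the loss is the safety-reasoning training objective.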


Details

Domains
audio, nlp
Model Types
llm, multimodal
Threat Tags
inference_time, black_box
Datasets
Qwen2-Audio evaluation benchmarks
Applications
audio language models, audio reasoning systems