Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment
Zehua Cheng 1,2, Jianwei Yang 3, Wei Dai 2, Jiahao Sun 2
Published on arXiv
2602.01587
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Reduces GCG Attack Success Rate from 84.2% to 1.2% on Llama-3 while maintaining 94.1% benign utility, vs. 74.3% utility for character-level baselines.
CSS + NAAT (Certified Semantic Smoothing via Stratified Randomized Ablation + Noise-Augmented Alignment Tuning)
Novel technique introduced
Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks such as GCG that easily bypass empirical defenses. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous l0-norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines, which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that the model remains robust against all adversarial variants within a provable radius.
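The hypergeometric guarantee can be made concrete with a short calculation. The sketch below is our own illustration, not the paper's code: `prob_clean_subset` gives the chance that a uniformly sampled ablation of k payload tokens avoids all r adversarial tokens, C(n−r, k)/C(n, k), and `max_certified_radius` finds the largest r for which a measured safe-vote rate `p_safe` still certifies a safe majority, following the standard randomized-ablation bound (the 0.5 majority threshold and all function names are assumptions).

```python
from math import comb

def prob_clean_subset(n: int, k: int, r: int) -> float:
    """Probability that k payload tokens sampled uniformly without
    replacement from n avoid all r adversarial tokens: C(n-r, k) / C(n, k)."""
    if k > n - r:
        return 0.0  # every size-k subset must touch an adversarial token
    return comb(n - r, k) / comb(n, k)

def max_certified_radius(n: int, k: int, p_safe: float) -> int:
    """Largest l0 radius r such that, even if every ablation touching an
    adversarial token votes unsafe, the safe majority still holds:
    p_safe - (1 - C(n-r, k)/C(n, k)) > 0.5."""
    r = 0
    while p_safe - (1.0 - prob_clean_subset(n, k, r + 1)) > 0.5:
        r += 1
    return r

# Example: 100-token payload, 10 tokens kept per ablation, 99% safe votes.
radius = max_certified_radius(100, 10, 0.99)  # certifies r = 6 token edits
```

With these assumed parameters, any attack that edits at most 6 payload tokens cannot flip the ensemble's verdict, which is the sense in which the certificate holds for all adversarial variants within the radius.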
Key Contributions
- Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, which partitions inputs into immutable structural prompts and mutable payloads to derive rigorous l0-norm guarantees using the Hypergeometric distribution
- Noise-Augmented Alignment Tuning (NAAT), a fine-tuning technique that trains the LLM on ablated inputs to become a semantic denoiser, resolving the utility degradation ('inverted scaling fallacy') caused by sparse ablated contexts
- Empirical validation on Llama-3 showing ASR reduction from 84.2% to 1.2% against gradient-based attacks while retaining 94.1% benign utility, far outperforming character-level baselines
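To illustrate the partitioning idea behind CSS and the sparse inputs NAAT trains on, here is a minimal sketch (our own; the mask token `<ABL>` and the keep ratio are assumptions, not from the paper) of stratified ablation that leaves the structural prompt untouched and randomly masks payload tokens:

```python
import random

ABLATION_TOKEN = "<ABL>"  # hypothetical mask token, not named in the paper

def stratified_ablate(structural_prompt: str, payload_tokens: list[str],
                      keep_ratio: float, rng: random.Random) -> str:
    """Stratified randomized ablation: the structural prompt is immutable;
    only a uniform random subset of payload tokens survives, the rest are
    replaced by the ablation token (the sparse context NAAT trains on)."""
    k = max(1, round(keep_ratio * len(payload_tokens)))
    keep = set(rng.sample(range(len(payload_tokens)), k))
    masked = [tok if i in keep else ABLATION_TOKEN
              for i, tok in enumerate(payload_tokens)]
    return f"{structural_prompt} {' '.join(masked)}"

rng = random.Random(0)
example = stratified_ablate(
    "You are a helpful assistant. User:",
    "please summarize this article for me".split(),
    keep_ratio=0.4, rng=rng,
)
```

At inference time CSS would draw many such ablations and aggregate the model's verdicts; during NAAT the same masked strings serve as fine-tuning inputs, so the model learns to answer coherently from sparse context instead of degrading.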
🛡️ Threat Analysis
Defends specifically against optimization-based adversarial suffix attacks (GCG, AutoDAN) — token-level perturbations optimized to bypass safety alignment. Provides certified l0-norm robustness guarantees via randomized smoothing, the canonical ML01 defense paradigm.
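The smoothing ensemble described above can be sketched as a Monte Carlo majority vote. The toy below is our own illustration (the stub classifier and all names are assumptions): many independently ablated copies of the payload are scored, and the resulting safe-vote rate is what would feed the hypergeometric certificate.

```python
import random

ABLATION_TOKEN = "<ABL>"  # hypothetical mask token

def ablate(tokens, keep_ratio, rng):
    """Keep a uniform random subset of tokens; mask the rest."""
    k = max(1, round(keep_ratio * len(tokens)))
    keep = set(rng.sample(range(len(tokens)), k))
    return [t if i in keep else ABLATION_TOKEN for i, t in enumerate(tokens)]

def smoothed_safe_rate(payload_tokens, is_safe, n_samples=200,
                       keep_ratio=0.3, seed=0):
    """Monte Carlo estimate of the smoothed ensemble's safe-vote rate."""
    rng = random.Random(seed)
    votes = sum(is_safe(ablate(payload_tokens, keep_ratio, rng))
                for _ in range(n_samples))
    return votes / n_samples

# Stub safety classifier for illustration: flags an adversarial marker token.
def stub_is_safe(tokens):
    return "ADV_SUFFIX" not in tokens

benign = "tell me about the history of tea".split()
p_benign = smoothed_safe_rate(benign, stub_is_safe)  # 1.0: never flagged
```

A suffix attack only flips the ensemble if its tokens survive ablation in more than half the samples, which is exactly the event the l0 certificate bounds.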