Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment
Zehua Cheng 1,2, Jianwei Yang 3, Wei Dai 2, Jiahao Sun 2
Published on arXiv
2602.01587
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Reduces GCG Attack Success Rate from 84.2% to 1.2% on Llama-3 while maintaining 94.1% benign utility, vs. 74.3% utility for character-level baselines.
CSS + NAAT (Certified Semantic Smoothing via Stratified Randomized Ablation + Noise-Augmented Alignment Tuning)
Novel technique introduced
Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks such as GCG that easily bypass empirical defenses. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous l0-norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines, which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that the model remains robust against all adversarial variants within a provable radius.
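The hypergeometric guarantee can be made concrete with a short calculation. The sketch below is our own illustration, not the paper's code: `prob_clean_subset` gives the chance that a uniformly sampled ablation of k payload tokens avoids all r adversarial tokens, C(n−r, k)/C(n, k), and `max_certified_radius` finds the largest r for which a measured safe-vote rate `p_safe` still certifies a safe majority, following the standard randomized-ablation bound (the 0.5 majority threshold and all function names are assumptions).

```python
from math import comb

def prob_clean_subset(n: int, k: int, r: int) -> float:
    """Probability that k payload tokens sampled uniformly without
    replacement from n avoid all r adversarial tokens: C(n-r, k) / C(n, k)."""
    if k > n - r:
        return 0.0  # every size-k subset must touch an adversarial token
    return comb(n - r, k) / comb(n, k)

def max_certified_radius(n: int, k: int, p_safe: float) -> int:
    """Largest l0 radius r such that, even if every ablation touching an
    adversarial token votes unsafe, the safe majority still holds:
    p_safe - (1 - C(n-r, k)/C(n, k)) > 0.5."""
    r = 0
    while p_safe - (1.0 - prob_clean_subset(n, k, r + 1)) > 0.5:
        r += 1
    return r

# Example: 100-token payload, 10 tokens kept per ablation, 99% safe votes.
radius = max_certified_radius(100, 10, 0.99)  # certifies r = 6 token edits
```

With these assumed parameters, any attack that edits at most 6 payload tokens cannot flip the ensemble's verdict, which is the sense in which the certificate holds for all adversarial variants within the radius.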
Key Contributions
- Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, which partitions inputs into immutable structural prompts and mutable payloads to derive rigorous l0-norm guarantees using the Hypergeometric distribution
- Noise-Augmented Alignment Tuning (NAAT), a fine-tuning technique that trains the LLM on ablated inputs to become a semantic denoiser, resolving the utility degradation ('inverted scaling fallacy') caused by sparse ablated contexts
- Empirical validation on Llama-3 showing ASR reduction from 84.2% to 1.2% against gradient-based attacks while retaining 94.1% benign utility, far outperforming character-level baselines
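To illustrate the partitioning idea behind CSS and the sparse inputs NAAT trains on, here is a minimal sketch (our own; the mask token `<ABL>` and the keep ratio are assumptions, not from the paper) of stratified ablation that leaves the structural prompt untouched and randomly masks payload tokens:

```python
import random

ABLATION_TOKEN = "<ABL>"  # hypothetical mask token, not named in the paper

def stratified_ablate(structural_prompt: str, payload_tokens: list[str],
                      keep_ratio: float, rng: random.Random) -> str:
    """Stratified randomized ablation: the structural prompt is immutable;
    only a uniform random subset of payload tokens survives, the rest are
    replaced by the ablation token (the sparse context NAAT trains on)."""
    k = max(1, round(keep_ratio * len(payload_tokens)))
    keep = set(rng.sample(range(len(payload_tokens)), k))
    masked = [tok if i in keep else ABLATION_TOKEN
              for i, tok in enumerate(payload_tokens)]
    return f"{structural_prompt} {' '.join(masked)}"

rng = random.Random(0)
example = stratified_ablate(
    "You are a helpful assistant. User:",
    "please summarize this article for me".split(),
    keep_ratio=0.4, rng=rng,
)
```

At inference time CSS would draw many such ablations and aggregate the model's verdicts; during NAAT the same masked strings serve as fine-tuning inputs, so the model learns to answer coherently from sparse context instead of degrading.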
🛡️ Threat Analysis
Defends specifically against optimization-based adversarial suffix attacks (GCG, AutoDAN) — token-level perturbations optimized to bypass safety alignment. Provides certified l0-norm robustness guarantees via randomized smoothing, the canonical ML01 defense paradigm.
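The smoothing ensemble described above can be sketched as a Monte Carlo majority vote. The toy below is our own illustration (the stub classifier and all names are assumptions): many independently ablated copies of the payload are scored, and the resulting safe-vote rate is what would feed the hypergeometric certificate.

```python
import random

ABLATION_TOKEN = "<ABL>"  # hypothetical mask token

def ablate(tokens, keep_ratio, rng):
    """Keep a uniform random subset of tokens; mask the rest."""
    k = max(1, round(keep_ratio * len(tokens)))
    keep = set(rng.sample(range(len(tokens)), k))
    return [t if i in keep else ABLATION_TOKEN for i, t in enumerate(tokens)]

def smoothed_safe_rate(payload_tokens, is_safe, n_samples=200,
                       keep_ratio=0.3, seed=0):
    """Monte Carlo estimate of the smoothed ensemble's safe-vote rate."""
    rng = random.Random(seed)
    votes = sum(is_safe(ablate(payload_tokens, keep_ratio, rng))
                for _ in range(n_samples))
    return votes / n_samples

# Stub safety classifier for illustration: flags an adversarial marker token.
def stub_is_safe(tokens):
    return "ADV_SUFFIX" not in tokens

benign = "tell me about the history of tea".split()
p_benign = smoothed_safe_rate(benign, stub_is_safe)  # 1.0: never flagged
```

A suffix attack only flips the ensemble if its tokens survive ablation in more than half the samples, which is exactly the event the l0 certificate bounds.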