
Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment

Zehua Cheng 1,2, Jianwei Yang 3, Wei Dai 2, Jiahao Sun 2

0 citations · 12 references · arXiv (Cornell University)


Published on arXiv: 2602.01587

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces GCG Attack Success Rate from 84.2% to 1.2% on Llama-3 while maintaining 94.1% benign utility, vs. 74.3% utility for character-level baselines.

CSS + NAAT (Certified Semantic Smoothing via Stratified Randomized Ablation + Noise-Augmented Alignment Tuning)

Novel technique introduced


Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks, such as GCG, that easily bypass empirical defenses. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous l0-norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines, which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.
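The ensemble idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`stratified_ablate`, `smoothed_safe_vote`), the `keep_ratio` parameter, and the `judge` callback are all assumptions introduced here to show how stratified ablation plus majority voting would fit together.

```python
import random

def stratified_ablate(structural, payload, keep_ratio=0.2, rng=random):
    # Stratified ablation: the immutable structural prompt is kept whole,
    # while only a random subset of the mutable payload tokens (the part
    # an attacker can edit) survives in each ablated copy.
    k = max(1, int(len(payload) * keep_ratio))
    kept = sorted(rng.sample(range(len(payload)), k))
    return structural + [payload[i] for i in kept]

def smoothed_safe_vote(structural, payload, judge, n_samples=100):
    # Majority vote of a safety judge over many ablated copies. The vote
    # margin, not any single pass, is what the certificate is computed from.
    votes = sum(judge(stratified_ablate(structural, payload))
                for _ in range(n_samples))
    return votes / n_samples
```

In this sketch `judge` stands in for the (hypothetical) safety classifier applied to each ablated input; the returned vote fraction is the empirical quantity the l0-norm certificate is derived from.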


Key Contributions

  • Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, which partitions inputs into immutable structural prompts and mutable payloads to derive rigorous l0-norm guarantees using the Hypergeometric distribution
  • Noise-Augmented Alignment Tuning (NAAT), a fine-tuning technique that trains the LLM on ablated inputs to become a semantic denoiser, resolving the utility degradation ('inverted scaling fallacy') caused by sparse ablated contexts
  • Empirical validation on Llama-3 showing ASR reduction from 84.2% to 1.2% against gradient-based attacks while retaining 94.1% benign utility, far outperforming character-level baselines
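The Hypergeometric certificate mentioned above can be sketched as follows. This is an illustrative reconstruction under standard randomized-ablation assumptions, not the paper's exact bound: `d` is the payload length, `k` the number of retained tokens, `r` the number of adversarially perturbed tokens, and `p_safe` the empirical safe-vote fraction; the specific threshold structure is assumed for illustration.

```python
from math import comb

def overlap_bound(d, k, r):
    # Fraction of size-k ablation sets that touch at least one of the r
    # perturbed payload tokens: 1 - C(d-r, k) / C(d, k), i.e. one minus
    # the hypergeometric probability of zero overlap.
    return 1.0 - comb(d - r, k) / comb(d, k)

def certified_radius(d, k, p_safe):
    # Largest r such that even if every ablation touching a perturbed
    # token flips its vote, the safe majority (p_safe) still exceeds 1/2.
    r = 0
    while r + 1 <= d - k and p_safe - overlap_bound(d, k, r + 1) > 0.5:
        r += 1
    return r
```

For example, with a 100-token payload, 20 retained tokens, and a 90% safe-vote rate, this bound certifies robustness against any perturbation of up to 2 payload tokens; tightening the vote margin or shrinking `k` enlarges the certified radius.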

🛡️ Threat Analysis

Input Manipulation Attack

Defends specifically against gradient-based adversarial suffix attacks (GCG, AutoDAN) — token-level perturbations optimized to bypass safety alignment. Provides certified l0-norm robustness guarantees via randomized smoothing, the canonical ML01 defense paradigm.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, digital
Datasets
AdvBench
Applications
llm safety, jailbreak defense, chatbot content moderation