Defense · 2025

Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Adarsh Kumarappan, Ayushi Mehrotra

1 citation · 12 references · arXiv


Published on arXiv · 2511.18721

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

The (k, ε)-unstable framework yields tighter, more realistic safety certificates for SmoothLLM by modeling observed exponential ASR decay, replacing the overly conservative deterministic guarantee that rarely holds in practice.

(k, ε)-unstable probabilistic certificate

Novel technique introduced


The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict "k-unstable" assumption that rarely holds in practice. This strong assumption limits the trustworthiness of the resulting safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, "(k, $\varepsilon$)-unstable," to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM's defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. The notion of (k, $\varepsilon$)-unstable gives practitioners actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically grounded mechanism to make LLMs more resistant to exploitation of their safety alignment, a critical challenge in secure AI deployment.
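To make the certificate concrete, here is a minimal sketch of how such a probabilistic lower bound could be computed. All modeling choices and parameter values are illustrative assumptions, not the paper's exact derivation: we assume SmoothLLM perturbs each of N prompt copies by flipping each of the m adversarial-suffix characters independently with probability q and takes a majority vote, and that under a (k, ε)-unstable assumption an attack survives a copy only if fewer than k suffix characters are perturbed, except with probability at most ε.

```python
from math import comb

def binom_cdf(n, p, k_exclusive):
    """P(Binom(n, p) < k_exclusive)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k_exclusive))

def defense_lower_bound(m, q, k, eps, N):
    """Hypothetical lower bound on SmoothLLM's defense probability.

    Per-copy attack success is bounded by
        p_copy <= P(Binom(m, q) < k) + eps,
    i.e. the attack survives if too few suffix characters were
    perturbed, plus the eps slack from (k, eps)-instability.
    """
    p_copy = min(1.0, binom_cdf(m, q, k) + eps)
    # Majority vote over N copies: defended when at most N//2
    # copies are successfully attacked.
    return binom_cdf(N, p_copy, N // 2 + 1)

# Illustrative parameters: suffix length m=20, perturbation rate
# q=0.1, instability threshold k=2, slack eps=0.05, N=7 copies.
bound = defense_lower_bound(20, 0.1, 2, 0.05, 7)
```

Setting ε = 0 recovers a deterministic k-unstable-style guarantee; a positive ε trades a slightly weaker per-copy bound for an assumption that actually holds on observed attacks.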


Key Contributions

  • Introduces the (k, ε)-unstable probabilistic assumption to relax the overly strict deterministic k-unstable assumption underlying SmoothLLM's certificate
  • Derives new data-informed lower bounds on SmoothLLM's defense probability by modeling empirically observed exponential decay of attack success rates under character perturbation
  • Provides practitioners with actionable, evidence-based certification thresholds applicable to both gradient-based (GCG) and semantic (PAIR) jailbreak attacks
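The second contribution, modeling the empirically observed exponential decay of attack success rate (ASR) with perturbation strength, can be sketched as a simple data-informed fit. The measurements and the target threshold below are illustrative assumptions, not the paper's data: we fit ASR(q) ≈ A·exp(−λq) by ordinary least squares on log(ASR).

```python
import math

def fit_exponential_decay(qs, asrs):
    """Fit log(asr) = log(A) - lam * q by ordinary least squares."""
    ys = [math.log(a) for a in asrs]
    n = len(qs)
    mx, my = sum(qs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(qs, ys))
             / sum((x - mx) ** 2 for x in qs))
    return math.exp(my - slope * mx), -slope  # (A, lam)

def rate_for_target_asr(A, lam, target):
    """Smallest perturbation rate q with A * exp(-lam * q) <= target."""
    return math.log(A / target) / lam

# Illustrative ASR measurements at perturbation rates 0%..20%:
qs = [0.00, 0.05, 0.10, 0.15, 0.20]
asrs = [0.80, 0.45, 0.26, 0.15, 0.08]
A, lam = fit_exponential_decay(qs, asrs)
q_star = rate_for_target_asr(A, lam, 0.05)  # rate for <= 5% ASR
```

A practitioner could use such a fit to pick a certification threshold: the fitted decay constant λ indicates how aggressively perturbation must be applied before the modeled ASR falls below an acceptable level.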

🛡️ Threat Analysis

Input Manipulation Attack

The paper develops a certified robustness framework for SmoothLLM against jailbreak attacks, from gradient-based adversarial-suffix attacks (GCG) to semantic attacks (PAIR), deriving data-informed lower bounds on the defense probability. This constitutes a certified robustness defense against adversarial input manipulation at inference time.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time
Datasets
GCG attack benchmarks, PAIR attack benchmarks
Applications
llm safety certification, jailbreak defense