defense 2025

Speculative Safety-Aware Decoding

Xuekang Wang¹, Shengyu Zhu², Xueqi Cheng²

Published on arXiv: 2508.17739

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

SSD equips LLMs with a desired safety property against jailbreak attacks without any parameter tuning, preserving utility on benign queries while accelerating inference through speculative sampling.

Speculative Safety-Aware Decoding (SSD)

Novel technique introduced


Despite extensive efforts to align Large Language Models (LLMs) with human values and safety rules, jailbreak attacks that exploit certain vulnerabilities continuously emerge, highlighting the need to strengthen existing LLMs with additional safety properties to defend against these attacks. However, tuning large models has become increasingly resource-intensive and may have difficulty ensuring consistent performance. We introduce Speculative Safety-Aware Decoding (SSD), a lightweight decoding-time approach that equips LLMs with the desired safety property while accelerating inference. We assume that there exists a small language model that possesses this desired property. SSD integrates speculative sampling during decoding and leverages the match ratio between the small and composite models to quantify jailbreak risks. This enables SSD to dynamically switch between decoding schemes, prioritizing utility or safety as needed and handling the challenge of mismatched model capacities. The output token is then sampled from a new distribution that combines the distributions of the original and the small models. Experimental results show that SSD successfully equips the large model with the desired safety property while keeping the model helpful on benign queries. Furthermore, SSD accelerates inference, thanks to the speculative sampling design.
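To make the match-ratio idea concrete, here is a minimal pure-Python sketch of the standard speculative-sampling acceptance rule, with the acceptance fraction reused as a risk signal. This is an illustrative assumption about the mechanism, not the paper's implementation: the function name, the list-of-distributions representation, and the min-ratio acceptance probability are all placeholders for whatever SSD actually uses.

```python
import random

def match_ratio(p_draft, p_target, draft_tokens):
    """Fraction of draft tokens the large (target) model accepts.

    Standard speculative-sampling acceptance: a draft token t is kept
    with probability min(1, p_target(t) / p_draft(t)).  When the draft
    model is safety-aligned, a low match ratio means the two models
    disagree, which SSD treats as a jailbreak-risk signal.

    p_draft, p_target: per-step token distributions (list of lists),
    draft_tokens: token ids proposed by the small model.
    These names and shapes are illustrative assumptions.
    """
    accepted = 0
    for step, tok in enumerate(draft_tokens):
        ratio = p_target[step][tok] / p_draft[step][tok]
        if random.random() < min(1.0, ratio):
            accepted += 1
    return accepted / len(draft_tokens)
```

When the two models produce identical distributions, every draft token is accepted and the ratio is 1.0; the further the large model drifts from the safety-aligned draft model, the lower the ratio falls.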


Key Contributions

  • Speculative Safety-Aware Decoding (SSD): a tuning-free, decoding-time jailbreak defense that uses a small safety-aligned draft model to guide the large model's output distribution
  • A match-ratio metric between the small and large model token distributions to dynamically quantify jailbreak risk and switch between Intersection (utility) and Union (safety) decoding schemes
  • Simultaneous inference acceleration via speculative sampling, reducing compute overhead compared to prior decoding-time defenses
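The switching logic above can be sketched as follows. Note the hedging: the paper names Intersection and Union decoding schemes but this summary does not give their formulas, so the normalized product (both models must agree) and additive mixture (either model can veto or steer) below are plausible interpretations only, and the `threshold` hyperparameter is hypothetical.

```python
def combine(p_large, p_small, mode):
    """Combine the large and small models' token distributions.

    'intersection' -> normalized elementwise product (utility-leaning:
    a token needs support from both models).
    'union' -> additive mixture (safety-leaning: the safety-aligned
    small model retains influence even where the large model is confident).
    Both formulas are illustrative assumptions, not the paper's exact schemes.
    """
    if mode == "intersection":
        mixed = [a * b for a, b in zip(p_large, p_small)]
    else:  # "union"
        mixed = [0.5 * a + 0.5 * b for a, b in zip(p_large, p_small)]
    z = sum(mixed)
    return [m / z for m in mixed]

def choose_scheme(match_ratio, threshold=0.5):
    """Low agreement with the safety-aligned draft model -> risky query.

    threshold is a hypothetical hyperparameter for illustration.
    """
    return "union" if match_ratio < threshold else "intersection"
```

In this reading, benign queries keep a high match ratio, so decoding stays in the utility-preserving Intersection scheme; a jailbreak attempt drives the ratio down and flips decoding to the safety-preserving Union scheme.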

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
AdvBench
Applications
llm safety alignment, jailbreak defense, chatbot safety