defense 2025

ASGuard: Activation-Scaling Guard to Mitigate Targeted Jailbreaking Attack

Yein Park 1,2, Jungwoo Park 1,2, Jaewoo Kang 1,2

0 citations · 66 references · arXiv

α

Published on arXiv

2509.25843

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ASGuard significantly reduces tense-jailbreaking attack success rate across three LLMs while preserving general capabilities and minimizing over-refusal, achieving a Pareto-optimal safety-utility trade-off.

ASGuard (Activation-Scaling Guard)

Novel technique introduced


Large language models (LLMs), despite being safety-aligned, exhibit brittle refusal behaviors that can be circumvented by simple linguistic changes. As tense jailbreaking demonstrates that models refusing harmful requests often comply when rephrased in past tense, a critical generalization gap is revealed in current alignment methods whose underlying mechanisms are poorly understood. In this work, we introduce Activation-Scaling Guard (ASGuard), an insightful, mechanistically-informed framework that surgically mitigates this specific vulnerability. For the first step, we use circuit analysis to identify the specific attention heads causally linked to the targeted jailbreaking, the tense-changing attack. Second, we train a precise, channel-wise scaling vector to recalibrate the activation of tense vulnerable heads. Lastly, we apply it into a "preventative fine-tuning", forcing the model to learn a more robust refusal mechanism. Across three LLMs, ASGuard effectively reduces the attack success rate of targeted jailbreaking while preserving general capabilities and minimizing over refusal, achieving a Pareto-optimal balance between safety and utility. Our findings underscore how adversarial suffixes suppress the propagation of the refusal-mediating direction, based on mechanistic analysis. Furthermore, our work showcases how a deep understanding of model internals can be leveraged to develop practical, efficient, and targeted methods for adjusting model behavior, charting a course for more reliable and interpretable AI safety.


Key Contributions

  • Circuit analysis identifying specific attention heads causally linked to tense-jailbreaking vulnerability in LLMs
  • Channel-wise activation-scaling vector that recalibrates vulnerable attention heads without full model retraining
  • Preventative fine-tuning framework (ASGuard) that achieves Pareto-optimal balance between safety and utility across three LLMs

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_timewhite_box
Datasets
AdvBench
Applications
llm safety alignmentjailbreak defensechatbot safety