The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs
Yonghong Deng, Zhen Yang, Ping Jian, Xinyue Zhang, Zhongbin Guo, Chengzhi Li
Published on arXiv
2603.08234
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Simply relocating a continuation-triggered instruction suffix outside the prompt boundary raises the Attack Success Rate (ASR) from 0 to as high as 0.58 on LLaMA-2-7B-Chat; the effect is explained mechanistically by competition between safety heads and continuation heads
Continuation-Triggered Jailbreak
Novel technique introduced
With the rapid advancement of large language models (LLMs), their safety has become a critical concern. Despite significant efforts in safety alignment, current LLMs remain vulnerable to jailbreaking attacks, and the root causes of these vulnerabilities are still poorly understood, making a rigorous investigation of jailbreak mechanisms important to both academic and industrial communities. In this work, we focus on a continuation-triggered jailbreak phenomenon, whereby simply relocating a continuation-triggered instruction suffix can substantially increase jailbreak success rates. To uncover the intrinsic mechanisms of this phenomenon, we conduct a comprehensive mechanistic interpretability analysis at the level of attention heads. Through causal interventions and activation scaling, we show that this jailbreak behavior primarily arises from an inherent competition between the model's intrinsic continuation drive and the safety defenses acquired through alignment training. Furthermore, we perform a detailed behavioral analysis of the identified safety-critical attention heads, revealing notable differences in the functions and behaviors of safety heads across model architectures. These findings provide a novel mechanistic perspective for understanding and interpreting jailbreak behaviors in LLMs, offering both theoretical insights and practical implications for improving model safety.
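The competition the abstract describes can be illustrated with a toy activation-scaling intervention. This is a minimal sketch under invented assumptions (a 4-dimensional residual stream, a 2-token vocabulary, and hand-picked head outputs), not the paper's actual code or model: a "continuation head" and a "safety head" each write a vector into the residual stream, and scaling the safety head's contribution by a factor alpha decides which drive wins at the next-token readout.

```python
import numpy as np

# Toy illustration (not the paper's code): two attention heads write into
# the residual stream. The "continuation head" pushes toward the token
# that keeps the completion going; the "safety head" pushes toward a
# refusal token. Scaling the safety head's output by alpha mimics an
# activation-scaling intervention.

# Hypothetical unembedding: 4-dim residual stream -> 2-token vocab,
# where token 0 = "continue" and token 1 = "refuse".
W_U = np.array([[ 1.0, -1.0],
                [ 0.5,  0.2],
                [-0.8,  1.2],
                [ 0.1,  0.1]])

continuation_out = np.array([1.2, 0.4, -0.2, 0.0])  # favors "continue"
safety_out       = np.array([-0.5, 0.0, 1.0, 0.0])  # favors "refuse"

def next_token_logits(resid):
    """Project the residual stream through the toy unembedding."""
    return resid @ W_U

def intervened_choice(alpha):
    """Scale the safety head's contribution; return the argmax token."""
    resid = continuation_out + alpha * safety_out
    return int(np.argmax(next_token_logits(resid)))

# Suppressing the safety head (alpha = 0) lets the continuation drive
# win; amplifying it flips the model to refusal.
print(intervened_choice(0.0))  # 0 -> continuation wins
print(intervened_choice(3.0))  # 1 -> refusal wins
```

The point of the sketch is only the mechanism: whichever head's contribution dominates the residual stream at the readout determines whether the model continues or refuses, which is why scaling a single head's activation can flip the behavior.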
Key Contributions
- First mechanistic interpretability analysis of the continuation-triggered jailbreak phenomenon, identifying safety-critical and continuation-critical attention heads via path patching
- Demonstrates through causal interventions and activation scaling that jailbreak success arises from an inherent competition between the model's continuation drive (pre-training objective) and safety alignment (RLHF/DPO)
- Reveals architectural differences in safety head behavior across LLaMA-2-7B-Chat and Qwen2.5-7B-Instruct, offering actionable insights for more robust alignment
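The path-patching procedure named in the first contribution can be sketched in miniature. The following is a hedged toy version, not the paper's implementation: the "model" is just a weighted sum of per-head activations, and the head weights, inputs, and readout are all invented. The logic, however, is the standard one: run the model on a clean input and a corrupted input, splice one head's clean activation into the corrupted run, and flag heads whose patch moves the output metric most as causally critical.

```python
import numpy as np

# Minimal path-patching sketch (toy model, invented numbers). Each head
# produces a scalar activation; the model's output is their sum.

N_HEADS = 8
HEAD_WEIGHTS = np.linspace(0.1, 0.8, N_HEADS)  # hypothetical head gains

def head_activations(x):
    """Toy per-head activations for a scalar input."""
    return HEAD_WEIGHTS * x

def model_output(acts):
    """Toy readout: sum of head activations."""
    return acts.sum()

clean_acts = head_activations(1.0)     # run with the jailbreak suffix
corrupt_acts = head_activations(-1.0)  # run on the baseline prompt
base = model_output(corrupt_acts)

# Patch each head individually and record the effect on the output.
effects = []
for h in range(N_HEADS):
    patched = corrupt_acts.copy()
    patched[h] = clean_acts[h]         # splice in the clean activation
    effects.append(model_output(patched) - base)

# Heads with the largest absolute effect are the causally critical ones.
critical = int(np.argmax(np.abs(effects)))
print(critical)  # 7 -> the highest-gain head has the largest effect
```

In the real analysis the activations are attention-head outputs at specific layers and positions and the metric is refusal-related, but the ranking-by-causal-effect structure is the same.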