
ShallowJail: Steering Jailbreaks against Large Language Models

Shang Liu, Hanyu Pei, Zeyan Liu

0 citations · 43 references · arXiv (Cornell University)


Published on arXiv · 2602.07107

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

ShallowJail achieves an attack success rate exceeding 90% on Qwen2.5-7B-Instruct by exploiting shallow safety alignment via inference-time activation steering, without requiring gradient computation or manual prompt engineering.

ShallowJail

Novel technique introduced


Large Language Models (LLMs) have been successful in numerous fields. Alignment is typically applied to prevent their use for harmful purposes. However, aligned LLMs remain vulnerable to jailbreak attacks that deliberately mislead them into producing harmful outputs. Existing jailbreaks are either black-box, relying on carefully crafted but unstealthy prompts, or white-box, requiring resource-intensive computation. In light of these challenges, we introduce ShallowJail, a novel attack that exploits shallow alignment in LLMs. ShallowJail can misguide LLMs' responses by manipulating the initial tokens during inference. Through extensive experiments, we demonstrate the effectiveness of ShallowJail, which substantially degrades the safety of state-of-the-art LLMs' responses. Our code is available at https://github.com/liuup/ShallowJail.


Key Contributions

  • Re-examines shallow safety alignment and demonstrates it can be exploited: safety mechanisms over-rely on initial generated tokens, creating a steerability vulnerability.
  • Proposes a task-agnostic two-stage attack: (1) compute compliance-inducing activation steering vectors offline, then (2) inject them into the model's hidden states during inference to bypass refusal behavior.
  • Achieves >90% attack success rate on Qwen2.5-7B-Instruct and demonstrates broad effectiveness across state-of-the-art LLMs without additional training or gradient optimization.
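The two-stage attack above can be sketched in a few lines. This is an illustrative toy only: the difference-of-means construction of the steering vector, the function names, and the choice to steer only the first few token positions are our assumptions about how such an attack is typically implemented, not the paper's exact method.

```python
import numpy as np

def compute_steering_vector(compliance_acts, refusal_acts):
    """Stage 1 (offline): build a compliance-inducing steering vector.

    compliance_acts, refusal_acts: arrays of shape (n_prompts, d_model)
    holding hidden activations at a chosen layer for prompts the model
    complies with vs. refuses. The difference of means points from the
    "refuse" region toward the "comply" region of activation space.
    """
    v = compliance_acts.mean(axis=0) - refusal_acts.mean(axis=0)
    return v / np.linalg.norm(v)  # unit-normalise for a stable scale

def steer_hidden_states(hidden, vector, alpha=4.0, n_initial_tokens=5):
    """Stage 2 (inference): inject the vector into the hidden states of
    the first few generated tokens, exploiting shallow alignment's
    over-reliance on initial tokens.

    hidden: (seq_len, d_model) hidden states at the chosen layer.
    """
    steered = hidden.copy()
    k = min(n_initial_tokens, steered.shape[0])
    steered[:k] += alpha * vector  # later positions are left untouched
    return steered

# Toy demo with synthetic activations (d_model = 8).
rng = np.random.default_rng(0)
comp = rng.normal(0.5, 1.0, size=(16, 8))   # stand-in "compliance" activations
refu = rng.normal(-0.5, 1.0, size=(16, 8))  # stand-in "refusal" activations
v = compute_steering_vector(comp, refu)
h = rng.normal(size=(10, 8))
h2 = steer_hidden_states(h, v)
print(np.allclose(h2[5:], h[5:]))  # True: only initial tokens are steered
```

In a real white-box setting the same injection would be done with a forward hook on the target layer during generation, which is what makes the attack gradient-free: the vector is computed once offline and simply added at inference time.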

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
AdvBench
Applications
llm chatbots, ai assistants, aligned language models