defense 2026

SafeDream: Safety World Model for Proactive Early Jailbreak Detection

Bo Yan 1, Weikai Lin 2, Yada Zhu 3, Song Wang 1

Published on arXiv: 2604.16824

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Detects attacks 1.06-1.20 turns before the LLM complies (the detection lead) on all three benchmarks, while maintaining competitive false positive rates and outperforming 8 baselines in detection quality

SafeDream

Novel technique introduced


Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SafeDream, a lightweight world-model-based framework that operates as an external module without modifying the LLM's weights. SafeDream introduces three components: (1) a safety state world model that encodes LLM hidden states into a compact safety representation and predicts how it evolves across turns, (2) CUSUM detection that accumulates weak per-turn risk signals into reliable evidence, and (3) contrastive imagination that simultaneously rolls out attack and benign futures in latent space to issue early alarms before jailbreaks occur. On three multi-turn jailbreak benchmarks (XGuard-Train, SafeDialBench, SafeMTData) against 8 baselines, SafeDream achieves the best detection timeliness across all benchmarks (1.06-1.20 turns before compliance) while maintaining competitive false positive rates and outperforming baselines in detection quality.
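
To illustrate how CUSUM can turn weak per-turn risk signals into a reliable alarm, here is a minimal sketch of a standard one-sided CUSUM detector. It is not the authors' implementation: the function name, the per-turn risk scores, and the `drift` and `threshold` values are all illustrative assumptions, and the paper may use a different statistic or calibration.

```python
from typing import Iterable, Optional


def cusum_alarm_turn(
    risks: Iterable[float],
    drift: float = 0.3,       # assumed allowance subtracted each turn (hypothetical value)
    threshold: float = 1.0,   # assumed alarm threshold (hypothetical value)
) -> Optional[int]:
    """Return the 1-indexed turn at which the alarm first fires, or None.

    Standard one-sided CUSUM: S_t = max(0, S_{t-1} + r_t - drift),
    with an alarm as soon as S_t >= threshold.
    """
    s = 0.0
    for turn, r in enumerate(risks, start=1):
        # Accumulate evidence above the drift; never let it go negative.
        s = max(0.0, s + r - drift)
        if s >= threshold:
            return turn
    return None


# Hypothetical per-turn risk scores for an escalating conversation.
# No single score reaches the threshold, but the accumulated evidence does.
escalating = [0.2, 0.5, 0.7, 0.9]
# Hypothetical scores for a benign conversation that stays near the drift.
benign = [0.1, 0.2, 0.1, 0.3]

print(cusum_alarm_turn(escalating))
print(cusum_alarm_turn(benign))
```

With these made-up scores, the escalating conversation raises an alarm on a turn whose individual score is still below the threshold, while the benign one never does. That is the property the abstract attributes to the CUSUM component.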


Key Contributions

  • Formulates proactive early jailbreak detection problem with new 'detection lead' metric measuring how early attacks are detected before LLM compliance
  • Proposes SafeDream, a lightweight external world model that predicts safety state evolution across conversation turns without modifying LLM weights
  • Introduces contrastive imagination mechanism that rolls out attack and benign futures in latent space combined with CUSUM detection for early alarms
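
The detection lead metric named in the first contribution can be sketched as follows. This summary defines it only informally ("how early attacks are detected before LLM compliance"), so the code assumes the simplest reading: lead = compliance turn minus alarm turn, averaged over detected attacks. The function names and the handling of undetected attacks are assumptions, not the paper's specification.

```python
from typing import Iterable, Optional, Tuple


def detection_lead(alarm_turn: Optional[int], compliance_turn: int) -> Optional[int]:
    """Turns between the alarm and the turn where the LLM complies.

    Positive means the alarm fired before compliance; None means no alarm.
    """
    if alarm_turn is None:
        return None
    return compliance_turn - alarm_turn


def mean_detection_lead(
    conversations: Iterable[Tuple[Optional[int], int]]
) -> Optional[float]:
    """Average lead over conversations where an alarm fired.

    Assumption: undetected attacks are skipped rather than penalized.
    """
    leads = [
        lead
        for alarm, comply in conversations
        if (lead := detection_lead(alarm, comply)) is not None
    ]
    return sum(leads) / len(leads) if leads else None


# Hypothetical (alarm_turn, compliance_turn) pairs for three attack conversations.
examples = [(3, 5), (4, 5), (None, 6)]
print(mean_detection_lead(examples))  # (2 + 1) / 2 = 1.5
```

Under this reading, the reported 1.06-1.20 range would mean that alarms fire, on average, a little more than one turn before the model complies.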

Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
XGuard-Train, SafeDialBench, SafeMTData
Applications
llm safety, chatbot security, conversational ai