
Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Wenpeng Xing 1,2, Mohan Li 3, Chunqiang Hu 4, Haitao Xu 2, Ningyu Zhang 2, Bo Lin 1, Meng Han 1,2,5


Published on arXiv: 2508.10029

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

LFJ achieves an average Attack Success Rate of 94.01% across Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, and Mistral-7B-Instruct, significantly outperforming GCG and AutoDAN while evading perplexity-based filters.

Latent Fusion Jailbreak (LFJ)

Novel technique introduced


While Large Language Models (LLMs) have achieved remarkable progress, they remain vulnerable to jailbreak attacks. Existing methods, primarily relying on discrete input optimization (e.g., GCG), often suffer from high computational costs and generate high-perplexity prompts that are easily blocked by simple filters. To overcome these limitations, we propose Latent Fusion Jailbreak (LFJ), a stealthy white-box attack that operates in the continuous latent space. Unlike previous approaches, LFJ constructs adversarial representations by mathematically fusing the hidden states of a harmful query with a thematically similar benign query, effectively masking malicious intent while retaining semantic drive. We further introduce a gradient-guided optimization strategy to balance attack success and computational efficiency. Extensive evaluations on Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, and Mistral-7B-Instruct show that LFJ achieves an average Attack Success Rate (ASR) of 94.01%, significantly outperforming state-of-the-art baselines like GCG and AutoDAN while avoiding detectable input artifacts. Furthermore, we identify that thematic similarity in the latent space is a critical vulnerability in current safety alignments. Finally, we propose a latent adversarial training defense that reduces LFJ's ASR by over 80% without compromising model utility.


Key Contributions

  • Latent Fusion Jailbreak (LFJ): a white-box attack that mathematically fuses hidden states of a harmful query with a thematically similar benign query to mask malicious intent while driving unsafe generation
  • Gradient-guided optimization strategy over continuous latent space that balances attack success rate and computational efficiency while producing no detectable high-perplexity artifacts
  • Latent adversarial training defense that reduces LFJ's ASR by over 80% without degrading model utility, and identification of thematic similarity in latent space as a key vulnerability in safety alignment
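The core fusion step can be illustrated with a small sketch. The paper summary above does not specify the fusion operator, so the linear interpolation below, the array shapes, and the function name `fuse_hidden_states` are all assumptions for illustration; in a real white-box attack, `h_harmful` and `h_benign` would be hidden states extracted from a target model's intermediate layers rather than random arrays.

```python
import numpy as np

def fuse_hidden_states(h_harmful, h_benign, alpha=0.5):
    """Blend two hidden-state arrays by linear interpolation (assumed
    fusion operator; the paper may use a different scheme).

    h_harmful, h_benign: arrays of shape (seq_len, hidden_dim), e.g.
    hidden states of a harmful query and a thematically similar benign
    query at the same layer.
    alpha: fusion weight toward the harmful representation.
    """
    assert h_harmful.shape == h_benign.shape
    return alpha * h_harmful + (1.0 - alpha) * h_benign

# Toy example with tiny random stand-in "hidden states".
rng = np.random.default_rng(0)
h_harm = rng.normal(size=(4, 8))
h_safe = rng.normal(size=(4, 8))
h_fused = fuse_hidden_states(h_harm, h_safe, alpha=0.6)
```

The blended state lives in the same continuous space as ordinary activations, which is why, unlike GCG-style token suffixes, it leaves no high-perplexity artifact in the input text for a filter to catch.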

🛡️ Threat Analysis

Input Manipulation Attack

LFJ uses gradient-guided optimization in the continuous hidden-state space to craft adversarial representations — a gradient-based adversarial attack at inference time analogous to GCG-style suffix optimization but operating on latent representations rather than discrete tokens. The proposed latent adversarial training defense is also an ML01-class countermeasure.
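A toy version of such gradient-guided optimization can be sketched as follows. The true objective (a refusal-suppression loss backpropagated through the target model) is not given in the summary, so the scalar fusion weight, the surrogate loss `lfj_loss`, the "unsafe-compliance direction" `target_dir`, and the stealth penalty `LAM` are all hypothetical stand-ins meant only to show the shape of the procedure.

```python
import numpy as np

LAM = 1.0  # hypothetical penalty weight keeping the blend near 50/50

def lfj_loss(alpha, h_harm, h_safe, target_dir):
    """Surrogate objective: reward alignment of the fused state with a
    stand-in 'unsafe-compliance' direction, and penalize drifting far
    from an even blend (a proxy for the stealth/efficiency trade-off)."""
    fused = alpha * h_harm + (1.0 - alpha) * h_safe
    return -float(fused @ target_dir) + LAM * (alpha - 0.5) ** 2

def optimize_alpha(h_harm, h_safe, target_dir, lr=0.05, steps=50):
    """Gradient descent on the scalar fusion weight alpha in [0, 1]."""
    alpha = 0.5
    for _ in range(steps):
        # Analytic gradient of lfj_loss with respect to alpha.
        grad = -float((h_harm - h_safe) @ target_dir) + 2 * LAM * (alpha - 0.5)
        alpha = float(np.clip(alpha - lr * grad, 0.0, 1.0))
    return alpha

# Toy run with 4-dimensional stand-in hidden states.
h_harm = np.ones(4)
h_safe = np.zeros(4)
target = np.full(4, 0.5)
alpha_star = optimize_alpha(h_harm, h_safe, target)
```

Because the optimization variable is a single continuous weight rather than a discrete token sequence, each step is one gradient evaluation, which is the source of the efficiency advantage over discrete search methods like GCG.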


Details

Domains
nlp
Model Types
llm
Threat Tags
white_box, inference_time, targeted, digital
Datasets
AdvBench
Applications
safety-aligned large language models, instruction-tuned chatbots