Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
Wenpeng Xing 1,2, Mohan Li 3, Chunqiang Hu 4, Haitao Xu 2, Ningyu Zhang 2, Bo Lin 1, Meng Han 1,2,5
Published on arXiv (2508.10029)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
LFJ achieves an average Attack Success Rate of 94.01% across Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, and Mistral-7B-Instruct, significantly outperforming GCG and AutoDAN while evading perplexity-based filters.
Latent Fusion Jailbreak (LFJ)
Novel technique introduced
While Large Language Models (LLMs) have achieved remarkable progress, they remain vulnerable to jailbreak attacks. Existing methods, which primarily rely on discrete input optimization (e.g., GCG), often suffer from high computational costs and generate high-perplexity prompts that are easily blocked by simple filters. To overcome these limitations, we propose Latent Fusion Jailbreak (LFJ), a stealthy white-box attack that operates in the continuous latent space. Unlike previous approaches, LFJ constructs adversarial representations by mathematically fusing the hidden states of a harmful query with those of a thematically similar benign query, masking malicious intent while preserving the semantics that drive unsafe generation. We further introduce a gradient-guided optimization strategy that balances attack success against computational efficiency. Extensive evaluations on Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, and Mistral-7B-Instruct show that LFJ achieves an average Attack Success Rate (ASR) of 94.01%, significantly outperforming state-of-the-art baselines such as GCG and AutoDAN while avoiding detectable input artifacts. Furthermore, we identify thematic similarity in the latent space as a critical vulnerability in current safety alignment. Finally, we propose a latent adversarial training defense that reduces LFJ's ASR by over 80% without compromising model utility.
Key Contributions
- Latent Fusion Jailbreak (LFJ): a white-box attack that mathematically fuses hidden states of a harmful query with a thematically similar benign query to mask malicious intent while driving unsafe generation
- Gradient-guided optimization strategy over continuous latent space that balances attack success rate and computational efficiency while producing no detectable high-perplexity artifacts
- Latent adversarial training defense that reduces LFJ's ASR by over 80% without degrading model utility, and identification of thematic similarity in latent space as a key vulnerability in safety alignment
🛡️ Threat Analysis
LFJ uses gradient-guided optimization in the continuous hidden-state space to craft adversarial representations — a gradient-based adversarial attack at inference time, analogous to GCG-style suffix optimization but operating on latent representations rather than discrete tokens. The proposed latent adversarial training defense is a countermeasure to this ML01-class attack.
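The core fusion idea can be illustrated abstractly: blend the hidden-state activations of two queries by linear interpolation. The sketch below is a minimal, illustrative stand-in only — the paper's exact fusion operator, layer selection, and the `alpha` blending weight are not specified here and are assumptions for demonstration.

```python
def fuse_hidden_states(h_a, h_b, alpha=0.5):
    """Element-wise linear blend of two same-shaped hidden-state vectors.

    Illustrative sketch of latent fusion: alpha weights the first
    representation, (1 - alpha) the second. The real attack would
    operate on per-layer activations inside a transformer.
    """
    assert len(h_a) == len(h_b), "hidden states must share a shape"
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(h_a, h_b)]

# Toy vectors standing in for two queries' activations at one layer
h_query_a = [1.0, 1.0, 1.0, 1.0]
h_query_b = [0.0, 0.0, 0.0, 0.0]
fused = fuse_hidden_states(h_query_a, h_query_b, alpha=0.25)
# -> [0.25, 0.25, 0.25, 0.25]
```

In the paper's setting, a fused representation like this would then be refined by gradient-guided optimization; this sketch only shows the blending step itself.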