Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs
Wenpeng Xing 1,2, Mohan Li 3, Chunqiang Hu 4, Haitao Xu 2, Ningyu Zhang 2, Bo Lin 1, Meng Han 1,2,5
Published on arXiv (2508.10029)
Input Manipulation Attack
OWASP ML Top 10 — ML01
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
LFJ achieves an average Attack Success Rate of 94.01% across Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, and Mistral-7B-Instruct, significantly outperforming GCG and AutoDAN while evading perplexity-based filters.
Latent Fusion Jailbreak (LFJ)
Novel technique introduced
While Large Language Models (LLMs) have achieved remarkable progress, they remain vulnerable to jailbreak attacks. Existing methods, which primarily rely on discrete input optimization (e.g., GCG), often suffer from high computational costs and generate high-perplexity prompts that are easily blocked by simple filters. To overcome these limitations, we propose Latent Fusion Jailbreak (LFJ), a stealthy white-box attack that operates in the continuous latent space. Unlike previous approaches, LFJ constructs adversarial representations by mathematically fusing the hidden states of a harmful query with those of a thematically similar benign query, masking malicious intent while preserving the semantics that drive unsafe generation. We further introduce a gradient-guided optimization strategy that balances attack success against computational efficiency. Extensive evaluations on Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, and Mistral-7B-Instruct show that LFJ achieves an average Attack Success Rate (ASR) of 94.01%, significantly outperforming state-of-the-art baselines such as GCG and AutoDAN while avoiding detectable input artifacts. Furthermore, we identify thematic similarity in the latent space as a critical vulnerability in current safety alignment. Finally, we propose a latent adversarial training defense that reduces LFJ's ASR by over 80% without compromising model utility.
Key Contributions
- Latent Fusion Jailbreak (LFJ): a white-box attack that mathematically fuses hidden states of a harmful query with a thematically similar benign query to mask malicious intent while driving unsafe generation
- Gradient-guided optimization strategy over continuous latent space that balances attack success rate and computational efficiency while producing no detectable high-perplexity artifacts
- Latent adversarial training defense that reduces LFJ's ASR by over 80% without degrading model utility, and identification of thematic similarity in latent space as a key vulnerability in safety alignment
🛡️ Threat Analysis
LFJ uses gradient-guided optimization in the continuous hidden-state space to craft adversarial representations — a gradient-based adversarial attack at inference time, analogous to GCG-style suffix optimization but operating on latent representations rather than discrete tokens. The proposed latent adversarial training defense is a countermeasure to this ML01-class attack.
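The core fusion idea can be illustrated abstractly: blend the hidden-state activations of two queries by linear interpolation. The sketch below is a minimal, illustrative stand-in only — the paper's exact fusion operator, layer selection, and the `alpha` blending weight are not specified here and are assumptions for demonstration.

```python
def fuse_hidden_states(h_a, h_b, alpha=0.5):
    """Element-wise linear blend of two same-shaped hidden-state vectors.

    Illustrative sketch of latent fusion: alpha weights the first
    representation, (1 - alpha) the second. The real attack would
    operate on per-layer activations inside a transformer.
    """
    assert len(h_a) == len(h_b), "hidden states must share a shape"
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(h_a, h_b)]

# Toy vectors standing in for two queries' activations at one layer
h_query_a = [1.0, 1.0, 1.0, 1.0]
h_query_b = [0.0, 0.0, 0.0, 0.0]
fused = fuse_hidden_states(h_query_a, h_query_b, alpha=0.25)
# -> [0.25, 0.25, 0.25, 0.25]
```

In the paper's setting, a fused representation like this would then be refined by gradient-guided optimization; this sketch only shows the blending step itself.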