
Published on arXiv

2602.05444

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

CFA² achieves state-of-the-art attack success rates on LLMs while providing a mechanistic causal interpretation of the jailbreaking process

CFA²

Novel technique introduced


Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce the computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
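For context, Pearl's standard front-door adjustment formula — which the abstract says CFA² builds on — identifies the causal effect of a treatment $X$ on an outcome $Y$ through a mediator $M$ even when $X$ and $Y$ share an unobserved confounder (here, the safety mechanism). In standard notation:

$$
P(y \mid do(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
$$

The double marginalization over $m$ and $x'$ is what makes this expensive; the paper's claimed contribution is collapsing it to a single deterministic intervention on the mediator.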


Key Contributions

  • Models LLM safety alignment as an unobserved confounder in a causal graph and applies Pearl's Front-Door Criterion to sever confounding associations during jailbreaking
  • Uses Sparse Autoencoders (SAEs) to physically strip defense-related internal features, isolating core task intent without gradient-based perturbations
  • Reduces computationally expensive causal marginalization to a deterministic intervention, achieving state-of-the-art attack success rates with low inference complexity
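The SAE feature-stripping step in the second bullet can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the SAE weights are random, and the defense-related feature indices are hypothetical placeholders. A real attack would use a trained SAE on a specific model layer and feature indices identified as defense-related.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64                        # toy dimensions

W_enc = rng.standard_normal((d_model, d_sae))  # toy SAE encoder weights
W_dec = rng.standard_normal((d_sae, d_model))  # toy SAE decoder weights
b_enc = np.zeros(d_sae)

def ablate_defense_features(h, defense_idx):
    """Strip hypothetical defense-related SAE features from activation h."""
    f = np.maximum(h @ W_enc + b_enc, 0.0)     # ReLU sparse feature activations
    f[defense_idx] = 0.0                       # zero out the defense features
    return f @ W_dec                           # decode back to the residual stream

h = rng.standard_normal(d_model)               # a residual-stream activation
defense_idx = [3, 17, 42]                      # hypothetical feature indices
h_clean = ablate_defense_features(h, defense_idx)
print(h_clean.shape)                           # (16,)
```

The key design point is that ablation happens in the SAE's sparse feature basis rather than on the raw activation, so the intervention is a deterministic edit of named features instead of a gradient-based perturbation.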

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Applications
llm safety alignment bypass, chatbot jailbreaking