Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs
Yao Zhou 1,2, Zeen Song 1,2, Wenwen Qiang 1,2, Fengge Wu 1,2, Shuyi Zhou 2,3, Changwen Zheng 1,2, Hui Xiong 4
1 Institute of Software, Chinese Academy of Sciences
2 University of Chinese Academy of Sciences
3 Institute of Information Engineering, Chinese Academy of Sciences
Published on arXiv
2602.05444
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
CFA² achieves state-of-the-art attack success rates on LLMs while providing a mechanistic causal interpretation of the jailbreaking process
CFA²
Novel technique introduced
Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states that obscure the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA²), a jailbreak framework that leverages Pearl's Front-Door Criterion to sever these confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent, and we reduce the computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
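The front-door adjustment underlying CFA² can be illustrated on a toy discrete causal model. The sketch below is not the paper's method; the distributions, variable names, and binary domains are invented for demonstration. It checks that the front-door formula, computed only from observational quantities, recovers the true interventional effect despite the hidden confounder U (playing the role of the latent safety state):

```python
import numpy as np

# Toy illustration of Pearl's front-door adjustment (all distributions here
# are made up, not from the paper). X: raw prompt, M: mediator carrying the
# isolated task intent, Y: response, U: unobserved safety state confounding
# X and Y. Front-door formula:
#   P(y | do(x)) = sum_m P(m | x) * sum_x' P(y | m, x') * P(x')

p_u = np.array([0.6, 0.4])                      # P(U)
p_x_given_u = np.array([[0.7, 0.3],             # P(X | U), rows index U
                        [0.2, 0.8]])
p_m_given_x = np.array([[0.9, 0.1],             # P(M | X), rows index X
                        [0.15, 0.85]])
p_y1_given_mu = np.array([[0.2, 0.6],           # P(Y=1 | M, U), rows index M
                          [0.7, 0.9]])

def p_y1_do_x_truth(x):
    """Interventional P(Y=1 | do(X=x)), computed with oracle access to U."""
    return sum(p_u[u] * p_m_given_x[x, m] * p_y1_given_mu[m, u]
               for u in range(2) for m in range(2))

def p_y1_do_x_frontdoor(x):
    """Same quantity from purely observational distributions (U unseen)."""
    p_x = p_u @ p_x_given_u                     # P(X), shape (2,)
    # P(U | X) via Bayes' rule, used only to build observational P(Y | M, X).
    p_u_given_x = (p_x_given_u * p_u[:, None]).T / p_x[:, None]   # (X, U)
    # P(Y=1 | M, X) = sum_u P(U=u | X) P(Y=1 | M, u)   (M independent of U given X)
    p_y1_given_mx = p_u_given_x @ p_y1_given_mu.T                 # (X, M)
    return sum(p_m_given_x[x, m] * p_x[xp] * p_y1_given_mx[xp, m]
               for m in range(2) for xp in range(2))

for x in (0, 1):
    assert np.isclose(p_y1_do_x_truth(x), p_y1_do_x_frontdoor(x))
```

Because X affects Y only through M, and M is shielded from U given X, the front-door estimate matches the oracle interventional distribution exactly, which is what lets the attack bypass the unobserved safety confounder.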
Key Contributions
- Models LLM safety alignment as an unobserved confounder in a causal graph and applies Pearl's Front-Door Criterion to sever confounding associations during jailbreaking
- Uses Sparse Autoencoders (SAEs) to physically strip defense-related internal features, isolating core task intent without gradient-based perturbations
- Reduces computationally expensive causal marginalization to a deterministic intervention, achieving state-of-the-art attack success rates with low inference complexity
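The SAE-based feature stripping in the second contribution can be sketched as follows. Everything here is hypothetical: the weights are random stand-ins, and the defense-feature indices are illustrative, since the paper's trained SAE and its identified features are not reproduced. The idea is that an SAE decomposes a residual activation into sparse latents, and subtracting the decoder contribution of the defense-related latents yields a "stripped" activation:

```python
import numpy as np

# Hypothetical sketch of SAE-based feature stripping. Shapes, weights, and
# feature indices are illustrative stand-ins, not the paper's trained SAE.
d_model, d_sae = 16, 64
rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.2, size=(d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(scale=0.2, size=(d_sae, d_model))

def sae_encode(h):
    """ReLU encoder: maps an activation h to a sparse nonnegative code z."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def strip_defense_features(h, defense_idx):
    """Subtract only the defense-related latents' decoder contribution,
    leaving the rest of the activation (the task intent) untouched."""
    z = sae_encode(h)
    defense_component = z[defense_idx] @ W_dec[defense_idx]
    return h - defense_component

h = rng.normal(size=d_model)          # stand-in residual-stream activation
defense_idx = [3, 17, 42]             # hypothetical defense-feature indices
h_stripped = strip_defense_features(h, defense_idx)
```

Because only the named latents' contribution is removed (no gradient-based perturbation), the edit is a deterministic intervention on the activation, consistent with the low-inference-cost claim in the third contribution.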