
Published on arXiv

2602.05444

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

CFA² achieves state-of-the-art attack success rates on LLMs while providing a mechanistic causal interpretation of the jailbreaking process

CFA²

Novel technique introduced


Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. We then propose the Causal Front-Door Adjustment Attack (CFA²) to jailbreak LLMs, a framework that leverages Pearl's Front-Door Criterion to sever confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce the computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA² achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
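For context, Pearl's standard front-door adjustment formula — which the abstract says CFA² builds on — identifies the causal effect of a treatment $X$ on an outcome $Y$ through a mediator $M$ even when $X$ and $Y$ share an unobserved confounder (here, the safety mechanism). In standard notation:

$$
P(y \mid do(x)) = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
$$

The double marginalization over $m$ and $x'$ is what makes this expensive; the paper's claimed contribution is collapsing it to a single deterministic intervention on the mediator.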


Key Contributions

  • Models LLM safety alignment as an unobserved confounder in a causal graph and applies Pearl's Front-Door Criterion to sever confounding associations during jailbreaking
  • Uses Sparse Autoencoders (SAEs) to physically strip defense-related internal features, isolating core task intent without gradient-based perturbations
  • Reduces computationally expensive causal marginalization to a deterministic intervention, achieving state-of-the-art attack success rates with low inference complexity
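The SAE feature-stripping step in the second bullet can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the SAE weights are random, and the defense-related feature indices are hypothetical placeholders. A real attack would use a trained SAE on a specific model layer and feature indices identified as defense-related.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64                        # toy dimensions

W_enc = rng.standard_normal((d_model, d_sae))  # toy SAE encoder weights
W_dec = rng.standard_normal((d_sae, d_model))  # toy SAE decoder weights
b_enc = np.zeros(d_sae)

def ablate_defense_features(h, defense_idx):
    """Strip hypothetical defense-related SAE features from activation h."""
    f = np.maximum(h @ W_enc + b_enc, 0.0)     # ReLU sparse feature activations
    f[defense_idx] = 0.0                       # zero out the defense features
    return f @ W_dec                           # decode back to the residual stream

h = rng.standard_normal(d_model)               # a residual-stream activation
defense_idx = [3, 17, 42]                      # hypothetical feature indices
h_clean = ablate_defense_features(h, defense_idx)
print(h_clean.shape)                           # (16,)
```

The key design point is that ablation happens in the SAE's sparse feature basis rather than on the raw activation, so the intervention is a deterministic edit of named features instead of a gradient-based perturbation.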

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Applications
llm safety alignment bypass, chatbot jailbreaking