
SoK: A Comprehensive Causality Analysis Framework for Large Language Model Security

Wei Zhao, Zhe Li, Jun Sun


Published on arXiv: 2512.04841

  • Model Poisoning (OWASP ML Top 10, ML10)
  • Prompt Injection (OWASP LLM Top 10, LLM01)

Key Finding

Causal features from multi-level analysis achieve 97% detection accuracy for jailbreak/backdoor/fairness tasks and 95% for hallucination, with safety mechanisms concentrated in 1–2% of neurons in early-to-middle transformer layers.

SoK Causality Analysis Framework

Novel technique introduced


Abstract

Large Language Models (LLMs) exhibit remarkable capabilities but remain vulnerable to adversarial manipulations such as jailbreaking, where crafted prompts bypass safety mechanisms. Understanding the causal factors behind such vulnerabilities is essential for building reliable defenses. In this work, we introduce a unified causality analysis framework that systematically supports all levels of causal investigation in LLMs, ranging from token-level, neuron-level, and layer-level interventions to representation-level analysis. The framework enables consistent experimentation and comparison across diverse causality-based attack and defense methods. Accompanying this implementation, we provide the first comprehensive survey of causality-driven jailbreak studies and empirically evaluate the framework on multiple open-weight models and safety-critical benchmarks including jailbreaks, hallucination detection, backdoor identification, and fairness evaluation. Our results reveal that: (1) targeted interventions on causally critical components can reliably modify safety behavior; (2) safety-related mechanisms are highly localized (i.e., concentrated in early-to-middle layers with only 1–2% of neurons exhibiting causal influence); and (3) causal features extracted from our framework achieve over 95% detection accuracy across multiple threat types. By bridging theoretical causality analysis and practical model safety, our framework establishes a reproducible foundation for research on causality-based attacks, interpretability, and robust attack detection and mitigation in LLMs. Code is available at https://github.com/Amadeuszhao/SOK_Casuality.
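The neuron-level interventions the abstract describes can be illustrated with a minimal, self-contained sketch. This is not the paper's implementation: the network, weights, and inputs below are synthetic stand-ins, and the "causal effect" is simply the output change under activation patching (overwriting one hidden neuron with its value from a contrasting run).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer network standing in for a transformer block.
# Assumption: all weights are random illustrative values, not from any real LLM.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 1))

def forward(x, patch=None):
    """Run the toy network; optionally overwrite one hidden neuron
    (a neuron-level causal intervention, a.k.a. activation patching)."""
    h = np.tanh(x @ W1)
    if patch is not None:
        idx, value = patch
        h = h.copy()
        h[idx] = value
    return float(h @ W2)

x_harmful = rng.normal(size=4)  # stand-in for a harmful prompt's activations
x_benign = rng.normal(size=4)   # stand-in for a benign prompt's activations
h_benign = np.tanh(x_benign @ W1)

# Causal effect of neuron i: how far does patching it with the benign
# activation move the model's output on the harmful input?
effects = [abs(forward(x_harmful) - forward(x_harmful, patch=(i, h_benign[i])))
           for i in range(8)]
top = int(np.argmax(effects))
print(f"most causally influential neuron: {top}")
```

Ranking neurons by this intervention effect is the basic primitive behind the token-, neuron-, and layer-level analyses the framework unifies.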


Key Contributions

  • First unified causality analysis framework for LLMs spanning token, neuron, layer, and representation levels, enabling reproducible comparison of causality-based attack and defense methods.
  • First comprehensive survey of causality-driven jailbreak attacks and defenses, organizing approaches across four analytical levels.
  • Empirical finding that LLM safety mechanisms are highly localized in early-to-middle layers (layers 2–12) with only 1–2% of neurons causally critical, and that causal features achieve 95–97% detection accuracy across jailbreak, backdoor, and fairness threats.
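The localization finding in the last bullet can be sketched as a thresholding exercise. The per-neuron effect sizes here are simulated (15 "critical" neurons are planted by hand to mirror the reported 1–2% sparsity); only the flagging procedure, not the data, reflects the paper's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate per-neuron causal effect sizes in a 1000-neuron layer.
# Assumption: a small planted subset carries almost all causal influence.
n_neurons = 1000
effects = rng.exponential(scale=0.01, size=n_neurons)   # background noise
critical = rng.choice(n_neurons, size=15, replace=False)
effects[critical] += rng.uniform(1.0, 2.0, size=15)     # strong causal effect

# Localize: flag neurons whose effect exceeds a data-driven threshold.
threshold = effects.mean() + 5 * effects.std()
flagged = np.flatnonzero(effects > threshold)
print(f"flagged {len(flagged)} / {n_neurons} neurons "
      f"({100 * len(flagged) / n_neurons:.1f}%)")
```

With this synthetic data the flagged set recovers exactly the planted 1.5% of neurons, illustrating why a sparse causal signature is easy to localize once effects are measured.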

🛡️ Threat Analysis

Model Poisoning

Backdoor identification is explicitly evaluated as one of the core benchmarks; the causal framework localizes and detects backdoor-related neurons and layers in LLMs, achieving 97% detection accuracy.
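A hedged sketch of how a causal probe can localize a backdoor: compare hidden activations on triggered versus clean inputs and flag the neuron with the largest shift. The activations and the planted neuron index below are synthetic illustrations, not the paper's benchmark data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy backdoor localization. Assumption: a poisoned model routes a trigger
# phrase through one "backdoor neuron"; we plant neuron 37 for illustration.
n_samples, n_neurons = 200, 64
backdoor_neuron = 37  # hypothetical, chosen arbitrarily

clean = rng.normal(size=(n_samples, n_neurons))
triggered = rng.normal(size=(n_samples, n_neurons))
triggered[:, backdoor_neuron] += 4.0  # trigger drives this neuron hard

# Causal feature: per-neuron mean activation shift between conditions.
shift = np.abs(triggered.mean(axis=0) - clean.mean(axis=0))
suspect = int(np.argmax(shift))
print(f"suspected backdoor neuron: {suspect}")
```

The same shift vector can serve as a feature for a downstream detector, which is the kind of causal feature the evaluated benchmarks classify over.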


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, training_time
Datasets & Models
AdvBench, Llama2-7B, Qwen2.5-7B, Llama3.1-8B
Applications
large language model safety, jailbreak detection, backdoor detection, hallucination detection