Published on arXiv

2603.23268

Model Poisoning

OWASP ML Top 10 — ML10

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Identifies a backdoor circuit with 0.42% sparsity that, when ablated, reduces ASR from 100% to 0.4%; identifies an alignment circuit whose removal increases jailbreak ASR from 0.8% to 96.9%

SafeSeek

Novel technique introduced


Mechanistic interpretability reveals that safety-critical behaviors (e.g., alignment, jailbreaks, backdoors) in Large Language Models (LLMs) are grounded in specialized functional components. However, existing safety attribution methods struggle with generalization and reliability due to their reliance on heuristic, domain-specific metrics and search algorithms. To address this, we propose SafeSeek, a unified safety interpretability framework that identifies functionally complete safety circuits in LLMs via optimization. Unlike methods focusing on isolated heads or neurons, SafeSeek introduces differentiable binary masks to extract multi-granular circuits through gradient descent on safety datasets, while integrating Safety Circuit Tuning to leverage these sparse circuits for efficient safety fine-tuning. We validate SafeSeek in two key LLM safety scenarios: (1) backdoor attacks, identifying a backdoor circuit with 0.42% sparsity whose ablation reduces the Attack Success Rate (ASR) from 100% → 0.4% while retaining over 99% general utility; (2) safety alignment, localizing an alignment circuit comprising 3.03% of heads and 0.79% of neurons, whose removal spikes ASR from 0.8% → 96.9%, whereas excluding this circuit during helpfulness fine-tuning maintains 96.5% safety retention.


Key Contributions

  • Unified optimization-based framework (SafeSeek) for identifying multi-granular safety circuits via differentiable binary masks
  • Backdoor circuit detection achieving 0.42% sparsity with complete backdoor elimination (ASR 100% → 0.4%) and 99%+ utility retention
  • Safety Circuit Tuning method enabling helpfulness fine-tuning while maintaining 96.5% safety retention by excluding alignment circuits
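
The core mechanism described above can be illustrated with a toy sketch: learn a relaxed (sigmoid) binary mask over model components by gradient descent, trading reconstruction of a target behavior against a sparsity penalty, then threshold the mask to read off the circuit. The mask parameterization, loss, penalty weight, and "components as rows of a weight matrix" setup below are illustrative assumptions, not the paper's exact formulation.

```python
import torch

torch.manual_seed(0)

# Toy setup: 16 frozen "components" (rows of W); only 3 of them
# actually produce the target behavior we want to localize.
n_components, dim = 16, 8
W = torch.randn(n_components, dim)           # frozen model components
x = torch.randn(32, dim)                     # probe inputs
true_mask = torch.zeros(n_components)
true_mask[[2, 7, 11]] = 1.0
y_target = (x @ W.T) * true_mask             # per-component target contributions

# Differentiable "binary" mask: sigmoid-relaxed, optimized by gradient descent.
logits = torch.zeros(n_components, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)
lam = 1e-2                                   # sparsity penalty weight (assumed)

for _ in range(300):
    m = torch.sigmoid(logits)                # relaxed mask in (0, 1)
    pred = (x @ W.T) * m
    # Match target behavior while pushing the mask toward sparsity.
    loss = ((pred - y_target) ** 2).mean() + lam * m.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Threshold the relaxed mask to extract a discrete, sparse circuit.
circuit = (torch.sigmoid(logits) > 0.5).nonzero().flatten().tolist()
print(circuit)
```

In this toy case the optimization should recover the three planted components; in the paper's setting the masked components are attention heads and neurons, and the recovered sparse circuit is what Safety Circuit Tuning then excludes or ablates.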

🛡️ Threat Analysis

Input Manipulation Attack

Addresses safety alignment and jailbreak resistance by localizing alignment circuits whose removal dramatically increases attack success rates (0.8% → 96.9%).

Model Poisoning

Identifies and ablates backdoor circuits in LLMs, achieving complete backdoor removal (ASR 100% → 0.4%) while preserving model utility.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, inference_time, targeted
Applications
llm safety alignment, backdoor detection and removal, safety-preserving fine-tuning