defense 2026

TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

0 citations · 49 references · arXiv

Published on arXiv

2601.21900

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

TraceRouter significantly outperforms state-of-the-art safety intervention baselines in adversarial robustness while better preserving general model utility across diverse foundation model architectures.

TraceRouter

Novel technique introduced

Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
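Stages (1) and (2) of the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the abstract does not say how attention divergence is measured, so the KL divergence, the threshold, the top-k selection rule, and all function names and toy values below are assumptions chosen for illustration.

```python
import math

# Illustrative sketch only. The divergence measure (KL), the threshold,
# the top-k rule, and the toy data are assumptions, not the paper's definitions.

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def find_onset_layer(attn_harmful, attn_benign, threshold):
    """Return the first layer whose attention divergence exceeds threshold."""
    for layer, (p, q) in enumerate(zip(attn_harmful, attn_benign)):
        if kl_divergence(p, q) > threshold:
            return layer
    return None

def mean_activation(rows):
    """Column-wise mean over a list of equally sized activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def differential_features(harmful_codes, benign_codes, top_k):
    """Rank SAE features by mean activation gap (harmful minus benign)."""
    mean_h = mean_activation(harmful_codes)
    mean_b = mean_activation(benign_codes)
    diff = [h - b for h, b in zip(mean_h, mean_b)]
    ranked = sorted(range(len(diff)), key=lambda i: diff[i], reverse=True)
    return ranked[:top_k], diff

# One toy attention distribution per layer, for harmful and benign prompts.
attn_harmful = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
attn_benign = [[0.5, 0.5], [0.6, 0.4], [0.5, 0.5]]
onset = find_onset_layer(attn_harmful, attn_benign, threshold=0.1)

# Hypothetical SAE codes at the onset layer (rows are prompts, columns are features).
harmful_codes = [[1.0, 0.25, 0.0, 0.75], [0.5, 0.25, 0.0, 0.5]]
benign_codes = [[0.25, 0.25, 0.0, 0.5], [0.25, 0.25, 0.0, 0.5]]
selected, diff = differential_features(harmful_codes, benign_codes, top_k=2)

print(onset, selected, diff)
```

In this toy setup the divergence first crosses the threshold at layer index 2, and features 0 and 3 show the largest activation gap between harmful and benign inputs, so they would be the candidates passed on to the tracing stage.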

Key Contributions

Path-level representation hypothesis: harmful semantics propagate through distributed cross-layer circuits rather than isolated neurons, motivating circuit-level intervention.
TraceRouter 'Discover-Trace-Disconnect' framework using sparse autoencoders (SAEs) and Feature Influence Scores (FIS) to identify and sever causal propagation pathways of illicit semantics.
Universal applicability across diffusion models, LLMs, and multimodal LLMs with superior adversarial robustness–utility trade-off over prior localized-suppression baselines.
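The "Trace" and "Disconnect" steps rest on zero-out interventions. A minimal sketch of that idea on a toy one-layer network follows. The scoring formula (absolute change in each downstream unit), the threshold, and every name and weight here are hypothetical stand-ins; the paper's actual FIS definition may differ.

```python
# Toy sketch of a zero-out feature influence score (FIS).
# The network, the scoring formula, and the threshold are illustrative
# assumptions, not the paper's definitions.

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def downstream(features, W):
    """Hypothetical downstream layer: ReLU(W @ features)."""
    return relu(matvec(W, features))

def feature_influence_scores(features, W, src):
    """Absolute change in each downstream unit when features[src] is zeroed."""
    base = downstream(features, W)
    ablated = list(features)
    ablated[src] = 0.0
    after = downstream(ablated, W)
    return [abs(b - a) for b, a in zip(base, after)]

def suppress_path(features, W, src, threshold):
    """Zero only the downstream units whose influence score exceeds threshold."""
    scores = feature_influence_scores(features, W, src)
    out = downstream(features, W)
    return [0.0 if s > threshold else o for s, o in zip(scores, out)]

# Upstream feature 0 plays the role of a flagged malicious feature.
features = [2.0, 1.0, 0.5]
W = [
    [1.0, 0.0, 0.0],  # unit 0 depends only on feature 0
    [0.0, 1.0, 1.0],  # unit 1 is independent of feature 0
    [0.5, 0.5, 0.0],  # unit 2 mixes features 0 and 1
]
scores = feature_influence_scores(features, W, src=0)
patched = suppress_path(features, W, src=0, threshold=1.5)

print(scores)   # influence of feature 0 on each downstream unit
print(patched)  # downstream activations after suppressing the strong path
```

Here only unit 0, which is driven entirely by the flagged feature, is suppressed; the independent unit 1 and the weakly coupled unit 2 keep their activations, which mirrors the stated goal of leaving orthogonal computation routes intact.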

🛡️ Threat Analysis

Input Manipulation Attack

The framework defends against adversarial manipulation of foundation models — including adversarial prompts and attacks that induce harmful outputs at inference time — across diffusion models, LLMs, and MLLMs. The 'adversarial robustness' evaluation explicitly targets input-level adversarial attacks.

Details

Domains

nlp, vision, multimodal, generative

Model Types

llm, diffusion, vlm

Threat Tags

white_box, inference_time, digital

Datasets

AdvBench

Applications

text generation, image generation, multimodal ai systems, large language models, diffusion models

Similar Papers

GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization

Q-MLLM: Vector Quantization for Robust Multimodal Large Language Model Security

Randomized Smoothing Meets Vision-Language Models

Rethinking Jailbreak Detection of Large Vision Language Models with Representational Contrastive Scoring

Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling

NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation

VALD: Multi-Stage Vision Attack Detection for Efficient LVLM Defense