defense · arXiv · Jan 29, 2026
Chuancheng Shi, Shangze Li, Wenjun Lu et al. · The University of Sydney · Nanjing University of Science and Technology · Nanjing University +1 more
Defends LLMs, diffusion models, and MLLMs from jailbreaks by tracing and severing harmful semantic circuits via sparse autoencoders and causal path analysis
Input Manipulation Attack · Prompt Injection · nlp · vision · multimodal · generative
Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
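For intuition, here is a minimal, self-contained sketch of what stages (2) and (3) could look like in code, assuming residual-stream activations have already been captured at the onset layer. The toy SAE, tensor shapes, the `downstream_fn` readout, and all thresholds are illustrative assumptions, not the paper's implementation.

```python
# Sketch of TraceRouter stages 2-3: differential SAE features and
# feature influence scores (FIS) via zero-out interventions.
# Everything below is a toy stand-in for a real hooked model + trained SAE.
import torch

class ToySAE(torch.nn.Module):
    """Stand-in SAE: d_model -> d_feat sparse code -> d_model reconstruction."""
    def __init__(self, d_model: int, d_feat: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_feat)
        self.dec = torch.nn.Linear(d_feat, d_model)

    def encode(self, h):                 # h: (batch, d_model)
        return torch.relu(self.enc(h))   # non-negative, sparse-ish features

    def decode(self, z):
        return self.dec(z)

def differential_features(sae, h_harmful, h_benign, top_k=16):
    """Stage 2 (sketch): features whose mean activation rises most on
    harmful prompts relative to benign ones."""
    delta = sae.encode(h_harmful).mean(0) - sae.encode(h_benign).mean(0)
    return torch.topk(delta, top_k).indices

def feature_influence_score(sae, h, feat_idx, downstream_fn):
    """Stage 3 (sketch): FIS of one feature via a zero-out intervention --
    how much a downstream readout moves when the feature is ablated."""
    z = sae.encode(h)
    base = downstream_fn(sae.decode(z))
    z_ablated = z.clone()
    z_ablated[:, feat_idx] = 0.0         # sever this feature's contribution
    ablated = downstream_fn(sae.decode(z_ablated))
    return (base - ablated).norm(dim=-1).mean().item()

# Toy usage with random activations standing in for a real layer.
torch.manual_seed(0)
sae = ToySAE(d_model=64, d_feat=512)
h_harm, h_safe = torch.randn(32, 64), torch.randn(32, 64)
feats = differential_features(sae, h_harm, h_safe)
readout = torch.nn.Linear(64, 10)        # hypothetical downstream head
fis = {int(f): feature_influence_score(sae, h_harm, f, readout) for f in feats}
suppress = [f for f, s in sorted(fis.items(), key=lambda kv: -kv[1])[:4]]
print("candidate features to sever:", suppress)
```

The key design point the sketch tries to capture is that suppression targets causal pathways (features ranked by their measured downstream influence), not just features that merely co-activate with harmful prompts.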
llm · diffusion · vlm
defense · arXiv · Jan 30, 2026
Jingtong Dou, Chuancheng Shi, Yemin Wang et al. · The University of Sydney · Xiamen University · Hong Kong Polytechnic University +1 more
Probes latent neurons in pre-trained vision models to detect AI-generated images without costly fine-tuning, outperforming black-box baselines
Output Integrity Attack · vision
As generative AI achieves hyper-realism, superficial artifact detection has become obsolete. Whereas prevailing methods rely on resource-intensive fine-tuning of black-box backbones, we propose that forgery-detection capability is already encoded within pre-trained models and does not require end-to-end retraining. To elicit this intrinsic capability, we propose the discriminative neural anchors (DNA) framework, which employs a coarse-to-fine excavation mechanism. First, by analyzing feature decoupling and attention distribution shifts, we pinpoint the critical intermediate layers where the model's focus transitions from global semantics to local anomalies. Subsequently, we introduce a triadic fusion scoring metric paired with a curvature-truncation strategy to strip away semantic redundancy, precisely isolating the forgery-discriminative units (FDUs) inherently imprinted with sensitivity to forgery traces. Moreover, we introduce HIFI-Gen, a high-fidelity synthetic benchmark built on the latest generative models, to address the lag in existing datasets. Experiments demonstrate that, relying solely on these anchors, DNA achieves superior detection performance even under few-shot conditions. Furthermore, it exhibits remarkable robustness across diverse architectures and against unseen generative models, validating that waking up latent neurons is more effective than extensive fine-tuning.
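For intuition, a minimal sketch of the anchor-probing idea, assuming per-image features have already been extracted from a frozen backbone layer. The three-part unit score and the nearest-centroid few-shot rule below are simplified stand-ins for the paper's triadic fusion scoring and curvature truncation, not its actual metrics.

```python
# Sketch: score individual units of a frozen backbone layer for real-vs-fake
# separability, truncate to a small anchor set, and classify few-shot using
# only those anchors. All scoring formulas here are illustrative assumptions.
import numpy as np

def unit_scores(f_real, f_fake):
    """Fuse three per-unit signals: mean separation, pooled spread, and
    rank separability (a stand-in for the triadic fusion metric)."""
    mu_gap = np.abs(f_real.mean(0) - f_fake.mean(0))
    spread = f_real.std(0) + f_fake.std(0) + 1e-8
    auc = np.array([  # per-unit fraction of (fake > real) pairs
        (f_fake[:, j][:, None] > f_real[:, j][None, :]).mean()
        for j in range(f_real.shape[1])
    ])
    return (mu_gap / spread) * np.abs(auc - 0.5)

def select_anchors(scores, max_units=32):
    """Keep top-scoring units, cutting where the sorted score curve drops
    most steeply (crude stand-in for curvature truncation)."""
    order = np.argsort(scores)[::-1]
    drop = -np.diff(scores[order][:max_units])
    k = int(np.argmax(drop) + 1) if drop.size else max_units
    return order[:max(k, 4)]

def few_shot_detect(anchors, f_real, f_fake, f_query):
    """Nearest-centroid decision using only the anchor units."""
    c_real, c_fake = f_real[:, anchors].mean(0), f_fake[:, anchors].mean(0)
    q = f_query[:, anchors]
    return np.linalg.norm(q - c_fake, axis=1) < np.linalg.norm(q - c_real, axis=1)

# Toy usage: random features standing in for a frozen intermediate layer,
# with 8 units artificially carrying a "forgery trace" shift.
rng = np.random.default_rng(0)
f_real = rng.normal(0.0, 1.0, (16, 256))
f_fake = rng.normal(0.0, 1.0, (16, 256))
f_fake[:, :8] += 1.5
f_query = rng.normal(0.0, 1.0, (4, 256))
f_query[:, :8] += 1.5                      # queries are fakes
scores = unit_scores(f_real, f_fake)
anchors = select_anchors(scores)
print("anchor units:", anchors[:8], "pred fake:", few_shot_detect(anchors, f_real, f_fake, f_query))
```

The point the sketch illustrates is the paper's core claim: with a handful of well-chosen latent units from a frozen model, even a trivially simple classifier can separate real from generated images, so no end-to-end fine-tuning is needed.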
transformer · diffusion