defense arXiv · Aug 15, 2025
Jihang Wang, Dongcheng Zhao, Ruolin Chen et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more
Defends SNNs against adversarial examples by exploiting temporal structure to diversify decision boundaries and suppress cross-timestep vulnerability transfer
Input Manipulation Attack vision
Spiking Neural Networks (SNNs) offer a promising direction for energy-efficient and brain-inspired computing, yet their vulnerability to adversarial perturbations remains poorly understood. In this work, we revisit the adversarial robustness of SNNs through the lens of temporal ensembling, treating the network as a collection of evolving sub-networks across discrete timesteps. This formulation uncovers two critical but underexplored challenges: the fragility of individual temporal sub-networks and the tendency of adversarial vulnerabilities to transfer across time. To overcome these limitations, we propose Robust Temporal self-Ensemble (RTE), a training framework that improves the robustness of each sub-network while reducing the temporal transferability of adversarial perturbations. RTE integrates both objectives into a unified loss and employs a stochastic sampling strategy for efficient optimization. Extensive experiments across multiple benchmarks demonstrate that RTE consistently outperforms existing training methods in the robustness-accuracy trade-off. Additional analyses reveal that RTE reshapes the internal robustness landscape of SNNs, leading to more resilient and temporally diversified decision boundaries. Our study highlights the importance of temporal structure in adversarial learning and offers a principled foundation for building robust spiking models.
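The paper's code is not reproduced here; the PyTorch sketch below is only one plausible reading of the RTE objective under stated assumptions. The `rte_loss` function, the per-timestep cross-entropy term, and the cosine-similarity penalty used to suppress cross-timestep transfer are illustrative stand-ins, not the authors' implementation, and an SNN wrapper returning per-timestep logits of shape [T, B, C] is assumed.

```python
# Hypothetical sketch of an RTE-style loss (not the authors' code).
# Assumes an SNN wrapper that returns per-timestep logits of shape [T, B, C];
# the loss terms and the sampling scheme are illustrative guesses from the abstract.
import torch
import torch.nn.functional as F


def rte_loss(per_step_logits: torch.Tensor, labels: torch.Tensor,
             num_sampled_steps: int = 2, transfer_weight: float = 0.5) -> torch.Tensor:
    """Robust Temporal self-Ensemble loss sketch.

    per_step_logits: [T, B, C] logits of each temporal sub-network.
    labels:          [B] ground-truth class indices.
    """
    T = per_step_logits.shape[0]
    # Stochastic sampling: optimize a few timesteps per batch instead of all T.
    steps = torch.randperm(T)[:num_sampled_steps]

    # (1) Per-sub-network robustness term: cross-entropy on each sampled timestep
    # (in adversarial training, per_step_logits would come from perturbed inputs).
    per_step_ce = torch.stack(
        [F.cross_entropy(per_step_logits[t], labels) for t in steps]
    ).mean()

    # (2) Transfer-suppression term: penalize agreement between sampled
    # sub-networks so their decision boundaries stay temporally diversified.
    probs = F.softmax(per_step_logits[steps], dim=-1)  # [k, B, C]
    similarity, pairs = 0.0, 0
    for i in range(len(steps)):
        for j in range(i + 1, len(steps)):
            similarity = similarity + F.cosine_similarity(probs[i], probs[j], dim=-1).mean()
            pairs += 1
    transfer_term = similarity / max(pairs, 1)

    return per_step_ce + transfer_weight * transfer_term
```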
cnn Chinese Academy of Sciences · University of Chinese Academy of Sciences · Beijing Key Laboratory of Safe AI and Superalignment +1 more
defense arXiv · Aug 8, 2025
Bing Han, Feifei Zhao, Dongcheng Zhao et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +2 more
Training-free post-fine-tuning defense that restores LLM safety alignment via sparse, composable safety-direction neuron projections
Transfer Learning Attack Prompt Injection nlp
While fine-tuning services drive the rapid expansion of task capabilities in large language models (LLMs), they are often accompanied by the degradation and reorganization of safety-aligned representations, making models more prone to deviating from human preferences and exposing them to emerging jailbreak risks. Existing post-fine-tuning defense methods predominantly rely on single-scale safety correction mechanisms, which struggle to achieve a robust balance among safety, model utility, and continual adaptability. We propose Multi-Level Safety Continual Projection (MSCP), a training-free post-fine-tuning safety enhancement method that implicitly aligns global and localized safety activations through coordinated multi-level representations to isolate the sparse neuron clusters governing safety-sensitive behaviors. It then applies composable safety-direction projections without retraining, effectively suppressing harmful outputs under minimal parameter perturbations while preserving task performance and improving alignment with human preferences. Extensive experiments across multiple fine-tuned LLMs demonstrate that our method significantly reduces harmfulness scores and attack success rates with minimal parameter modifications, while preserving model utility. Furthermore, we introduce a task-specific, multi-dimensional heterogeneous safety activation clustering mechanism that enables continual defense and generalization against unforeseen, emerging safety concerns.
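As an illustration of what a "composable safety-direction projection without retraining" could look like, the NumPy sketch below estimates a safety direction from base-model activations on harmful versus benign probes and removes the fine-tuning drift along that direction for a sparse set of neurons. The function names, the direction estimate, the sparsity rule, and the weight edit are assumptions for illustration, not MSCP's actual procedure.

```python
# Hypothetical NumPy sketch of a projection-based, training-free safety restoration
# step inspired by the abstract; the direction estimate, sparsity rule, and weight
# edit are assumptions for illustration, not MSCP's actual procedure.
import numpy as np


def safety_direction(acts_harmful: np.ndarray, acts_benign: np.ndarray) -> np.ndarray:
    """Estimate a safety direction as the normalized difference of mean activations.

    acts_harmful / acts_benign: [num_prompts, hidden_dim] activations of the
    safety-aligned base model on harmful vs. benign probe prompts.
    """
    d = acts_harmful.mean(axis=0) - acts_benign.mean(axis=0)
    return d / (np.linalg.norm(d) + 1e-8)


def project_out(w_finetuned: np.ndarray, w_base: np.ndarray,
                direction: np.ndarray, top_k: int = 64) -> np.ndarray:
    """Remove the fine-tuning drift along the safety direction for a sparse
    set of neurons, leaving the rest of the fine-tuned weights untouched.

    w_finetuned / w_base: [hidden_dim, in_dim] weight matrices of one layer.
    direction:            [hidden_dim] unit safety direction in output space.
    """
    delta = w_finetuned - w_base                       # fine-tuning update
    # Component of the update that moves layer outputs along the safety direction.
    along = np.outer(direction, direction @ delta)     # [hidden_dim, in_dim]
    # Sparse edit: only touch the neurons most implicated in safety drift.
    scores = np.abs(direction) * np.linalg.norm(delta, axis=1)
    mask = np.zeros(direction.shape[0], dtype=bool)
    mask[np.argsort(scores)[-top_k:]] = True
    edited = w_finetuned.copy()
    edited[mask] -= along[mask]
    return edited


if __name__ == "__main__":
    # Toy demo with random stand-in activations and weights.
    rng = np.random.default_rng(0)
    d = safety_direction(rng.normal(size=(32, 512)), rng.normal(size=(32, 512)))
    w_edited = project_out(rng.normal(size=(512, 1024)), rng.normal(size=(512, 1024)), d)
    print(w_edited.shape)  # (512, 1024)
```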
llm transformer University of Chinese Academy of Sciences · Chinese Academy of Sciences · Beijing Institute of AI Safety and Governance +1 more