Latest papers

5 papers
attack · arXiv · Dec 27, 2025

Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks

Jihang Wang, Dongcheng Zhao, Ruolin Chen et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Proposes improved gradient-based adversarial attacks on spiking neural networks, showing existing SNN robustness claims are overestimated.

Input Manipulation Attack · vision
benchmark · arXiv · Nov 9, 2025

Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin, Guobin Shen, Zihao Yang et al. · Beijing Institute of AI Safety and Governance · Chinese Academy of Sciences +3 more

Proposes an SLM-based multi-agent debate judge and the HAJailBench benchmark to evaluate LLM jailbreak safety at 43% lower inference cost.

Prompt Injection · nlp
1 citation
defense · arXiv · Oct 1, 2025

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Guobin Shen, Dongcheng Zhao, Haibo Tong et al. · Beijing Institute of AI Safety and Governance · Beijing Key Laboratory of Safe AI and Superalignment +2 more

Entropy-guided RL alignment trains LLMs to resist 20+ jailbreak methods using internal confidence signals, with no external validators needed.

Prompt Injection · nlp
1 citation
defense · arXiv · Sep 25, 2025

Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Haibo Tong, Dongcheng Zhao, Guobin Shen et al. · University of Chinese Academy of Sciences · Long-term AI +3 more

Defends LLMs against multi-turn jailbreaks via bidirectional intention inference over the full conversation history.

Prompt Injection · nlp
1 citation
defense · arXiv · Aug 8, 2025

Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining

Bing Han, Feifei Zhao, Dongcheng Zhao et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +2 more

A training-free defense that restores LLM safety alignment after fine-tuning via sparse neuron projection.

Transfer Learning Attack · Prompt Injection · nlp