Latest papers

5 papers
attack · arXiv · Dec 27, 2025

Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks

Jihang Wang, Dongcheng Zhao, Ruolin Chen et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Proposes improved gradient-based adversarial attacks on spiking neural networks, showing existing SNN robustness claims are overestimated.

Input Manipulation Attack · vision
benchmark · arXiv · Nov 9, 2025

Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin, Guobin Shen, Zihao Yang et al. · Beijing Institute of AI Safety and Governance · Chinese Academy of Sciences +3 more

Proposes an SLM-based multi-agent debate judge and the HAJailBench benchmark to evaluate LLM jailbreak safety at 43% lower inference cost.

Prompt Injection · nlp
1 citation
defense · arXiv · Oct 1, 2025

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Guobin Shen, Dongcheng Zhao, Haibo Tong et al. · Beijing Institute of AI Safety and Governance · Beijing Key Laboratory of Safe AI and Superalignment +2 more

Entropy-guided RL alignment trains LLMs to resist 20+ jailbreak methods using internal confidence signals, with no external validators needed.

Prompt Injection · nlp
1 citation
defense · arXiv · Sep 25, 2025

Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Haibo Tong, Dongcheng Zhao, Guobin Shen et al. · University of Chinese Academy of Sciences · Long-term AI +3 more

Defends LLMs against multi-turn jailbreaks via bidirectional intention inference over the full conversation history.

Prompt Injection · nlp
1 citation
defense · arXiv · Aug 8, 2025

Multi-Level Safety Continual Projection for Fine-Tuned Large Language Models without Retraining

Bing Han, Feifei Zhao, Dongcheng Zhao et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +2 more

A training-free defense that restores LLM safety alignment after fine-tuning via sparse neuron projection.

Transfer Learning Attack · Prompt Injection · nlp