Dongcheng Zhao

defense arXiv Sep 25, 2025 · Sep 2025

Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Haibo Tong, Dongcheng Zhao, Guobin Shen et al. · University of Chinese Academy of Sciences · Long-term AI +3 more

Defends LLMs against multi-turn jailbreaks using bidirectional intention inference across conversation history

Prompt Injection nlp

1 citation · PDF

defense arXiv Oct 1, 2025 · Oct 2025

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Guobin Shen, Dongcheng Zhao, Haibo Tong et al. · Beijing Institute of AI Safety and Governance · Beijing Key Laboratory of Safe AI and Superalignment +2 more

Entropy-guided RL alignment trains LLMs to resist 20+ jailbreak methods using internal confidence signals, with no external validators needed

Prompt Injection nlp

1 citation · PDF

benchmark arXiv Nov 9, 2025 · Nov 2025

Efficient LLM Safety Evaluation through Multi-Agent Debate

Dachuan Lin, Guobin Shen, Zihao Yang et al. · Beijing Institute of AI Safety and Governance · Chinese Academy of Sciences +3 more

Proposes an SLM multi-agent debate judge and HAJailBench to evaluate LLM jailbreak safety at 43% lower inference cost

Prompt Injection nlp

1 citation · PDF

attack arXiv Dec 27, 2025 · Dec 2025

Towards Reliable Evaluation of Adversarial Robustness for Spiking Neural Networks

Jihang Wang, Dongcheng Zhao, Ruolin Chen et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +1 more

Proposes improved gradient-based adversarial attacks on spiking neural networks, showing existing SNN robustness claims are overestimated

Input Manipulation Attack vision

PDF Code
