ML Security Papers

Latest papers

3 papers

defense · arXiv · Oct 1, 2025

Safety Instincts: LLMs Learn to Trust Their Internal Compass for Self-Defense

Guobin Shen, Dongcheng Zhao, Haibo Tong et al. · Beijing Institute of AI Safety and Governance · Beijing Key Laboratory of Safe AI and Superalignment +2 more

Entropy-guided RL alignment trains LLMs to resist 20+ jailbreak methods using internal confidence signals, with no external validators needed

Prompt Injection nlp

1 citation

defense · arXiv · Sep 25, 2025

Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks

Haibo Tong, Dongcheng Zhao, Guobin Shen et al. · University of Chinese Academy of Sciences · Long-term AI +3 more

Defends LLMs against multi-turn jailbreaks using bidirectional intention inference across conversation history

Prompt Injection nlp

1 citation

defense · arXiv · Aug 15, 2025

Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble

Jihang Wang, Dongcheng Zhao, Ruolin Chen et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more

Defends SNNs against adversarial examples by exploiting temporal structure to diversify decision boundaries and suppress cross-timestep vulnerability transfer

Input Manipulation Attack vision

