Latest papers

4 papers
defense arXiv Mar 4, 2026 · 4w ago

Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information

Yifan Zhu, Yibo Miao, Yinpeng Dong et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more

Proposes MI-UE, a theoretically grounded availability-poisoning defense that blocks unauthorized model training by reducing mutual information in poisoned image features

Data Poisoning Attack vision
PDF
benchmark arXiv Feb 27, 2026 · 5w ago

Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking

Zhicheng Fang, Jingjie Zheng, Chenxu Fu et al. · Shanghai Qi Zhi Institute · University of Melbourne +1 more

Automated multi-agent system translates jailbreak papers into executable modules for standardized, reproducible LLM robustness benchmarking

Prompt Injection nlp
PDF · Code
attack arXiv Feb 7, 2026 · 8w ago

Reverse-Engineering Model Editing on Language Models

Zhiyu Sun, Minrui Luo, Yu Wang et al. · Shanghai Qi Zhi Institute · East China Normal University +3 more

Recovers private edited data from LLM parameter update matrices using spectral analysis and entropy-based prompt reconstruction

Model Inversion Attack Sensitive Information Disclosure nlp
PDF · Code
defense arXiv Sep 29, 2025 · Sep 2025

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

Yichi Zhang, Yue Ding, Jingwen Yang et al. · Shanghai Qi Zhi Institute +3 more

Defends Large Reasoning Models against jailbreaks by aligning CoT safety via process-supervised preference optimization with corrective interventions

Prompt Injection nlp
2 citations · 1 influential · PDF