Latest papers

2 papers
defense arXiv Apr 21, 2026

Benign Overfitting in Adversarial Training for Vision Transformers

Jiaming Zhang, Meng Ding, Shaopeng Fu et al. · King Abdullah University of Science and Technology · Renmin University of China +2 more

Theoretical analysis proving Vision Transformers achieve benign overfitting under adversarial training with bounded perturbations

Input Manipulation Attack vision
defense arXiv Sep 29, 2025

AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models

Zihao Zhu, Xinyu Wu, Gehan Hu et al. · The Chinese University of Hong Kong · State University of New York at Buffalo +1 more

Adversarial CoT fine-tuning teaches reasoning models to self-correct harmful drifts, improving jailbreak robustness while reducing over-refusal

Prompt Injection nlp
2 citations