Di Wang

Papers in Database (5)

defense · arXiv · Apr 14, 2026

Understanding and Improving Continuous Adversarial Training for LLMs via In-context Learning Theory

Shaopeng Fu, Di Wang · King Abdullah University of Science and Technology

Proves why continuous adversarial training defends LLMs against jailbreaks and proposes embedding regularization for better robustness

Input Manipulation Attack · Prompt Injection · nlp
defense · arXiv · Sep 17, 2025

Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning

Zhaoyang Chu, Yao Wan, Zhikun Zhang et al. · Huazhong University of Science and Technology · Zhejiang University +4 more

Defends code LLMs against sensitive training data extraction by selectively unlearning memorized PII and credentials via gradient ascent

Model Inversion Attack · Sensitive Information Disclosure · nlp
defense · arXiv · Mar 9, 2026

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang, Shu Yang, Lijie Hu et al. · King Abdullah University of Science and Technology · China University of Petroleum-Beijing +1 more

Defends VLMs against visual jailbreaks via label-free fine-tuning on neutral threat-image tasks to shape safety-oriented personas

Prompt Injection · vision · multimodal · nlp
attack · arXiv · Apr 14, 2026

CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

Qi Li, Cheng-Long Wang, Yinzhi Cao et al. · King Abdullah University of Science and Technology · National University of Singapore +1 more

Mounts membership inference attacks on subset-trained models, revealing both training membership and selection participation across data pipelines

Membership Inference Attack · vision · nlp
defense · arXiv · Apr 21, 2026

Benign Overfitting in Adversarial Training for Vision Transformers

Jiaming Zhang, Meng Ding, Shaopeng Fu et al. · King Abdullah University of Science and Technology · Renmin University of China +2 more

Proves that Vision Transformers achieve benign overfitting under adversarial training with bounded perturbations

Input Manipulation Attack · vision