Di Wang

Papers in Database (2)

defense arXiv Mar 9, 2026

Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang, Shu Yang, Lijie Hu et al. · King Abdullah University of Science and Technology · China University of Petroleum-Beijing +1 more

Defends VLMs against visual jailbreaks via label-free fine-tuning on neutral threat-image tasks to shape safety-oriented personas

Prompt Injection · vision · multimodal · nlp
defense arXiv Sep 17, 2025

Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning

Zhaoyang Chu, Yao Wan, Zhikun Zhang et al. · Huazhong University of Science and Technology · Zhejiang University +4 more

Defends code LLMs against sensitive training data extraction by selectively unlearning memorized PII and credentials via gradient ascent

Model Inversion Attack · Sensitive Information Disclosure · nlp