Latest papers

4 papers
defense arXiv Jan 31, 2026 · Jan 2026

Towards Building Non-Fine-Tunable Foundation Models

Ziyao Wang, Nizhang Li, Pingzhi Li et al. · University of Maryland, College Park · Macau University of Science and Technology +1 more

Defends open-source LLMs against unauthorized fine-tuning by hiding a sparse subnetwork mask, so that fine-tuning without the key degrades adaptation

Transfer Learning Attack Model Theft nlp
PDF
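The "hidden sparse subnetwork mask" idea can be pictured as a key-derived binary mask over each weight tensor: key holders can locate the functional subnetwork, while an adversary fine-tuning the released weights updates the wrong parameters. The sketch below is a minimal illustration of that general idea, not the paper's actual construction; `keyed_sparse_mask` and the (key, shape)-seeded keying scheme are hypothetical.

```python
import torch

def keyed_sparse_mask(param: torch.Tensor, key: int, keep: float = 0.1) -> torch.Tensor:
    """Derive a deterministic sparse binary mask from a secret key.
    Hypothetical scheme: seed a PRNG with (key, tensor shape) and keep
    roughly a `keep` fraction of the weights."""
    seed = hash((key, tuple(param.shape))) % (2**31)
    gen = torch.Generator().manual_seed(seed)
    return (torch.rand(param.shape, generator=gen) < keep).float()

def select_subnetwork(model: torch.nn.Module, key: int) -> None:
    """Zero out all weights outside the keyed subnetwork. Without the key,
    the functional subnetwork cannot be located, so naive fine-tuning
    spreads updates across (and degrades) the hidden structure."""
    with torch.no_grad():
        for p in model.parameters():
            p.mul_(keyed_sparse_mask(p, key))
```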
benchmark arXiv Nov 26, 2025 · Nov 2025

Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck

Xinyu Liu, Xu Zhang, Can Chen et al. · Michigan State University · Illinois Institute of Technology +1 more

Uses Information Bottleneck theory to analyze backdoor training dynamics and proposes a model-level stealthiness metric for backdoor attacks

Model Poisoning vision
PDF Code
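Information Bottleneck analyses of this kind typically track mutual-information quantities such as I(Z; Y) between hidden representations and labels over training, comparing trajectories for clean versus trigger-stamped samples. The toy binned estimator below is a crude stand-in for the more careful estimators such papers use, not code from the paper; it assumes scalar features `z` and integer labels `y`.

```python
import numpy as np

def binned_mutual_information(z: np.ndarray, y: np.ndarray, bins: int = 30) -> float:
    """Histogram estimate of I(Z; Y) for scalar features z and discrete labels y."""
    z_binned = np.digitize(z, np.histogram_bin_edges(z, bins=bins))
    joint = np.zeros((z_binned.max() + 1, y.max() + 1))
    for zi, yi in zip(z_binned, y):
        joint[zi, yi] += 1
    joint /= joint.sum()                                  # empirical joint p(z, y)
    pz = joint.sum(axis=1, keepdims=True)                 # marginal p(z)
    py = joint.sum(axis=0, keepdims=True)                 # marginal p(y)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pz @ py)[nz])).sum())
```

Tracking this quantity per epoch on clean and poisoned subsets is one simple way to expose the distinct training dynamics the summary refers to.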
attack arXiv Sep 15, 2025 · Sep 2025

Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Yifan Lan, Yuanpu Cao, Weitong Zhang et al. · The Pennsylvania State University · The University of North Carolina at Chapel Hill

Gradient-optimized adversarial images hijack MLLM output preferences at inference time with transferable universal perturbations

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF Code
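The attack summary maps onto a standard PGD-style loop: optimize a norm-bounded image perturbation so the model's output distribution favors a target response. The sketch below assumes a hypothetical `model.loss(image, target_ids)` wrapper that returns a differentiable cross-entropy against the target tokens; the paper's actual objective and its universal-perturbation training are not reproduced here.

```python
import torch

def hijack_image(model, image, target_ids, steps=200, eps=8 / 255, lr=1 / 255):
    """PGD-style sketch: find a bounded perturbation `delta` that steers the
    model toward `target_ids`. `model.loss` is a hypothetical interface."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = model.loss(image + delta, target_ids)  # hypothetical wrapper
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()           # signed gradient step
            delta.clamp_(-eps, eps)                   # stay within the L-inf ball
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()
```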
defense arXiv Jan 5, 2025 · Jan 2025

Layer-Level Self-Exposure and Patch: Affirmative Token Mitigation for Jailbreak Attack Defense

Yang Ouyang, Hengrui Gu, Shuhang Lin et al. · North Carolina State University · Rutgers University +4 more

Defends LLMs against jailbreaks by identifying harmful-token-generating layers and patching them via adversarial unlearning

Prompt Injection nlp
PDF Code
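"Patching via adversarial unlearning" plausibly reduces to gradient ascent on the affirmative first token ("Sure", etc.), restricted to the implicated layers. The sketch below is a simplified stand-in, not the paper's procedure; `model.layers`, the harmful-prompt iterator, and the single-token loss are all assumptions.

```python
import torch
import torch.nn.functional as F

def unlearn_affirmative(model, layer_idxs, harmful_prompts, affirm_id,
                        lr=1e-5, steps=50):
    """Gradient-ascent 'unlearning' restricted to suspect layers: make the
    affirmative first token less likely on harmful prompts. `model.layers`
    is an assumed attribute exposing per-layer parameters."""
    params = [p for i in layer_idxs for p in model.layers[i].parameters()]
    opt = torch.optim.AdamW(params, lr=lr)
    for _ in range(steps):
        for input_ids in harmful_prompts:              # (batch, seq) token ids
            logits = model(input_ids).logits[:, -1, :] # next-token logits
            target = torch.full((logits.size(0),), affirm_id, dtype=torch.long)
            loss = -F.cross_entropy(logits, target)    # ascend, not descend
            opt.zero_grad()
            loss.backward()
            opt.step()
```

Restricting the optimizer to the exposed layers is what keeps the patch local, leaving the rest of the model's behavior intact.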