Latest papers

10 papers
defense · arXiv · Mar 1, 2026

Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li, Zhening Liu, Zijian Li et al. · Lingnan University · The Hong Kong University of Science and Technology

Defends LLM safety alignment during fine-tuning by scoring and removing unsafe tokens via loss-difference between safety-degraded and utility-oriented reference models

Transfer Learning Attack · Prompt Injection · nlp
PDF Code
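As an illustration of the loss-difference idea, here is a minimal sketch in plain Python. The function name `select_tokens`, the sign convention, and the threshold are assumptions for exposition, not the paper's actual implementation:

```python
def select_tokens(tokens, degraded_losses, utility_losses, threshold=0.5):
    """Toy sketch: keep tokens whose loss-difference score stays below threshold.

    Assumed convention (illustrative only): a token that the safety-degraded
    reference model fits *better* (lower loss) than the utility-oriented
    reference is suspicious, so its score  utility_loss - degraded_loss
    is high and the token is dropped from the fine-tuning data.
    """
    kept = []
    for tok, l_deg, l_util in zip(tokens, degraded_losses, utility_losses):
        score = l_util - l_deg  # high score -> token favored by degraded model
        if score <= threshold:
            kept.append(tok)
    return kept
```

In practice the per-token losses would come from two forward passes of the reference models over the candidate fine-tuning corpus.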
defense · arXiv · Feb 2, 2026

Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment

Zehua Cheng, Jianwei Yang, Wei Dai et al. · University of Oxford · FLock.io +1 more

Proposes certifiably robust LLM jailbreak defense via randomized ablation smoothing, cutting GCG attack success from 84% to 1%

Input Manipulation Attack · Prompt Injection · nlp
PDF
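Randomized ablation smoothing can be illustrated with a toy majority vote over randomly ablated copies of the prompt. The function name, keep probability, and classifier interface below are hypothetical; a real certified defense additionally bounds how many votes an attacker controlling a fixed number of tokens can flip:

```python
import random

def smoothed_is_jailbreak(tokens, classifier, n_samples=100, keep_prob=0.8, seed=0):
    """Toy smoothed safety check: majority vote of a base classifier
    over prompts with tokens randomly ablated (dropped)."""
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_samples):
        ablated = [t for t in tokens if rng.random() < keep_prob]
        votes += classifier(ablated)  # classifier returns True if jailbreak
    return votes > n_samples / 2
```

The intuition: an adversarial suffix (e.g. one produced by GCG) only influences samples in which its tokens survive ablation, so a sufficiently confident majority is provably stable under bounded token-level perturbations.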
defense · arXiv · Jan 19, 2026

Proxy Robustness in Vision Language Models is Effortlessly Transferable

Xiaowei Fu, Fuxiang Huang, Lei Zhang · Chongqing University · Lingnan University

Transfers adversarial robustness across heterogeneous CLIP variants via proxy distillation, boosting VLM defense without costly adversarial teacher training

Input Manipulation Attack · vision · multimodal
PDF Code
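Proxy distillation builds on the standard distillation objective of matching a student's tempered predictive distribution to a (robust) teacher's. A minimal sketch, with function names and the temperature as illustrative assumptions rather than the paper's setup:

```python
import math

def softmax(logits, temp=1.0):
    """Tempered softmax over a list of logits."""
    exps = [math.exp(l / temp) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, temp=2.0):
    """Toy proxy-distillation objective: pull the student's tempered
    distribution toward the robust proxy teacher's."""
    return kl_div(softmax(teacher_logits, temp), softmax(student_logits, temp))
```

The appeal of the proxy route is that the robust teacher is trained once and reused, so each new CLIP variant is hardened by distillation alone, without re-running adversarial training.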
defense · arXiv · Jan 17, 2026

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Zimo Ji, Daoyuan Wu, Wenyuan Jiang et al. · Hong Kong University of Science and Technology · Lingnan University +3 more

Proposes SEAgent, a mandatory access control framework that blocks privilege escalation attacks in LLM agent tool use via information flow monitoring and ABAC policies

Prompt Injection · Excessive Agency · nlp
1 citation PDF
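A mandatory access control check for agent tool calls can be sketched as a default-deny ABAC lookup. The attribute names and policy shape below are illustrative, not SEAgent's actual schema:

```python
def abac_allows(request, policies):
    """Toy attribute-based access control check for an agent tool call.

    `request` maps attribute names to values (e.g. the subject's trust
    level, the tool being invoked); each policy lists the attribute
    values it requires. A call is allowed only if some policy matches.
    """
    for policy in policies:
        if all(request.get(attr) == val for attr, val in policy.items()):
            return True
    return False  # default-deny, as mandatory access control requires
```

The mandatory part is that these policies are enforced by the framework around the agent, so a prompt-injected model cannot talk its way past them.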
defense · arXiv · Dec 8, 2025

When Privacy Meets Recovery: The Overlooked Half of Surrogate-Driven Privacy Preservation for MLLM Editing

Siyuan Xu, Yibing Liu, Peilin Chen et al. · City University of Hong Kong · Hon Hai Research Institute +1 more

Defends user visual privacy in cloud MLLM image editing via surrogate substitution and a diffusion-based recovery framework

Sensitive Information Disclosure · vision · multimodal
PDF
survey · arXiv · Nov 19, 2025

Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks

Zimo Ji, Xunguang Wang, Zongjie Li et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology +3 more

This SoK taxonomizes indirect prompt injection (IPI) defenses for LLM agents, identifies six root causes of defense bypasses, and proposes three novel adaptive attacks

Prompt Injection · nlp
PDF
defense · International Journal of Compu... · Nov 14, 2025

Unsupervised Robust Domain Adaptation: Paradigm, Theory and Algorithm

Fuxiang Huang, Xiaowei Fu, Shiyu Ye et al. · Chongqing University · Lingnan University +3 more

Defends unsupervised domain adaptation models against adversarial attacks via disentangled distillation post-training

Input Manipulation Attack · vision
PDF
defense · arXiv · Oct 13, 2025

Learning to Watermark: A Selective Watermarking Framework for Large Language Models via Multi-Objective Optimization

Chenrui Wang, Junyi Shu, Billy Chiu et al. · Harbin Institute of Technology · Lingnan University +2 more

Selective LLM text watermarking via a learned MLP that balances detectability and text quality using multi-objective optimization

Output Integrity Attack · nlp
PDF Code
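For context, a standard green-list watermark detector (in the style of Kirchenbauer et al.) can be sketched as below; the paper's contribution is a learned MLP that decides *which* tokens to watermark, which this sketch omits. All names and the key are illustrative:

```python
import hashlib
import math

def in_green_list(prev_token, token, key="demo-key", gamma=0.5):
    """Keyed pseudo-random 'green list' membership for a token pair."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return digest[0] / 255.0 < gamma

def detection_z_score(tokens, key="demo-key", gamma=0.5):
    """z-score of the observed green-token fraction against the
    fraction gamma expected by chance in unwatermarked text."""
    hits = sum(in_green_list(p, t, key, gamma) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Watermarked generation biases sampling toward green tokens, so the detector's z-score grows with text length; selective schemes trade some of that detectability for text quality by watermarking only low-impact tokens.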
benchmark · arXiv · Oct 10, 2025

On the Fairness of Privacy Protection: Measuring and Mitigating the Disparity of Group Privacy Risks for Differentially Private Machine Learning

Zhi Yang, Changwu Huang, Ke Tang et al. · Southern University of Science and Technology · Lingnan University

Proposes a tighter membership inference game to audit group privacy risk disparity and adaptive DP-SGD to equalize protection across demographic groups

Membership Inference Attack · tabular · vision
PDF Code
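The auditing idea can be illustrated with the simplest membership inference attack, a loss threshold, and a per-group disparity measure. Function names, the threshold rule, and the data layout are illustrative assumptions, not the paper's tighter game:

```python
def mia_advantage(member_losses, nonmember_losses, threshold):
    """TPR - FPR of a loss-threshold membership inference attack:
    guess 'member' whenever the example's loss is below threshold."""
    tpr = sum(l < threshold for l in member_losses) / len(member_losses)
    fpr = sum(l < threshold for l in nonmember_losses) / len(nonmember_losses)
    return tpr - fpr

def group_privacy_disparity(per_group):
    """Gap between the most- and least-exposed groups' MIA advantage.

    `per_group` maps a group label to (member_losses, nonmember_losses,
    threshold) for that group's audit.
    """
    advs = [mia_advantage(m, n, t) for m, n, t in per_group.values()]
    return max(advs) - min(advs)
```

A disparity near zero means DP-SGD protects all demographic groups about equally; the paper's adaptive variant adjusts training to shrink this gap.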
benchmark · arXiv · Aug 17, 2025

MCPSecBench: A Systematic Security Benchmark and Playground for Testing Model Context Protocols

Yixuan Yang, Cuifeng Gao, Daoyuan Wu et al. · Eurecom · Lingnan University +2 more

Benchmarks MCP security across Claude, OpenAI, and Cursor, uncovering 17 attack types and finding existing defenses less than 30% effective

Insecure Plugin Design · Prompt Injection · nlp
PDF