ML Security Papers

Latest papers

23 papers

defense arXiv Apr 28, 2026 · 23d ago

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Mengyao Du, Han Fang, Haokai Ma et al. · National University of Defense Technology · University of Science and Technology of China +2 more

Lightweight detector that identifies prompt injection attacks in web agent screenshots using visual gradient analysis and text recovery

Prompt Injection Excessive Agency multimodal nlp

PDF

attack arXiv Apr 23, 2026 · 28d ago

Toward Efficient Membership Inference Attacks against Federated Large Language Models: A Projection Residual Approach

Guilin Deng, Silong Chen, Yuchuan Luo et al. · National University of Defense Technology · City University of Hong Kong +1 more

Gradient-based membership inference attack on federated LLMs achieving near-perfect accuracy via projection residual analysis

Membership Inference Attack nlp federated-learning

PDF Code

defense arXiv Apr 14, 2026 · 5w ago

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

Songping Peng, Zhiheng Zhang, Daojian Zeng et al. · Hunan Normal University · Chinese Academy of Sciences +1 more

Couples weight subspace constraints with activation regularization to prevent safety degradation during LLM fine-tuning

Prompt Injection nlp

PDF

attack arXiv Apr 8, 2026 · 6w ago

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Yunhao Feng, Yifan Ding, Yingshui Tan et al. · National University of Defense Technology · Alibaba Group +2 more

Backdoor attack embedding encrypted malicious payloads in agent skills, activated by triggers during skill composition

Model Poisoning Excessive Agency nlp

PDF

attack arXiv Apr 3, 2026 · 6w ago

Revealing Physical-World Semantic Vulnerabilities: Universal Adversarial Patches for Infrared Vision-Language Models

Chengyin Hu, Yuxian Dong, Yikun Guo et al. · National University of Defense Technology

Universal physical adversarial patches that disrupt semantic alignment in infrared vision-language models across classification, captioning, and VQA tasks

Input Manipulation Attack Prompt Injection multimodal vision

PDF

defense arXiv Apr 1, 2026 · 7w ago

Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Xinyu Sun, Wanwei Liu, Haoang Chi et al. · National University of Defense Technology · Nanjing University +1 more

Interpretable DNN repair using Shapley-guided fault localization and derivative-free optimization for backdoor removal, adversarial defense, and fairness

Input Manipulation Attack Model Poisoning vision

PDF

defense arXiv Mar 16, 2026 · 9w ago

BadLLM-TG: A Backdoor Defender powered by LLM Trigger Generator

Ruyi Zhang, Heng Gao, Songlei Jian et al. · National University of Defense Technology

LLM-powered trigger generator using reinforcement learning to detect and remove backdoors in NLP models via adversarial training

Model Poisoning nlp

PDF Code

attack arXiv Mar 4, 2026 · 11w ago

LEA: Label Enumeration Attack in Vertical Federated Learning

Wenhao Jiang, Shaojing Fu, Yuchuan Luo et al. · National University of Defense Technology

Infers private labels in vertical federated learning by enumerating label permutations and comparing gradient cosine similarity, without auxiliary data

Model Inversion Attack federated-learning

PDF

defense arXiv Feb 6, 2026 · Feb 2026

TrapSuffix: Proactive Defense Against Adversarial Suffixes in Jailbreaking

Mengyao Du, Han Fang, Haokai Ma et al. · National University of Defense Technology · National University of Singapore +1 more

Proactive fine-tuning defense traps gradient-based jailbreak suffixes or fingerprints them, cutting LLM attack success below 0.01%

Input Manipulation Attack Prompt Injection nlp

PDF

attack arXiv Jan 24, 2026 · Jan 2026

Reconstructing Training Data from Adapter-based Federated Large Language Models

Silong Chen, Yuchuan Luo, Guilin Deng et al. · National University of Defense Technology · City University of Hong Kong

Gradient inversion attack reconstructs training text from LoRA adapter gradients in federated LLMs achieving ROUGE-1/2 over 99

Model Inversion Attack Sensitive Information Disclosure nlp federated-learning

PDF Code

defense arXiv Jan 20, 2026 · Jan 2026

MirageNet: A Secure, Efficient, and Scalable On-Device Model Protection in Heterogeneous TEE and GPU System

Huadi Zheng, Li Cheng, Yan Ding · National University of Defense Technology

Defends edge-deployed DNN model IP from theft via TEE-GPU obfuscation, cutting overhead 16% versus GroupCover

Model Theft vision

PDF

defense arXiv Jan 6, 2026 · Jan 2026

JPU: Bridging Jailbreak Defense and Unlearning via On-Policy Path Rectification

Xi Wang, Songlei Jian, Shasha Li et al. · National University of Defense Technology

Defends LLMs against jailbreaks by unlearning dynamic information paths that reassemble harmful outputs, not just isolated parameters

Prompt Injection nlp

PDF

attack arXiv Dec 25, 2025 · Dec 2025

Exploring the Security Threats of Retriever Backdoors in Retrieval-Augmented Code Generation

Tian Li, Bo Lin, Shangwen Wang et al. · National University of Defense Technology

Backdoors RACG retrievers to inject vulnerable code into LLM context, achieving 40%+ vulnerable code generation while bypassing defenses

Model Poisoning Prompt Injection nlp generative

PDF

attack arXiv Dec 16, 2025 · Dec 2025

Reasoning-Style Poisoning of LLM Agents via Stealthy Style Transfer: Process-Level Attacks and Runtime Monitoring in RSV Space

Xingfu Zhou, Pengfei Wang · National University of Defense Technology

Poisons LLM agent reasoning by style-transferring retrieved docs into pathological tones, bypassing content filters without altering facts

Prompt Injection nlp

2 citations PDF

attack TDSC Dec 16, 2025 · Dec 2025

Optimizing the Adversarial Perturbation with a Momentum-based Adaptive Matrix

Wei Tao, Sheng Long, Xin Liu et al. · National University of Defense Technology · Academy of Military Science +3 more

AdaMI: momentum-based adaptive matrix attack that provably improves adversarial transferability over PGD and MI-FGSM across networks

Input Manipulation Attack vision

PDF

defense medRxiv Dec 5, 2025 · Dec 2025

The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs

Jiale Zhao, Xing Mou, Jinlin Wu et al. · National University of Defense Technology · Chinese Academy of Sciences +3 more

Defends Medical MLLMs against cross-modality jailbreaks by grafting safety knowledge from base models during fine-tuning via parameter-space intervention

Transfer Learning Attack Prompt Injection multimodal vision nlp

PDF

benchmark arXiv Nov 22, 2025 · Nov 2025

Beyond Jailbreak: Unveiling Risks in LLM Applications Arising from Blurred Capability Boundaries

Yunyi Zhang, Shibo Cui, Baojun Liu et al. · Tsinghua University · National University of Defense Technology +1 more

Discovers LLM apps routinely exceed intended capability boundaries, with 17 apps performing malicious tasks without any adversarial prompting

Excessive Agency Prompt Injection nlp

PDF

attack Chinese Conference on Pattern ... Nov 10, 2025 · Nov 2025

FoCLIP: A Feature-Space Misalignment Framework for CLIP-Based Image Manipulation and Detection

Yulin Chen, Zeyuan Wang, Tianyuan Yu et al. · National University of Defense Technology

Gradient-based adversarial framework fools CLIP image-quality metrics, then detects tampered images via grayscale color-channel sensitivity

Input Manipulation Attack vision multimodal

PDF

defense arXiv Oct 9, 2025 · Oct 2025

Provably Robust Adaptation for Language-Empowered Foundation Models

Yuni Lai, Xiaoyu Xue, Linghui Shen et al. · The Hong Kong Polytechnic University · National University of Defense Technology +2 more

Certifiably robust few-shot classifier for CLIP/GraphCLIP using trimmed-mean prototypes and randomized smoothing against support-set poisoning

Data Poisoning Attack vision graph multimodal

1 citation PDF

Language-empowered foundation models (LeFMs), such as CLIP and GraphCLIP, have transformed multimodal learning by aligning visual (or graph) features with textual representations, enabling powerful downstream capabilities like few-shot learning. However, the reliance on small, task-specific support datasets collected in open environments exposes these models to poisoning attacks, where adversaries manipulate the support samples to degrade performance. Existing defenses rely on empirical strategies, which lack formal guarantees and remain vulnerable to unseen and adaptive attacks. Certified robustness offers provable guarantees but has been largely unexplored for few-shot classifiers based on LeFMs. This study seeks to fill these critical gaps by proposing the first provably robust few-shot classifier that is tailored for LeFMs. We term our model Language-empowered Few-shot Certification (LeFCert). It integrates both textual and feature embeddings with an adaptive blending mechanism. To achieve provable robustness, we propose a twofold trimmed mean prototype and derive provable upper and lower bounds for classification scores, enabling certification under worst-case poisoning scenarios. To further enhance the performance, we extend LeFCert with two variants by considering a more realistic and tighter attack budget: LeFCert-L incorporates randomized smoothing to provide Lipschitz continuity and derive robustness under dual budget constraints, and LeFCert-C provides collective certification for scenarios where attackers distribute a shared poisoning budget across multiple samples. Experiments demonstrate that LeFCert achieves state-of-the-art performance, significantly improving both clean and certified accuracy compared to existing baselines. Despite its advanced robustness mechanisms, LeFCert is computationally efficient, making it practical for real-world applications.

vlm transformer multimodal The Hong Kong Polytechnic University · National University of Defense Technology · Shanghai Jiao Tong University +1 more

PDF arXiv DOI
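
The LeFCert abstract above relies on a trimmed-mean prototype to limit the influence of poisoned support samples. The sketch below illustrates only the generic idea of a coordinate-wise trimmed mean, not the paper's "twofold" construction or its certification bounds, which the abstract does not spell out. The function names, the `trim` parameter, and the toy embeddings are all assumptions made for illustration.

```python
from typing import List, Sequence


def plain_mean_prototype(support: Sequence[Sequence[float]]) -> List[float]:
    """Ordinary class prototype: the per-dimension mean of all support embeddings."""
    n = len(support)
    return [sum(vec[d] for vec in support) / n for d in range(len(support[0]))]


def trimmed_mean_prototype(support: Sequence[Sequence[float]], trim: int) -> List[float]:
    """Coordinate-wise trimmed mean (illustrative, hypothetical helper).

    For each embedding dimension, sort the values, drop the `trim` smallest
    and the `trim` largest, and average what remains.
    """
    n = len(support)
    if n == 0:
        raise ValueError("support set must not be empty")
    if trim < 0 or 2 * trim >= n:
        raise ValueError("trim must satisfy 0 <= 2 * trim < len(support)")
    prototype = []
    for d in range(len(support[0])):
        column = sorted(vec[d] for vec in support)
        kept = column[trim:n - trim]
        prototype.append(sum(kept) / len(kept))
    return prototype


# Toy support set: four clean 2-D embeddings plus one poisoned outlier.
clean = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
poisoned = clean + [[100.0, -100.0]]

naive = plain_mean_prototype(poisoned)
robust = trimmed_mean_prototype(poisoned, trim=1)
print("plain mean:  ", naive)
print("trimmed mean:", robust)
```

In this toy case the single outlier drags the plain mean far from the clean cluster, while the trimmed mean stays close to it. That is the intuition behind using trimming as a building block for robustness to a bounded number of poisoned samples.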

attack arXiv Oct 1, 2025 · Oct 2025

Erased, But Not Forgotten: Erased Rectified Flow Transformers Still Remain Unsafe Under Concept Attack

Nanxiang Jiang, Zhaoxin Fan, Enhan Kang et al. · Beihang University · University of Science and Technology of China +3 more

Attacks concept erasure safety in Flux T2I models by exploiting attention localization, reactivating suppressed content via a 3.57 MB plug-and-play adapter

Input Manipulation Attack visiongenerative

1 citation PDF Code


