ML Security Papers

Latest papers

84 papers

defense arXiv Apr 5, 2026 · 3d ago

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

Siyuan Li, Zehao Liu, Xi Lin et al. · Shanghai Jiao Tong University · University of Illinois Urbana-Champaign +1 more

Multi-agent cooperative defense system that adapts across rounds to counter evolving LLM jailbreak attacks through deception and forensic analysis

Prompt Injection Excessive Agency nlp

PDF

defense arXiv Apr 4, 2026 · 4d ago

LOGER: Local–Global Ensemble for Robust Deepfake Detection in the Wild

Fei Wu, Dagong Lu, Mufeng Yao et al. · Shanghai Jiao Tong University · INTSIG Information

Deepfake detector combining global semantic analysis and local patch-level forensics for robust detection across manipulation methods

Output Integrity Attack vision multimodal

PDF

defense arXiv Apr 4, 2026 · 4d ago

HEDGE: Heterogeneous Ensemble for Detection of AI-GEnerated Images in the Wild

Fei Wu, Dagong Lu, Mufeng Yao et al. · Shanghai Jiao Tong University · INTSIG Information

Heterogeneous ensemble detector combining multi-scale features and diverse backbones to identify AI-generated images under real-world distortions

Output Integrity Attack vision generative

PDF

defense arXiv Mar 30, 2026 · 9d ago

Generalizable Detection of AI Generated Images with Large Models and Fuzzy Decision Tree

Fei Wu, Guanghao Ding, Zijian Niu et al. · Shanghai Jiao Tong University

Combines lightweight artifact detectors with multimodal LLMs via fuzzy decision trees for generalizable AI-generated image detection

Output Integrity Attack vision multimodal

PDF

defense arXiv Mar 25, 2026 · 14d ago

DP^2-VL: Private Photo Dataset Protection by Data Poisoning for Vision-Language Models

Hongyi Miao, Jun Jia, Xincheng Wang et al. · Shandong University · Shanghai Jiao Tong University +4 more

Data poisoning defense that protects private photo datasets from VLM fine-tuning attacks that extract identity-affiliation relationships

Data Poisoning Attack Sensitive Information Disclosure vision nlp multimodal

PDF

Recent advances in visual-language alignment have endowed vision-language models (VLMs) with fine-grained image understanding capabilities. However, this progress also introduces new privacy risks. This paper first proposes a novel privacy threat model named identity-affiliation learning: an attacker fine-tunes a VLM using only a few private photos of a target individual, thereby embedding associations between the target's facial identity and their private property and social relationships into the model's internal representations. Once deployed via public APIs, this model enables unauthorized exposure of the target user's private information upon input of their photos. To benchmark VLMs' susceptibility to such identity-affiliation leakage, we introduce the first identity-affiliation dataset comprising seven typical scenarios appearing in private photos. Each scenario is instantiated with multiple identity-centered photo-description pairs. Experimental results demonstrate that mainstream VLMs like LLaVA, Qwen-VL, and MiniGPT-v2 can recognize facial identities and infer identity-affiliation relationships by fine-tuning on a small-scale private photographic dataset, and even on synthetically generated datasets. To mitigate this privacy risk, we propose DP^2-VL, the first Dataset Protection framework for private photos that leverages Data Poisoning. By optimizing imperceptible perturbations that push the original representations toward an antithetical region, DP^2-VL induces a dataset-level shift in the embedding space of VLMs' encoders. This shift separates protected images from clean inference images, causing fine-tuning on the protected set to overfit. Extensive experiments demonstrate that DP^2-VL achieves strong generalization across models, robustness to diverse post-processing operations, and consistent effectiveness across varying protection ratios.

vlm transformer multimodal Shandong University · Shanghai Jiao Tong University · Donghua University +3 more

PDF arXiv

defense arXiv Mar 25, 2026 · 14d ago

AMIF: Authorizable Medical Image Fusion Model with Built-in Authentication

Jie Song, Jun Jia, Wei Sun et al. · Macao Polytechnic University · Shanghai Jiao Tong University +2 more

Medical image fusion model embedding visible copyright watermarks in outputs, removable only with authentication keys

Model Theft Output Integrity Attack vision multimodal

PDF

attack arXiv Mar 23, 2026 · 16d ago

Thermal Topology Collapse: Universal Physical Patch Attacks on Infrared Vision Systems

Chengyin Hu, Yikun Guo, Yuxian Dong et al. · China University of Petroleum-Beijing · University of Electronic Science and Technology of China +3 more

Universal adversarial patch attack on infrared pedestrian detectors using parameterized Bézier curves and cold patches

Input Manipulation Attack vision

PDF

attack arXiv Mar 20, 2026 · 19d ago

Trojan's Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance

Fazhong Liu, Zhuoyan Chen, Tu Lan et al. · Shanghai Jiao Tong University

Supply chain attack embedding malicious operational narratives in autonomous coding agent bootstrap guidance, achieving up to 64% success rate

AI Supply Chain Attacks Prompt Injection Insecure Plugin Design nlp

PDF

defense arXiv Mar 16, 2026 · 23d ago

Architecture-Agnostic Feature Synergy for Universal Defense Against Heterogeneous Generative Threats

Bingxue Zhang, Yang Gao, Feida Zhu et al. · University of Shanghai for Science and Technology · Singapore Management University +1 more

Universal adversarial defense against heterogeneous generative models using feature-space alignment to protect images from unauthorized editing

Input Manipulation Attack vision generative

PDF

defense arXiv Mar 16, 2026 · 23d ago

Counterexample Guided Branching via Directional Relaxation Analysis in Complete Neural Network Verification

Jingyang Li, Fu Song, Guoqiang Li · Shanghai Jiao Tong University · Chinese Academy of Sciences

Reformulates neural network verification as CEGAR loop, using spurious counterexamples to guide branching and tighten robustness proofs

Input Manipulation Attack vision

PDF

attack arXiv Mar 14, 2026 · 25d ago

Inevitable Encounters: Backdoor Attacks Involving Lossy Compression

Qian Li, Yunuo Chen, Yuntian Chen · Shanghai Jiao Tong University · Eastern Institute of Technology

Backdoor attacks adapted for lossy compression using ROI coding to preserve trigger information in JPEG bitstreams

Model Poisoning Data Poisoning Attack vision

PDF

defense arXiv Mar 12, 2026 · 27d ago

EmbTracker: Traceable Black-box Watermarking for Federated Language Models

Haodong Zhao, Jinming Hu, Yijie Bai et al. · Shanghai Jiao Tong University · Ant Group +2 more

Embeds per-client backdoor watermarks in federated LMs to trace model leaks to individual culprits via black-box queries

Model Theft Model Poisoning nlp federated-learning multimodal

PDF

defense arXiv Mar 10, 2026 · 29d ago

FlexServe: A Fast and Secure LLM Serving System for Mobile Devices with Flexible Resource Isolation

Yinpeng Wu, Yitong Chen, Lixiang Wang et al. · Shanghai Jiao Tong University

TEE-based LLM serving system that protects model weights and user data from compromised OS kernels on mobile devices

Model Theft Sensitive Information Disclosure nlp

PDF

attack arXiv Mar 9, 2026 · 4w ago

SlowBA: An efficiency backdoor attack towards VLM-based GUI agents

Junxian Li, Tu Lan, Haozhen Tan et al. · Shanghai Jiao Tong University

Backdoor attack on VLM GUI agents that induces excessive latency via RL-injected trigger-aware long reasoning chains

Model Poisoning multimodal vision nlp

PDF Code

survey arXiv Mar 2, 2026 · 5w ago

From Secure Agentic AI to Secure Agentic Web: Challenges, Threats, and Future Directions

Zhihang Deng, Jiaping Gui, Weinan Zhang · Shanghai Innovation Institute · Shanghai Jiao Tong University

Surveys prompt injection, toolchain abuse, and agent network threats across LLM agentic systems and web-scale deployments

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

defense arXiv Feb 28, 2026 · 5w ago

ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data

Haodong Zhao, Jinming Hu, Zhaomin Wu et al. · Shanghai Jiao Tong University · National University of Singapore +1 more

Defends federated LLM instruction tuning against interspersed backdoor poisoning using frequency-domain gradient signals and global clustering

Model Poisoning Data Poisoning Attack nlp federated-learning

PDF Code

benchmark arXiv Feb 26, 2026 · 5w ago

Devling into Adversarial Transferability on Image Classification: Review, Benchmark, and Evaluation

Xiaosen Wang, Zhijin Ge, Bohan Liu et al. · Huazhong University of Science and Technology · Xidian University +3 more

Surveys 100+ transfer-based adversarial attacks, proposes unified benchmark framework to address unfair comparisons in the field

Input Manipulation Attack vision

PDF Code

attack arXiv Feb 17, 2026 · 7w ago

Revisiting Backdoor Threat in Federated Instruction Tuning from a Signal Aggregation Perspective

Haodong Zhao, Jinming Hu, Gongshen Liu · Shanghai Jiao Tong University

Reveals distributed backdoor attacks via low-concentration poisoned data across benign FL clients defeat all existing defenses

Model Poisoning Data Poisoning Attack Training Data Poisoning nlp federated-learning

PDF

defense arXiv Feb 2, 2026 · 9w ago

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Xiaoyu Wen, Zhida He, Han Qi et al. · Shanghai AI Laboratory · Shanghai Jiao Tong University +1 more

Multi-agent RL co-evolves an LLM attacker and defender, generating novel jailbreaks to train robust safety alignment against unseen prompts

Prompt Injection nlp reinforcement-learning

PDF Code

defense arXiv Jan 27, 2026 · 10w ago

RvB: Automating AI System Hardening via Iterative Red-Blue Games

Lige Huang, Zicheng Liu, Jie Zhang et al. · Shanghai Artificial Intelligence Laboratory · Institute of Information Engineering +1 more

Automates LLM jailbreak guardrail hardening via iterative red-blue adversarial game without model parameter updates

Prompt Injection nlp

PDF

