Latest papers

20 papers
benchmark arXiv Mar 21, 2026 · 18d ago

Unveiling the Security Risks of Federated Learning in the Wild: From Research to Practice

Jiahao Chen, Zhiming Zhao, Yuwen Pu et al. · Zhejiang University · Chongqing University +1 more

Measurement study showing FL poisoning attacks are less effective in practice than research suggests due to heterogeneity and stability constraints

Data Poisoning Attack vision nlp tabular federated-learning
PDF Code
attack arXiv Feb 15, 2026 · 7w ago

SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement

Xiaojun Jia, Jie Liao, Simeng Qin et al. · Nanyang Technological University · Chongqing University +4 more

Automated framework that crafts stealthy skill-based prompt injections against LLM coding agents via trace-driven closed-loop refinement

Prompt Injection Insecure Plugin Design nlp
PDF
attack arXiv Jan 29, 2026 · 9w ago

On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

Xinwei Zhang, Hangcheng Liu, Li Bai et al. · The Hong Kong Polytechnic University · Nanyang Technological University +1 more

Proposes CAGE, a compression-aware adversarial attack showing that the robustness of token-compressed VLMs is systematically overestimated by standard attacks

Input Manipulation Attack vision multimodal
PDF
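
The listing gives only the headline finding; as a rough, unofficial sketch of what "compression-aware" optimization can mean, the PGD loop below differentiates through a token-compression stage so the perturbation survives pruning or merging at inference. `model.visual_tokens`, `model.decode`, and `compressor` are hypothetical interfaces, not the paper's API.

```python
import torch

def compression_aware_pgd(model, compressor, x, loss_fn,
                          eps=8 / 255, steps=20, lr=2 / 255):
    """Untargeted PGD whose loss is computed through the token-compression
    stage, so the perturbation stays effective after tokens are compressed."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        tokens = model.visual_tokens(x + delta)           # hypothetical: patch embeddings
        loss = loss_fn(model.decode(compressor(tokens)))  # hypothetical decode head
        loss.backward()
        with torch.no_grad():
            delta += lr * delta.grad.sign()               # ascend: maximize the loss
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (x + delta).detach()
```
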
benchmark arXiv Jan 26, 2026 · 10w ago

MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs

Dezhang Kong, Zhuxi Wu, Shiqi Liu et al. · Zhejiang University · National University of Malaysia +4 more

Benchmark revealing LLM web agents fail to detect disguised malicious URLs across 61K attack instances in 10 real-world scenarios

Prompt Injection nlp
PDF Code
defense IEEE Transactions on Image Pro... Jan 23, 2026 · 10w ago

StealthMark: Harmless and Stealthy Ownership Verification for Medical Segmentation via Uncertainty-Guided Backdoors

Qinkai Yu, Chong Zhang, Gaojie Jin et al. · University of Exeter · King Abdullah University of Science and Technology +6 more

Embeds backdoor-based watermarks in medical segmentation models to verify ownership under black-box API conditions

Model Theft vision
PDF Code
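
A minimal sketch of how a trigger-based, black-box ownership check might run, assuming a `query_api` callable that returns a binary mask from the suspect model; the trigger patch, IoU threshold, and vote rule are illustrative, not the paper's protocol.

```python
import numpy as np

def embed_trigger(image: np.ndarray, patch: np.ndarray, x: int = 0, y: int = 0) -> np.ndarray:
    """Stamp a small trigger patch into the corner of an input image."""
    out = image.copy()
    h, w = patch.shape[:2]
    out[y:y + h, x:x + w] = patch
    return out

def verify_ownership(query_api, images, patch, expected_mask,
                     iou_thresh=0.5, vote_thresh=0.8):
    """Query a suspect black-box segmentation API with triggered inputs.
    Ownership is claimed if the model reproduces the watermark response
    (a pre-defined mask) on a sufficient fraction of triggered queries."""
    hits = 0
    for img in images:
        pred = query_api(embed_trigger(img, patch))   # suspect model's binary mask
        inter = np.logical_and(pred, expected_mask).sum()
        union = np.logical_or(pred, expected_mask).sum() + 1e-8
        hits += (inter / union) >= iou_thresh
    return hits / len(images) >= vote_thresh
```
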
defense arXiv Jan 19, 2026 · 11w ago

Proxy Robustness in Vision Language Models is Effortlessly Transferable

Xiaowei Fu, Fuxiang Huang, Lei Zhang · Chongqing University · Lingnan University

Transfers adversarial robustness across heterogeneous CLIP variants via proxy distillation, boosting VLM defense without costly adversarial teacher training

Input Manipulation Attack vision multimodal
PDF Code
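
One plausible reading of proxy-based robustness transfer, sketched below: a student encoder is distilled to match a robust proxy's embeddings on both clean and adversarial views, avoiding adversarial training of the student from scratch. The cosine-alignment loss and `alpha` weighting are assumptions, not the authors' exact objective.

```python
import torch
import torch.nn.functional as F

def proxy_distill_step(student, robust_proxy, images, adv_images, optimizer, alpha=0.5):
    """One step of feature distillation from a robust proxy encoder to a
    (possibly differently architected) student encoder."""
    with torch.no_grad():
        t_clean = F.normalize(robust_proxy(images), dim=-1)
        t_adv = F.normalize(robust_proxy(adv_images), dim=-1)
    s_clean = F.normalize(student(images), dim=-1)
    s_adv = F.normalize(student(adv_images), dim=-1)
    # Align student features with the robust proxy on clean and adversarial views.
    loss = alpha * (1 - (s_clean * t_clean).sum(-1)).mean() \
         + (1 - alpha) * (1 - (s_adv * t_adv).sum(-1)).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```
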
attack arXiv Jan 19, 2026 · 11w ago

CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models

Nay Myat Min, Long H. Pham, Hongyu Zhang et al. · Singapore Management University · Chongqing University

Attacks LLM hallucination detectors by fine-tuning LoRA adapters to camouflage internal uncertainty, hidden-state, and attention signals

Output Integrity Attack nlp
PDF
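
The summary implies an optimization that suppresses detector-visible internal signals while preserving the model's outputs; a toy version of such a loss follows. `probe` stands in for the detector's internal-signal classifier and is hypothetical, and the KL anchor to a frozen reference model is an assumption.

```python
import torch
import torch.nn.functional as F

def camouflage_loss(hidden, probe, logits, ref_logits, beta=0.1):
    """Loss for fine-tuning (e.g., LoRA adapters) so that an internal-signal
    hallucination probe is driven toward 'confident', while a KL term keeps
    the output distribution close to the frozen base model."""
    detector_score = torch.sigmoid(probe(hidden)).mean()   # probe: hidden -> hallucination logit
    kl = F.kl_div(F.log_softmax(logits, -1), F.softmax(ref_logits, -1),
                  reduction="batchmean")
    return detector_score + beta * kl
```
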
survey ICICML Jan 18, 2026 · 11w ago

Adversarial Defense in Vision-Language Models: An Overview

Xiaowei Fu, Lei Zhang · Chongqing University

Surveys three adversarial defense paradigms for VLMs—training-time, test-time adaptation, and training-free—highlighting tradeoffs and open challenges

Input Manipulation Attack vision nlp multimodal
PDF
defense arXiv Jan 17, 2026 · 11w ago

RCDN: Real-Centered Detection Network for Robust Face Forgery Identification

Wyatt McCurdy, Xin Zhang, Yuqi Song et al. · University of Southern Maine · Chongqing University

Proposes a real-centered CNN detector for face forgeries that generalizes across unseen AI-generation techniques, including diffusion models

Output Integrity Attack vision
PDF
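
"Real-centered" training is commonly realized with a one-class center objective; the toy loss below pulls real-face features toward a learned center and pushes forgeries beyond a margin. This is a generic sketch, not the paper's exact formulation.

```python
import torch

def real_center_loss(features, center, labels, margin=1.0):
    """Pull real-face features toward a shared center; push forgery features
    beyond a margin. labels: 1 for real, 0 for forged."""
    d = (features - center).pow(2).sum(dim=1)
    real = labels.float()
    return (real * d + (1 - real) * torch.clamp(margin - d, min=0)).mean()
```
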
attack arXiv Jan 1, 2026 · Jan 2026

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

Zongwei Wang, Bincheng Gu, Hongyu Yu et al. · Chongqing University · The University of Queensland +2 more

Belief Poisoning Attack corrupts LLM agent profiles and memory so that agents treat humans as an outgroup, bypassing human-oriented safety behaviors

Prompt Injection Excessive Agency nlp
PDF Code
tool arXiv Dec 21, 2025 · Dec 2025

Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models

Zhang Wei, Peilu Hu, Zhenyuan Wei et al. · Independent Researcher · Ltd. +12 more

Automated red-teaming tool for LLMs using meta-prompt-guided adversarial generation, finding 3.9× more vulnerabilities than manual testing

Prompt Injection nlp
1 citation PDF
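
A skeleton of the meta-prompt-guided closed loop the summary describes, with `attacker`, `target`, and `judge` as opaque callables; the scoring and selection rules are placeholders, not the tool's actual heuristics.

```python
def red_team_loop(target, attacker, judge, seed_prompts, meta_prompt, rounds=3):
    """Closed-loop adversarial prompt search: an attacker LLM rewrites candidate
    prompts under a meta-prompt, a judge scores the target's replies, and the
    highest-scoring prompts seed the next round."""
    pool, findings = list(seed_prompts), []
    for _ in range(rounds):
        candidates = [attacker(meta_prompt, p) for p in pool]    # adversarial rewrites
        scored = [(judge(c, target(c)), c) for c in candidates]  # score 1.0 = unsafe reply
        findings += [c for s, c in scored if s >= 0.5]
        scored.sort(key=lambda sc: sc[0], reverse=True)
        pool = [c for _, c in scored[:len(pool)]]                # keep the strongest
    return findings
```
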
attack arXiv Dec 11, 2025 · Dec 2025

The Eminence in Shadow: Exploiting Feature Boundary Ambiguity for Robust Backdoor Attacks

Zhou Feng, Jiahao Chen, Chunyi Zhou et al. · Zhejiang University · Chongqing University +1 more

Theoretically-grounded backdoor attack exploiting decision boundary ambiguity achieves >90% ASR at just 0.01% poison rate

Model Poisoning vision
PDF Code
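
The boundary-ambiguity sample selection is the paper's contribution; the generic poisoning step it builds on looks roughly like the sketch below, which stamps a trigger on 0.01% of training samples and relabels them to the attacker's target class.

```python
import numpy as np

def poison_dataset(images, labels, target_class, trigger, rate=0.0001, seed=0):
    """Dirty-label backdoor poisoning: stamp a trigger patch onto a tiny random
    subset of training images and relabel them to the target class."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=max(1, int(rate * len(images))), replace=False)
    x, y = images.copy(), labels.copy()
    h, w = trigger.shape[:2]
    for i in idx:
        x[i, -h:, -w:] = trigger   # bottom-right trigger patch
        y[i] = target_class
    return x, y, idx
```
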
benchmark arXiv Dec 6, 2025 · Dec 2025

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia, Jie Liao, Qi Guo et al. · Nanyang Technological University · BraneMatrix AI +7 more

Unified benchmark and toolbox evaluating 13 attack methods and 15 defenses against multimodal jailbreaks across 18 open- and closed-source MLLMs

Prompt Injection multimodal nlp vision
5 citations PDF Code
defense arXiv Nov 20, 2025 · Nov 2025

Layer-wise Noise Guided Selective Wavelet Reconstruction for Robust Medical Image Segmentation

Yuting Lu, Ziliang Wang, Weixin Xu et al. · Chongqing University · Peking University

Defends medical image segmentation against PGD/SSAH adversarial attacks via layer-wise noise-guided selective wavelet reconstruction

Input Manipulation Attack vision
PDF
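
A fixed-rule approximation of selective wavelet reconstruction using PyWavelets: zero the finest detail bands, where adversarial noise tends to concentrate. The paper's layer-wise noise guidance chooses bands adaptively; the manual `keep` parameter here is a stand-in.

```python
import numpy as np
import pywt

def wavelet_purify(image: np.ndarray, wavelet="db4", level=3, keep=2):
    """Decompose the image, keep the `keep` coarsest detail levels, and zero
    the finest bands before reconstructing."""
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=level)
    for i in range(1 + keep, len(coeffs)):   # coeffs[0] is the approximation
        coeffs[i] = tuple(np.zeros_like(c) for c in coeffs[i])
    return pywt.waverec2(coeffs, wavelet=wavelet)
```
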
defense International Journal of Compu... Nov 14, 2025 · Nov 2025

Unsupervised Robust Domain Adaptation: Paradigm, Theory and Algorithm

Fuxiang Huang, Xiaowei Fu, Shiyu Ye et al. · Chongqing University · Lingnan University +3 more

Defends unsupervised domain adaptation models against adversarial attacks via disentangled distillation post-training

Input Manipulation Attack vision
PDF
attack arXiv Sep 24, 2025 · Sep 2025

Improving Generalizability and Undetectability for Targeted Adversarial Attacks on Multimodal Pre-trained Models

Zhifang Zhang, Jiahan Zhang, Shengjie Zhou et al. · Southeast University · Johns Hopkins University +3 more

Proposes Proxy Targeted Attack to craft generalizable, anomaly-evasive adversarial examples against multimodal encoders like ImageBind

Input Manipulation Attack vision multimodal nlp
2 citations PDF
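
Stripped of the proxy-target and anomaly-evasion terms that are the paper's contribution, the underlying targeted attack on a shared embedding space reduces to PGD toward a target embedding, sketched below under standard pixel-range and L-infinity constraints.

```python
import torch
import torch.nn.functional as F

def targeted_embedding_attack(encoder, x, target_emb,
                              eps=8 / 255, steps=100, lr=1 / 255):
    """PGD that pulls an image's embedding toward a chosen target embedding
    (e.g., a caption encoded into the same multimodal space)."""
    delta = torch.zeros_like(x, requires_grad=True)
    target = F.normalize(target_emb, dim=-1)
    for _ in range(steps):
        emb = F.normalize(encoder(x + delta), dim=-1)
        loss = (1 - (emb * target).sum(-1)).mean()   # cosine distance to target
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()          # descend: move toward target
            delta.clamp_(-eps, eps)
            delta.copy_((x + delta).clamp(0, 1) - x) # stay in valid pixel range
            delta.grad.zero_()
    return (x + delta).detach()
```
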
defense arXiv Sep 17, 2025 · Sep 2025

Scrub It Out! Erasing Sensitive Memorization in Code Language Models via Machine Unlearning

Zhaoyang Chu, Yao Wan, Zhikun Zhang et al. · Huazhong University of Science and Technology · Zhejiang University +4 more

Defends code LLMs against sensitive training data extraction by selectively unlearning memorized PII and credentials via gradient ascent

Model Inversion Attack Sensitive Information Disclosure nlp
PDF
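
A common recipe for this kind of targeted unlearning, shown as a sketch: gradient ascent on the memorized sensitive sequences (forget set) combined with ordinary descent on a retain set to preserve general coding ability. It assumes a Hugging Face-style causal LM whose forward pass returns `.loss`; the paper's selective variant is more careful than this.

```python
import torch

def unlearn_step(model, forget_batch, retain_batch, optimizer, lam=1.0):
    """One machine-unlearning step: ascend on the forget set, descend on the
    retain set."""
    forget_loss = model(**forget_batch).loss   # causal-LM loss on sensitive data
    retain_loss = model(**retain_batch).loss   # causal-LM loss on benign data
    loss = -forget_loss + lam * retain_loss    # negated term = gradient ascent
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return forget_loss.item(), retain_loss.item()
```
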
defense arXiv Aug 30, 2025 · Aug 2025

FreeTalk: A plug-and-play and black-box defense against speech synthesis attacks

Yuwen Pu, Zhou Feng, Chunyi Zhou et al. · Chongqing University · Zhejiang University

Adds frequency-domain adversarial perturbations to audio in a black-box setting to prevent voice cloning by VC/TTS models

Input Manipulation Attack audio
PDF
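
The sketch below only illustrates the frequency-domain injection idea, adding random noise confined to a mid/high band of the spectrum; FreeTalk optimizes its perturbation, and the band choice and scale here are arbitrary assumptions.

```python
import numpy as np

def frequency_perturb(waveform: np.ndarray, eps=0.002, bands=(0.3, 0.9), seed=0):
    """Add a small perturbation confined to mid/high frequency bands, where
    speaker-identity cues exploited by VC/TTS cloning tend to live."""
    rng = np.random.default_rng(seed)
    spec = np.fft.rfft(waveform)
    n = len(spec)
    lo, hi = int(bands[0] * n), int(bands[1] * n)
    noise = rng.standard_normal(hi - lo) + 1j * rng.standard_normal(hi - lo)
    spec[lo:hi] += eps * np.abs(spec[lo:hi]).mean() * noise
    return np.fft.irfft(spec, n=len(waveform)).astype(waveform.dtype)
```
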
tool arXiv Aug 18, 2025 · Aug 2025

Prompt-Induced Linguistic Fingerprints for LLM-Generated Fake News Detection

Chi Wang, Min Gao, Zongwei Wang et al. · Chongqing University · Emory University +1 more

Detects LLM-generated fake news by extracting prompt-induced linguistic fingerprints from reconstructed word-level probability distributions

Output Integrity Attack nlp
PDF Code
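
Reconstructing a word-level probability profile under a scoring LM might start like the sketch below (a Hugging Face-style causal LM and tokenizer are assumed); the paper's fingerprint features are richer than these summary statistics.

```python
import torch

def token_logprob_features(model, tokenizer, text: str):
    """Score each realized token's log-probability under a causal LM; summary
    statistics of this distribution serve as a crude linguistic fingerprint."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, :-1]                     # predicts next token
    logp = torch.log_softmax(logits, dim=-1)
    tok_lp = logp[torch.arange(ids.shape[1] - 1), ids[0, 1:]]  # realized tokens
    return {"mean": tok_lp.mean().item(), "std": tok_lp.std().item(),
            "min": tok_lp.min().item()}
```
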
attack arXiv Aug 8, 2025 · Aug 2025

Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Wenpeng Xing, Mohan Li, Chunqiang Hu et al. · Bingjiang Institute of Zhejiang University · Zhejiang University +3 more

White-box jailbreak fuses harmful and benign hidden states in latent space to bypass LLM safety alignment with 94% ASR

Input Manipulation Attack Prompt Injection nlp
PDF
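
A white-box sketch of the blending step: interpolate hidden states captured from a harmful and a harmless prompt, then inject the fused states back at one decoder layer with a forward hook. The layer choice, `alpha`, and hook mechanics are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def fuse_hidden_states(h_harmful, h_benign, alpha=0.6):
    """Blend hidden states from a harmful and a harmless prompt."""
    return alpha * h_harmful + (1 - alpha) * h_benign

def make_injection_hook(h_fused):
    """Forward hook that overwrites the last positions of a decoder layer's
    hidden states with the fused representation during generation."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -h_fused.shape[1]:, :] = h_fused
        return output
    return hook
```
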