ML Security Papers

Latest papers

33 papers

defense arXiv Apr 27, 2026 · 24d ago

LAVA: Layered Audio-Visual Anti-tampering Watermarking for Robust Deepfake Detection and Localization

Bokang Zeng, Zheng Gao, Xiaoyu Li et al. · UNSW Sydney · Griffith University

Audio-visual watermarking framework that detects and localizes deepfake tampering in videos while surviving compression and multimodal misalignment

Output Integrity Attack multimodal vision audio

PDF

attack arXiv Apr 14, 2026 · 5w ago

CIA: Inferring the Communication Topology from LLM-based Multi-Agent Systems

Yongxuan Wu, Xixun Lin, He Zhang et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more

Black-box attack inferring LLM multi-agent system communication topologies via adversarial queries, achieving 99% peak AUC

Model Theft Excessive Agency nlp

PDF Code

attack arXiv Apr 3, 2026 · 6w ago

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

Yubin Qu, Yi Liu, Tongcheng Geng et al. · Griffith University · Quantstamp +6 more

Supply-chain attack embedding malicious payloads in LLM agent skill documentation, achieving up to 33.5% bypass of defenses

AI Supply Chain Attacks Insecure Plugin Design Excessive Agency nlp

PDF

benchmark arXiv Apr 3, 2026 · 6w ago

Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

Zhihao Chen, Ying Zhang, Yi Liu et al. · Fujian Normal University · Wake Forest University +7 more

Large-scale analysis of 17K LLM agent skills finding 520 vulnerable to credential leakage via debug logging and prompt injection

AI Supply Chain Attacks Prompt Injection Insecure Plugin Design nlp

PDF

attack arXiv Mar 31, 2026 · 7w ago

SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark

Rui Bao, Zheng Gao, Xiaoyu Li et al. · University of New South Wales · Griffith University

Training-free attack that removes diffusion-based watermarks by deflecting generation trajectories, achieving 95-100% success across nine methods

Output Integrity Attack vision generative

PDF

attack arXiv Mar 18, 2026 · 9w ago

ARES: Scalable and Practical Gradient Inversion Attack in Federated Learning through Activation Recovery

Zirui Gong, Leo Yu Zhang, Yanjun Zhang et al. · Griffith University · Swinburne University of Technology +2 more

Gradient inversion attack reconstructing training data from federated learning updates via sparse activation recovery without architectural changes

Model Inversion Attack vision federated-learning

PDF

attack arXiv Mar 17, 2026 · 9w ago

Poisoning the Pixels: Revisiting Backdoor Attacks on Semantic Segmentation

Guangsheng Zhang, Huan Tian, Leo Zhang et al. · University of Technology Sydney · Griffith University +2 more

Backdoor framework for semantic segmentation introducing six attack vectors and optimized triggers, bypassing existing defenses

Model Poisoning Data Poisoning Attack vision

PDF

defense arXiv Mar 13, 2026 · 9w ago

SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

Zheng Gao, Yifan Yang, Xiaoyu Li et al. · University of New South Wales · Griffith University

Fine-grained semantic watermarking for diffusion models that embeds tamper-detectable signals across four semantic factors in initial noise

Output Integrity Attack vision generative

PDF

attack arXiv Feb 25, 2026 · 12w ago

Breaking Semantic-Aware Watermarks via LLM-Guided Coherence-Preserving Semantic Injection

Zheng Gao, Xiaoyu Li, Zhicheng Bao et al. · University of New South Wales · Griffith University

LLM-guided semantic injection attack that bypasses content-aware watermarks in diffusion-generated images by preserving global coherence while invalidating watermark bindings

Output Integrity Attack vision generative nlp

PDF

attack arXiv Feb 11, 2026 · Feb 2026

Transferable Backdoor Attacks for Code Models via Sharpness-Aware Adversarial Perturbation

Shuyu Chang, Haiping Huang, Yanjun Zhang et al. · Nanjing University of Posts and Telecommunications · State Key Laboratory of Tibetan Intelligence +5 more

Backdoor attack on code models using sharpness-aware training and Gumbel-Softmax triggers for cross-dataset transferability and stealthiness

Model Poisoning nlp

PDF

Code models are increasingly adopted in software development but remain vulnerable to backdoor attacks via poisoned training data. Existing backdoor attacks on code models face a fundamental trade-off between transferability and stealthiness. Static trigger-based attacks insert fixed dead code patterns that transfer well across models and datasets but are easily detected by code-specific defenses. In contrast, dynamic trigger-based attacks adaptively generate context-aware triggers to evade detection but suffer from poor cross-dataset transferability. Moreover, they rely on unrealistic assumptions of identical data distributions between poisoned and victim training data, limiting their practicality. To overcome these limitations, we propose Sharpness-aware Transferable Adversarial Backdoor (STAB), a novel attack that achieves both transferability and stealthiness without requiring complete victim data. STAB is motivated by the observation that adversarial perturbations in flat regions of the loss landscape transfer more effectively across datasets than those in sharp minima. To this end, we train a surrogate model using Sharpness-Aware Minimization to guide model parameters toward flat loss regions, and employ Gumbel-Softmax optimization to enable differentiable search over discrete trigger tokens for generating context-aware adversarial triggers. Experiments across three datasets and two code models show that STAB outperforms prior attacks in terms of transferability and stealthiness. It achieves a 73.2% average attack success rate after defense, outperforming static trigger-based attacks that fail under defense. STAB also surpasses the best dynamic trigger-based attack by 12.4% in cross-dataset attack success rate and maintains performance on clean inputs.

transformer Nanjing University of Posts and Telecommunications · State Key Laboratory of Tibetan Intelligence · Jiangsu Provincial Key Laboratory of Internet of Things Intelligent Perception and Computing +4 more

PDF arXiv DOI
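
The abstract above mentions Gumbel-Softmax optimization as a way to make a search over discrete tokens differentiable. The following is a minimal sketch of the generic Gumbel-Softmax relaxation only, not the STAB implementation: the function name `gumbel_softmax`, the logits, and the temperatures are illustrative assumptions, and a real setup would use an autodiff framework rather than plain Python lists.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return a relaxed (soft) one-hot sample over the given logits.

    Gumbel(0, 1) noise is added to each logit, then a temperature-scaled
    softmax is applied. Lower temperatures give samples closer to one-hot.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax.
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical unnormalized scores for four candidate tokens.
logits = [2.0, 0.5, 0.1, -1.0]

# Reusing the same seed gives both calls identical Gumbel noise, so the
# only difference between the two samples is the temperature.
soft = gumbel_softmax(logits, temperature=5.0, rng=random.Random(0))
sharp = gumbel_softmax(logits, temperature=0.1, rng=random.Random(0))

print("temperature 5.0:", [round(p, 3) for p in soft])
print("temperature 0.1:", [round(p, 3) for p in sharp])
```

With identical noise, the low-temperature sample concentrates more mass on its largest entry than the high-temperature one, which is why such schemes typically anneal the temperature toward a discrete choice.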

benchmark arXiv Feb 6, 2026 · Feb 2026

Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study

Yi Liu, Zhihao Chen, Yanjun Zhang et al. · Quantstamp · Fujian Normal University +4 more

Empirical study of 98,380 LLM agent skills finds 157 malicious ones using supply chain theft and instruction hijacking

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection nlp

2 citations 1 influential PDF

attack arXiv Feb 2, 2026 · Feb 2026

Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Attacks

Bohan Wang, Zewen Liu, Lu Lin et al. · Emory University · The Pennsylvania State University +2 more

Adversarially decouples time series classifier predictions from explanations, enabling targeted misclassification with plausible-looking cover-up explanations

Input Manipulation Attack timeseries

PDF

defense arXiv Jan 28, 2026 · Jan 2026

UnlearnShield: Shielding Forgotten Privacy against Unlearning Inversion

Lulu Xue, Shengshan Hu, Wei Lu et al. · Huazhong University of Science and Technology · Institute of Guizhou Aerospace Measuring and Testing Technology +2 more

Defends machine unlearning against inversion attacks that reconstruct erased training data via cosine-space perturbations

Model Inversion Attack vision

PDF

attack arXiv Jan 21, 2026 · Jan 2026

Beyond Denial-of-Service: The Puppeteer's Attack for Fine-Grained Control in Ranking-Based Federated Learning

Zhihao Chen, Zirui Gong, Jianting Ning et al. · Fujian Normal University · Griffith University

Novel federated poisoning attack precisely degrades global model accuracy to any target level while evading Byzantine-robust aggregation defenses

Data Poisoning Attack federated-learning

PDF Code

defense arXiv Jan 21, 2026 · Jan 2026

Erosion Attack for Adversarial Training to Enhance Semantic Segmentation Robustness

Yufei Song, Ziqi Zhou, Menghao Deng et al. · Huazhong University of Science and Technology · National University of Singapore +1 more

Proposes erosion-based adversarial attack on segmentation models that propagates perturbations from low- to high-confidence pixels, used to strengthen adversarial training robustness

Input Manipulation Attack vision

PDF

attack arXiv Jan 17, 2026 · Jan 2026

Gradient Structure Estimation under Label-Only Oracles via Spectral Sensitivity

Jun Liu, Leo Yu Zhang, Fengpeng Li et al. · University of Macau · National Institute of Informatics +2 more

Hard-label black-box adversarial attack using frequency-domain initialization and pattern-driven optimization to recover gradient sign information

Input Manipulation Attack vision

PDF Code

Hard-label black-box settings, where only top-1 predicted labels are observable, pose a fundamentally constrained yet practically important feedback model for understanding model behavior. A central challenge in this regime is whether meaningful gradient information can be recovered from such discrete responses. In this work, we develop a unified theoretical perspective showing that a wide range of existing sign-flipping hard-label attacks can be interpreted as implicitly approximating the sign of the true loss gradient. This observation reframes hard-label attacks from heuristic search procedures into instances of gradient sign recovery under extremely limited feedback. Motivated by this first-principles understanding, we propose a new attack framework that combines a zero-query frequency-domain initialization with a Pattern-Driven Optimization (PDO) strategy. We establish theoretical guarantees demonstrating that, under mild assumptions, our initialization achieves higher expected cosine similarity to the true gradient sign compared to random baselines, while the proposed PDO procedure attains substantially lower query complexity than existing structured search approaches. We empirically validate our framework through extensive experiments on CIFAR-10, ImageNet, and ObjectNet, covering standard and adversarially trained models, commercial APIs, and CLIP-based models. The results show that our method consistently surpasses SOTA hard-label attacks in both attack success rate and query efficiency, particularly in low-query regimes. Beyond image classification, our approach generalizes effectively to corrupted data, biomedical datasets, and dense prediction tasks. Notably, it also successfully circumvents Blacklight, a SOTA stateful defense, resulting in a $0\%$ detection rate. Our code will be released publicly soon at https://github.com/csjunjun/DPAttack.git.

cnn transformer University of Macau · National Institute of Informatics · Griffith University +1 more

PDF arXiv DOI Code
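
The abstract above measures initialization quality by cosine similarity to the true gradient sign. The sketch below only illustrates that metric on synthetic vectors; it is not the paper's attack or its PDO procedure. The helper names, the dimension, and the 70% agreement level are assumptions chosen for the example. For two vectors with entries in {-1, +1}, cosine similarity equals 2 × (fraction of matching coordinates) − 1.

```python
import math
import random


def sign(values):
    """Map each entry to +1.0 or -1.0 (zero is treated as +1.0)."""
    return [1.0 if v >= 0 else -1.0 for v in values]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def agreement(a, b):
    """Fraction of coordinates on which two sign vectors match."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


rng = random.Random(0)
dim = 1000

# Synthetic stand-in for a true loss gradient and its sign pattern.
true_grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]
target = sign(true_grad)

# Baseline: a uniformly random sign vector.
random_guess = [rng.choice([-1.0, 1.0]) for _ in range(dim)]

# Hypothetical estimate that matches the target on 70% of coordinates.
estimate = [t if i < 700 else -t for i, t in enumerate(target)]

print("random baseline cosine:", round(cosine(random_guess, target), 3))
print("70%-agreement cosine:  ", round(cosine(estimate, target), 3))
```

The random baseline sits near zero in high dimensions, so even a modest improvement in coordinate-wise sign agreement shows up as a clearly positive cosine similarity.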

attack arXiv Jan 17, 2026 · Jan 2026

Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models

Xiaomei Zhang, Zhaoxi Zhang, Leo Yu Zhang et al. · Griffith University · University of Technology Sydney +1 more

Adversarial attack exploits visual token compression in VLMs by perturbing token importance rankings, causing failures only under compressed inference

Input Manipulation Attack Prompt Injection vision nlp multimodal

PDF

tool arXiv Jan 15, 2026 · Jan 2026

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu, Weizhe Wang, Ruitao Feng et al. · Nanyang Technological University · Tianjin University +4 more

Scans 31K AI agent skills from marketplaces, finding 26% contain vulnerabilities including prompt injection, data exfiltration, and supply chain risks

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection nlp

8 citations 2 influential PDF

defense arXiv Dec 21, 2025 · Dec 2025

Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection

Junjun Pan, Yixin Liu, Rui Miao et al. · Griffith University · Jilin University +1 more

Defends LLM multi-agent systems by detecting malicious agents using bi-level graph anomaly detection with token-level explainability

Excessive Agency nlp graph

1 citation PDF

attack arXiv Dec 18, 2025 · Dec 2025

Dual-View Inference Attack: Machine Unlearning Amplifies Privacy Exposure

Lulu Xue, Shengshan Hu, Linqiang Qian et al. · Huazhong University of Science and Technology · Tsinghua University +4 more

Novel black-box MIA exploits dual-model access after unlearning to infer membership of retained data via likelihood ratio inference

Membership Inference Attack vision

2 citations PDF
