Latest papers

20 papers
attack arXiv Mar 17, 2026 · 22d ago

Poisoning the Pixels: Revisiting Backdoor Attacks on Semantic Segmentation

Guangsheng Zhang, Huan Tian, Leo Zhang et al. · University of Technology Sydney · Griffith University +2 more

Backdoor framework for semantic segmentation that introduces six attack vectors and optimized triggers, bypassing existing defenses

Model Poisoning Data Poisoning Attack vision
PDF
defense arXiv Mar 13, 2026 · 26d ago

Why Neural Structural Obfuscation Can't Kill White-Box Watermarks for Good!

Yanna Jiang, Guangsheng Yu, Qingyuan Yu et al. · University of Technology Sydney · Independent +2 more

Defeats Neural Structural Obfuscation attacks on model watermarks by canonicalizing neural networks to restore watermark verification (permutation-canonicalization sketch below)

Model Theft vision
PDF Code
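The core idea, undoing structural obfuscation by mapping a network back to a canonical form, can be illustrated for the simplest case: a hidden-neuron permutation in a fully connected layer. A minimal sketch, not the paper's method; the canonical ordering by incoming-weight norm and the restriction to pure permutations are assumptions for illustration.

```python
import numpy as np

def canonicalize_layer(W1, b1, W2):
    """Undo a hidden-neuron permutation in a 2-layer MLP (y = W2 @ act(W1 @ x + b1))
    by re-ordering hidden units into a fixed canonical order: descending L2 norm
    of their incoming weights. The network stays functionally identical, but its
    layout no longer depends on the obfuscator's permutation."""
    order = np.argsort(-np.linalg.norm(W1, axis=1))  # one norm per hidden unit
    return W1[order], b1[order], W2[:, order]        # permute rows and matching columns
```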
defense arXiv Mar 9, 2026 · 4w ago

Client-Cooperative Split Learning

Haiyu Deng, Yanna Jiang, Guangsheng Yu et al. · University of Technology Sydney · CSIRO Data61 +1 more

Defends split learning against activation inversion, label clustering, and model extraction via differential privacy and chained watermarking

Model Inversion Attack Model Theft federated-learning vision
PDF
attack arXiv Mar 1, 2026 · 5w ago

Turning Black Box into White Box: Dataset Distillation Leaks

Huajie Chen, Tianqing Zhu, Yuchen Zhong et al. · City University of Macau · CISPA Helmholtz Center for Information Security +2 more

Reveals that dataset distillation leaks training data via a three-stage attack: architecture inference, membership inference, and model inversion

Model Inversion Attack Membership Inference Attack vision
PDF
attack arXiv Feb 28, 2026 · 5w ago

Learning to Attack: A Bandit Approach to Adversarial Context Poisoning

Ray Telikani, Amir H. Gandomi · University of Technology Sydney

Black-box context poisoning attack on neural contextual bandits via inverse-RL surrogate modeling and GP-UCB-guided PGD perturbations (generic PGD sketch below)

Input Manipulation Attack reinforcement-learning
PDF
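The PGD component is a standard primitive; here is a minimal sketch of an L∞ PGD loop on a context vector, assuming a differentiable surrogate reward model. The paper's inverse-RL surrogate and GP-UCB guidance are not reproduced; `surrogate`, `target_arm`, and all hyperparameters are illustrative.

```python
import torch

def pgd_perturb(surrogate, context, target_arm, eps=0.1, alpha=0.02, steps=10):
    """Generic projected gradient descent on a 1-D context tensor.
    `surrogate` is any differentiable stand-in mapping a context to per-arm
    reward scores; the attacker pushes up the score of `target_arm`."""
    x0 = context.detach()
    x = x0.clone()
    for _ in range(steps):
        x.requires_grad_(True)
        scores = surrogate(x)                   # predicted reward per arm
        loss = -scores[target_arm]              # maximize the target arm's score
        grad = torch.autograd.grad(loss, x)[0]
        with torch.no_grad():
            x = x - alpha * grad.sign()         # signed gradient step
            x = x0 + (x - x0).clamp(-eps, eps)  # project back into the eps-ball
    return x.detach()
```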
survey arXiv Feb 24, 2026 · 6w ago

SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

Yanna Jiang, Delong Li, Haiyu Deng et al. · University of Technology Sydney · CSIRO

Surveys LLM agentic-skill security, covering marketplace supply-chain attacks, prompt injection via skill payloads, and trust-tiered execution

AI Supply Chain Attacks Prompt Injection Insecure Plugin Design nlp reinforcement-learning
PDF
attack arXiv Feb 11, 2026 · 8w ago

Transferable Backdoor Attacks for Code Models via Sharpness-Aware Adversarial Perturbation

Shuyu Chang, Haiping Huang, Yanjun Zhang et al. · Nanjing University of Posts and Telecommunications · State Key Laboratory of Tibetan Intelligence +5 more

Backdoor attack on code models using sharpness-aware training and Gumbel-Softmax triggers for cross-dataset transferability and stealthiness

Model Poisoning nlp
PDF
defense arXiv Jan 28, 2026 · 10w ago

UnlearnShield: Shielding Forgotten Privacy against Unlearning Inversion

Lulu Xue, Shengshan Hu, Wei Lu et al. · Huazhong University of Science and Technology · Institute of Guizhou Aerospace Measuring and Testing Technology +2 more

Defends machine unlearning against inversion attacks that reconstruct erased training data via cosine-space perturbations

Model Inversion Attack vision
PDF
attack arXiv Jan 17, 2026 · 11w ago

Less Is More -- Until It Breaks: Security Pitfalls of Vision Token Compression in Large Vision-Language Models

Xiaomei Zhang, Zhaoxi Zhang, Leo Yu Zhang et al. · Griffith University · University of Technology Sydney +1 more

Adversarial attack exploits visual token compression in VLMs by perturbing token importance rankings, causing failures only under compressed inference

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF
attack arXiv Dec 18, 2025 · Dec 2025

Dual-View Inference Attack: Machine Unlearning Amplifies Privacy Exposure

Lulu Xue, Shengshan Hu, Linqiang Qian et al. · Huazhong University of Science and Technology · Tsinghua University +4 more

Black-box membership inference attack exploiting dual-model access after unlearning to infer membership of retained data via likelihood-ratio inference (generic sketch below)

Membership Inference Attack vision
2 citations PDF
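Likelihood-ratio membership inference is a known recipe; the toy sketch below shows one way two model views could be combined, assuming Gaussian fits to shadow-model losses. The paper's actual statistic and shadow setup may differ, and the `shadows` dictionary keys are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def lr_score(loss, member_losses, nonmember_losses):
    """Standard likelihood-ratio score: compare an observed per-sample loss
    against Gaussians fit to shadow-model losses of members vs. non-members."""
    mu_in, sd_in = member_losses.mean(), member_losses.std() + 1e-8
    mu_out, sd_out = nonmember_losses.mean(), nonmember_losses.std() + 1e-8
    return norm.logpdf(loss, mu_in, sd_in) - norm.logpdf(loss, mu_out, sd_out)

def dual_view_score(loss_before, loss_after, shadows):
    """Illustrative dual-view score: sum the likelihood ratios from the
    pre-unlearning and post-unlearning models (shadow stats assumed given)."""
    s1 = lr_score(loss_before, shadows["in_before"], shadows["out_before"])
    s2 = lr_score(loss_after, shadows["in_after"], shadows["out_after"])
    return s1 + s2  # higher score = more likely a member
```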
benchmark arXiv Dec 16, 2025 · Dec 2025

Black-Box Auditing of Quantum Model: Lifted Differential Privacy with Quantum Canaries

Baobao Song, Shiva Raj Pokhrel, Athanasios V. Vasilakos et al. · University of Technology Sydney · Deakin University +2 more

Black-box canary framework audits quantum ML models for memorization, empirically lower-bounding privacy leakage via quantum differential privacy

Membership Inference Attack
PDF
defense arXiv Dec 8, 2025 · Dec 2025

AdLift: Lifting Adversarial Perturbations to Safeguard 3D Gaussian Splatting Assets Against Instruction-Driven Editing

Ziming Hong, Tianyu Huang, Runnan Chen et al. · The University of Sydney · University of Technology Sydney +3 more

Defends 3D Gaussian Splatting assets from AI editing by lifting adversarial perturbations from 2D image space into 3D Gaussian parameters

Input Manipulation Attack vision generative
4 citations PDF Code
defense arXiv Nov 24, 2025 · Nov 2025

SpectraNet: FFT-assisted Deep Learning Classifier for Deepfake Face Detection

Nithira Jayarathne, Naveen Basnayake, Keshawa Jayasundara et al. · University of Moratuwa · University of Technology Sydney

Proposes an EfficientNet-B6 + FFT hybrid detector for deepfake faces, achieving 91% accuracy with balanced batch training (illustrative sketch below)

Output Integrity Attack vision
PDF
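A generic sketch of an FFT-assisted hybrid detector, concatenating backbone features with a pooled log-magnitude spectrum. Assumptions: the smaller efficientnet_b0 stands in for the paper's EfficientNet-B6, and the fusion head and pooling size are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class FFTAssistedDetector(nn.Module):
    """Illustrative FFT-assisted deepfake classifier: CNN features are fused
    with a pooled log-magnitude frequency spectrum of the input image."""
    def __init__(self):
        super().__init__()
        self.backbone = models.efficientnet_b0(weights=None)
        self.backbone.classifier = nn.Identity()   # expose 1280-dim features
        self.spec_pool = nn.AdaptiveAvgPool2d(8)   # 8x8 pooled spectrum
        self.head = nn.Linear(1280 + 8 * 8, 2)     # real vs. fake

    def forward(self, x):                           # x: (B, 3, H, W)
        feats = self.backbone(x)                    # (B, 1280)
        gray = x.mean(dim=1)                        # (B, H, W) grayscale
        spec = torch.fft.fftshift(torch.fft.fft2(gray), dim=(-2, -1))
        logmag = torch.log1p(spec.abs()).unsqueeze(1)   # (B, 1, H, W)
        spec_feat = self.spec_pool(logmag).flatten(1)   # (B, 64)
        return self.head(torch.cat([feats, spec_feat], dim=1))
```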
defense arXiv Nov 21, 2025 · Nov 2025

MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models

Yuqi Li, Junhao Dong, Chuanguang Yang et al. · Nanyang Technological University · Institute of Computing Technology +4 more

Defends VLMs against adversarial examples via dual multi-teacher distillation, gaining +4.32% robust accuracy with a 2.3x training speedup

Input Manipulation Attack vision multimodal
2 citations PDF Code
benchmark arXiv Oct 18, 2025 · Oct 2025

Scaling Laws for Deepfake Detection

Wenhao Wang, Longqi Cai, Taihong Xiao et al. · University of Technology Sydney · Google DeepMind

Discovers power-law scaling for deepfake detection using ScaleDF, the largest such dataset with 14M+ images across 51 real domains and 102 generation methods (fit sketch below)

Output Integrity Attack vision generative
1 citation PDF
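Power-law scaling fits of the kind reported are typically done in log-log space; a minimal sketch, where the inputs are placeholders supplied by the caller, not ScaleDF measurements.

```python
import numpy as np

def fit_power_law(dataset_sizes, errors):
    """Least-squares fit of err ≈ a * N**(-b) by linear regression in log-log
    space. Returns (a, b); b > 0 means error shrinks as data scales up."""
    slope, log_a = np.polyfit(np.log(dataset_sizes), np.log(errors), 1)
    return np.exp(log_a), -slope
```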
attack NDSS Sep 11, 2025 · Sep 2025

Character-Level Perturbations Disrupt LLM Watermarks

Zhaoxi Zhang, Xiaomei Zhang, Yanjun Zhang et al. · University of Technology Sydney · Griffith University +1 more

Attacks LLM text watermarks via character-level perturbations that disrupt tokenization, defeating five watermarking schemes with minimal detector access (generic sketch below)

Output Integrity Attack nlp
PDF
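The general mechanism, small character-level edits that change how watermarked text tokenizes, can be sketched with zero-width-space insertion; the paper's perturbation budget and search strategy are not reproduced here, and the insertion rate is an illustrative assumption.

```python
import random

ZWSP = "\u200b"  # zero-width space: invisible to readers, visible to tokenizers

def perturb(text, rate=0.05, seed=0):
    """Randomly insert zero-width spaces after characters so the text looks
    unchanged but splits into different tokens, degrading watermark detection."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        if ch != " " and rng.random() < rate:
            out.append(ZWSP)
    return "".join(out)
```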
defense arXiv Sep 2, 2025 · Sep 2025

Privacy-Utility Trade-off in Data Publication: A Bilevel Optimization Framework with Curvature-Guided Perturbation

Yi Yin, Guangquan Zhang, Hua Zuo et al. · University of Technology Sydney

Bilevel optimization framework that perturbs training data via manifold curvature guidance to defend against membership inference attacks while preserving downstream utility

Membership Inference Attack vision generative
PDF
defense arXiv Aug 6, 2025 · Aug 2025

Isolate Trigger: Detecting and Eliminating Adaptive Backdoor Attacks

Chengrui Sun, Hua Zhang, Haoran Gao et al. · Beijing University of Posts and Telecommunications · China Mobile Research Institute +2 more

Defends against adaptive backdoor attacks by isolating hidden triggers from benign features and applying unlearning-based model repair

Model Poisoning vision
PDF
defense arXiv Aug 3, 2025 · Aug 2025

MiraGe: Multimodal Discriminative Representation Learning for Generalizable AI-Generated Image Detection

Kuo Shi, Jie Lu, Shanshan Ye et al. · University of Technology Sydney

Proposes CLIP-based discriminative representation learning to detect AI-generated images, generalizing to unseen generators such as Sora

Output Integrity Attack vision multimodal
PDF
survey arXiv Jan 2, 2025 · Jan 2025

State-of-the-art AI-based Learning Approaches for Deepfake Generation and Detection, Analyzing Opportunities, Threading through Pros, Cons, and Future Prospects

Harshika Goyal, Mohammad Saif Wajid, Mohd Anas Wajid et al. · Indian Institute of Technology · Tecnológico de Monterrey +6 more

Surveys ~400 papers on deepfake generation (GANs, VAEs, Transformers) and detection, covering benchmark datasets and future challenges

Output Integrity Attack vision generative
5 citations PDF