ML Security Papers

Latest papers

27 papers

defense arXiv Apr 24, 2026 · 27d ago

Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets

Yuan Xiao, Jiaming Wang, Yuchen Chen et al. · Nanjing University · University of New South Wales +3 more

Dataset poisoning defense that injects compilable, functionality-preserving code fragments to degrade CodeLLM training with only 10% contamination

Data Poisoning Attack Training Data Poisoning nlp

PDF

attack arXiv Apr 21, 2026 · 4w ago

If you're waiting for a sign... that might not be it! Mitigating Trust Boundary Confusion from Visual Injections on Vision-Language Agentic Systems

Jiamin Chang, Minhui Xue, Ruoxi Sun et al. · University of New South Wales · CSIRO's Data61 +1 more

Visual injection attacks on VLM agents that exploit trust boundary confusion between legitimate environmental cues and malicious visual prompts

Input Manipulation Attack Prompt Injection Excessive Agency visionmultimodal

PDF Code

defense arXiv Apr 20, 2026 · 4w ago

Hierarchically Robust Zero-shot Vision-language Models

Junhao Dong, Yifei Zhang, Hao Zhu et al. · Nanyang Technological University · A*STAR +3 more

Hierarchical adversarial fine-tuning for VLMs using hyperbolic embeddings to defend against attacks on both base and superclasses

Input Manipulation Attack visionmultimodal

PDF

attack arXiv Apr 3, 2026 · 6w ago

Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

Yubin Qu, Yi Liu, Tongcheng Geng et al. · Griffith University · Quantstamp +6 more

Supply-chain attack embedding malicious payloads in LLM agent skill documentation, achieving up to 33.5% bypass of defenses

AI Supply Chain Attacks Insecure Plugin Design Excessive Agency nlp

PDF

benchmark arXiv Apr 3, 2026 · 6w ago

Credential Leakage in LLM Agent Skills: A Large-Scale Empirical Study

Zhihao Chen, Ying Zhang, Yi Liu et al. · Fujian Normal University · Wake Forest University +7 more

Large-scale analysis of 17K LLM agent skills finding 520 vulnerable to credential leakage via debug logging and prompt injection

AI Supply Chain Attacks Prompt Injection Insecure Plugin Design nlp

PDF

attack arXiv Mar 31, 2026 · 7w ago

SHIFT: Stochastic Hidden-Trajectory Deflection for Removing Diffusion-based Watermark

Rui Bao, Zheng Gao, Xiaoyu Li et al. · University of New South Wales · Griffith University

Training-free attack that removes diffusion-based watermarks by deflecting generation trajectories, achieving 95-100% success across nine methods

Output Integrity Attack visiongenerative

PDF

defense arXiv Mar 25, 2026 · 8w ago

Enhancing and Reporting Robustness Boundary of Neural Code Models for Intelligent Code Understanding

Tingxu Han, Wei Song, Weisong Sun et al. · Nanjing University · University of New South Wales +2 more

Black-box certified defense for code models using randomized smoothing to reduce adversarial attack success from 42% to 9.74%

Input Manipulation Attack nlp

PDF

With the development of deep learning, Neural Code Models (NCMs) such as CodeBERT and CodeLlama are widely used for code understanding tasks, including defect detection and code classification. However, recent studies have revealed that NCMs are vulnerable to adversarial examples, inputs with subtle perturbations that induce incorrect predictions while remaining difficult to detect. Existing defenses address this issue via data augmentation to empirically improve robustness, but they are costly, offer no theoretical robustness guarantees, and typically require white-box access to model internals, such as gradients. To address the above challenges, we propose ENBECOME, a novel black-box training-free and lightweight adversarial defense. ENBECOME is designed to both enhance empirical robustness and report certified robustness boundaries for NCMs. ENBECOME operates solely during inference, introducing random, semantics-preserving perturbations to input code snippets to smooth the NCM's decision boundaries. This smoothing enables ENBECOME to formally certify a robustness radius within which adversarial examples can never induce misclassification, a property known as certified robustness. We conduct comprehensive experiments across multiple NCM architectures and tasks. Results show that ENBECOME significantly reduces attack success rates while maintaining high accuracy. For example, in defect detection, it reduces the average ASR from 42.43% to 9.74% with only a 0.29% drop in accuracy. Results show that ENBECOME significantly reduces attack success rates while maintaining high accuracy. For example, in defect detection, it reduces the average ASR from 42.43% to 9.74% with only a 0.29% drop in accuracy. Furthermore, ENBECOME achieves an average certified robustness radius of 1.63, meaning that adversarial modifications to no more than 1.63 identifiers are provably ineffective.

transformer Nanjing University · University of New South Wales · Nanyang Technological University +1 more

PDF arXiv

defense arXiv Mar 13, 2026 · 9w ago

SLICE: Semantic Latent Injection via Compartmentalized Embedding for Image Watermarking

Zheng Gao, Yifan Yang, Xiaoyu Li et al. · University of New South Wales · Griffith University

Fine-grained semantic watermarking for diffusion models that embeds tamper-detectable signals across four semantic factors in initial noise

Output Integrity Attack visiongenerative

PDF

attack arXiv Feb 25, 2026 · 12w ago

Breaking Semantic-Aware Watermarks via LLM-Guided Coherence-Preserving Semantic Injection

Zheng Gao, Xiaoyu Li, Zhicheng Bao et al. · University of New South Wales · Griffith University

LLM-guided semantic injection attack that bypasses content-aware watermarks in diffusion-generated images by preserving global coherence while invalidating watermark bindings

Output Integrity Attack visiongenerativenlp

PDF

tool arXiv Feb 21, 2026 · 12w ago

FOCA: Frequency-Oriented Cross-Domain Forgery Detection, Localization and Explanation via Multi-Modal Large Language Model

Zhou Liu, Tonghua Su, Hongshi Zhang et al. · Harbin Institute of Technology · DZ-Matrix +3 more

Multimodal LLM system detects and localizes AI-generated image forgeries by fusing RGB and frequency-domain forensic features

Output Integrity Attack visionmultimodal

PDF

benchmark arXiv Feb 6, 2026 · Feb 2026

Malicious Agent Skills in the Wild: A Large-Scale Security Empirical Study

Yi Liu, Zhihao Chen, Yanjun Zhang et al. · Quantstamp · Fujian Normal University +4 more

Empirical study of 98,380 LLM agent skills finds 157 malicious ones using supply chain theft and instruction hijacking

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection nlp

2 citations 1 influentialPDF

defense arXiv Feb 4, 2026 · Feb 2026

SIDeR: Semantic Identity Decoupling for Unrestricted Face Privacy

Zhuosen Bao, Xia Du, Zheng Lin et al. · Xiamen University of Technology · University of Hong Kong +8 more

Generates unrestricted adversarial faces using diffusion models to evade facial recognition with 99% black-box success rate

Input Manipulation Attack visiongenerative

PDF

survey arXiv Feb 2, 2026 · Feb 2026

Human Society-Inspired Approaches to Agentic AI Security: The 4C Framework

Alsharif Abuadbba, Nazatul Sultan, Surya Nepal et al. · CSIRO's Data61 · University of New South Wales

Proposes the 4C Framework to systematically organize and govern agentic AI security risks across Core, Connection, Cognition, and Compliance dimensions

Excessive Agency Prompt Injection Insecure Plugin Design nlp

PDF

defense arXiv Jan 27, 2026 · Jan 2026

SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

Nirhoshan Sivaroopan, Kanchana Thilakarathna, Albert Zomaya et al. · University of New South Wales · University of Wollongong

Multi-agent auto-healing defense framework that detects and adapts to sponge attacks exhausting LLM compute resources

Model Denial of Service nlp

PDF

attack arXiv Jan 19, 2026 · Jan 2026

DUAP: Dual-task Universal Adversarial Perturbations Against Voice Control Systems

Suyang Sun, Weifei Jin, Yuxin Cao et al. · Beijing University of Posts and Telecommunications · National University of Singapore +1 more

Universal adversarial audio perturbations that simultaneously fool ASR transcription and speaker recognition in voice control systems

Input Manipulation Attack audio

PDF Code

tool arXiv Jan 15, 2026 · Jan 2026

Agent Skills in the Wild: An Empirical Study of Security Vulnerabilities at Scale

Yi Liu, Weizhe Wang, Ruitao Feng et al. · Nanyang Technological University · Tianjin University +4 more

Scans 31K AI agent skills from marketplaces, finding 26% contain vulnerabilities including prompt injection, data exfiltration, and supply chain risks

AI Supply Chain Attacks Insecure Plugin Design Prompt Injection nlp

8 citations 2 influentialPDF

defense arXiv Jan 5, 2026 · Jan 2026

FMVP: Masked Flow Matching for Adversarial Video Purification

Duoxun Tang, Xueyi Zhang, Chak Hin Wang et al. · Tsinghua University · The Chinese University of Hong Kong +2 more

Defends video recognition models against PGD and CW attacks via flow-matching purification with masking and frequency-gated loss

Input Manipulation Attack vision

PDF

benchmark arXiv Dec 29, 2025 · Dec 2025

Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark

Manu, Yi Guo, Kanchana Thilakarathna et al. · The University of Sydney · University of New South Wales +1 more

Benchmarks black-box LLM DoS attacks using evolutionary and RL-based prompt search to suppress EOS and inflate output length

Model Denial of Service nlp

1 citations 1 influentialPDF

attack TrustCom Nov 17, 2025 · Nov 2025

ForgeDAN: An Evolutionary Framework for Jailbreaking Aligned Large Language Models

Siyang Cheng, Gaotian Liu, Rui Mei et al. · iFLYTEK · Anhui SparkShield Intelligent Technology +5 more

Evolutionary jailbreak framework using multi-level text perturbations and semantic fitness to bypass LLM alignment at high success rates

Prompt Injection nlp

PDF

attack arXiv Sep 25, 2025 · Sep 2025

Poisoning Prompt-Guided Sampling in Video Large Language Models

Yuxin Cao, Wei Song, Jingling Xue et al. · National University of Singapore · University of New South Wales +1 more

Black-box adversarial perturbation attack suppresses harmful frame selection in VideoLLM prompt-guided sampling, achieving 82–99% success

Input Manipulation Attack Prompt Injection visionnlpmultimodal

1 citations PDF

Loading more papers…