ML Security Papers

Latest papers

33 papers

survey arXiv Apr 30, 2026 · 21d ago

Security Attack and Defense Strategies for Autonomous Agent Frameworks: A Layered Review with OpenClaw as a Case Study

Luyao Xu, Xiang Chen · Nantong University · Nanjing University

Layered security review of LLM agent frameworks covering prompt injection, tool misuse, state persistence attacks, and ecosystem vulnerabilities

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

defense arXiv Apr 30, 2026 · 21d ago

PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

Haocheng Huang, Yuchen Chen, Weisong Sun et al. · Soochow University · Nanjing University +1 more

Dataset watermarking scheme embedding stealth marks in code via variable name patterns to prove training data ownership

Output Integrity Attack nlp

PDF

attack arXiv Apr 30, 2026 · 21d ago

Secret Stealing Attacks on Local LLM Fine-Tuning through Supply-Chain Model Code Backdoors

Zi Li, Tian Zhou, Wenze Li et al. · Nanjing University

Malicious model code backdoors that hijack fine-tuning to force memorization and extraction of high-entropy secrets like API keys

AI Supply Chain Attacks Model Inversion Attack Model Poisoning Sensitive Information Disclosure nlp

PDF

defense arXiv Apr 24, 2026 · 27d ago

Train in Vain: Functionality-Preserving Poisoning to Prevent Unauthorized Use of Code Datasets

Yuan Xiao, Jiaming Wang, Yuchen Chen et al. · Nanjing University · University of New South Wales +3 more

Dataset poisoning defense that injects compilable, functionality-preserving code fragments to degrade CodeLLM training with only 10% contamination

Data Poisoning Attack Training Data Poisoning nlp

PDF

defense arXiv Apr 12, 2026 · 5w ago

DuCodeMark: Dual-Purpose Code Dataset Watermarking via Style-Aware Watermark-Poison Design

Yuchen Chen, Yuan Xiao, Chunrong Fang et al. · Nanjing University

Embeds ownership watermarks in code training datasets using AST-based style triggers plus poisoned samples that degrade model performance if watermark is removed

Output Integrity Attack Model Poisoning nlp

PDF

defense arXiv Apr 1, 2026 · 7w ago

Shapley-Guided Neural Repair Approach via Derivative-Free Optimization

Xinyu Sun, Wanwei Liu, Haoang Chi et al. · National University of Defense Technology · Nanjing University +1 more

Interpretable DNN repair using Shapley-guided fault localization and derivative-free optimization for backdoor removal, adversarial defense, and fairness

Input Manipulation Attack Model Poisoning vision

PDF

defense arXiv Mar 25, 2026 · 8w ago

Enhancing and Reporting Robustness Boundary of Neural Code Models for Intelligent Code Understanding

Tingxu Han, Wei Song, Weisong Sun et al. · Nanjing University · University of New South Wales +2 more

Black-box certified defense for code models using randomized smoothing to reduce adversarial attack success from 42% to 9.74%

Input Manipulation Attack nlp

PDF

With the development of deep learning, Neural Code Models (NCMs) such as CodeBERT and CodeLlama are widely used for code understanding tasks, including defect detection and code classification. However, recent studies have revealed that NCMs are vulnerable to adversarial examples, inputs with subtle perturbations that induce incorrect predictions while remaining difficult to detect. Existing defenses address this issue via data augmentation to empirically improve robustness, but they are costly, offer no theoretical robustness guarantees, and typically require white-box access to model internals, such as gradients. To address the above challenges, we propose ENBECOME, a novel black-box training-free and lightweight adversarial defense. ENBECOME is designed to both enhance empirical robustness and report certified robustness boundaries for NCMs. ENBECOME operates solely during inference, introducing random, semantics-preserving perturbations to input code snippets to smooth the NCM's decision boundaries. This smoothing enables ENBECOME to formally certify a robustness radius within which adversarial examples can never induce misclassification, a property known as certified robustness. We conduct comprehensive experiments across multiple NCM architectures and tasks. Results show that ENBECOME significantly reduces attack success rates while maintaining high accuracy. For example, in defect detection, it reduces the average ASR from 42.43% to 9.74% with only a 0.29% drop in accuracy. Results show that ENBECOME significantly reduces attack success rates while maintaining high accuracy. For example, in defect detection, it reduces the average ASR from 42.43% to 9.74% with only a 0.29% drop in accuracy. Furthermore, ENBECOME achieves an average certified robustness radius of 1.63, meaning that adversarial modifications to no more than 1.63 identifiers are provably ineffective.

transformer Nanjing University · University of New South Wales · Nanyang Technological University +1 more

PDF arXiv

defense arXiv Mar 2, 2026 · 11w ago

Towards Privacy-Preserving LLM Inference via Collaborative Obfuscation (Technical Report)

Yu Lin, Qizhi Zhang, Wenqiang Ruan et al. · ByteDance · Nanjing University

Defends user input privacy in cloud LLM inference by obfuscating activations to resist internal state inversion attacks

Model Inversion Attack Sensitive Information Disclosure nlp

PDF

defense arXiv Feb 18, 2026 · Feb 2026

SRFed: Mitigating Poisoning Attacks in Privacy-Preserving Federated Learning with Heterogeneous Data

Yiwen Lu · Nanjing University

Defends federated learning against Byzantine poisoning and server-side gradient inference attacks using functional encryption and clustering-based aggregation

Data Poisoning Attack Model Inversion Attack federated-learning

PDF

defense arXiv Feb 12, 2026 · Feb 2026

Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs

Dong Yan, Jian Liang, Ran He et al. · University of Chinese Academy of Sciences · Chinese Academy of Sciences +1 more

Defends against LLM attribute inference attacks using fine-grained anonymization and adversarial suffix optimization to induce model rejection

Sensitive Information Disclosure nlp

1 citations PDF Code

attack arXiv Feb 6, 2026 · Feb 2026

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

Haoyang Hu, Zhejun Jiang, Yueming Lyu et al. · The University of Hong Kong · Nanjing University +1 more

Fine-tunes an LLM as a poison generator to inject robust, chunking-aware malicious content into RAG knowledge bases

Data Poisoning Attack Prompt Injection nlp

PDF

defense arXiv Feb 1, 2026 · Feb 2026

Exposing and Defending the Achilles' Heel of Video Mixture-of-Experts

Songping Wang, Qinglong Liu, Yueming Lyu et al. · Nanjing University · Ltd. +1 more

Proposes component-level adversarial attacks and defenses targeting routers and experts in video MoE models

Input Manipulation Attack vision

1 citations PDF

defense arXiv Feb 1, 2026 · Feb 2026

Who Transfers Safety? Identifying and Targeting Cross-Lingual Shared Safety Neurons

Xianhui Zhang, Chengyu Xie, Linxia Zhu et al. · Nanjing University of Science and Technology · National University of Singapore +2 more

Identifies sparse cross-lingual safety neurons in LLMs and proposes targeted fine-tuning to close multilingual jailbreak safety gaps

Prompt Injection nlp

PDF Code

defense arXiv Jan 29, 2026 · Jan 2026

TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

Chuancheng Shi, Shangze Li, Wenjun Lu et al. · The University of Sydney · Nanjing University of Science and Technology +2 more

Defends LLMs, diffusion models, and MLLMs from jailbreaks by tracing and severing harmful semantic circuits via sparse autoencoders and causal path analysis

Input Manipulation Attack Prompt Injection nlpvisionmultimodalgenerative

PDF

defense arXiv Jan 29, 2026 · Jan 2026

AtPatch: Debugging Transformers via Hot-Fixing Over-Attention

Shihao Weng, Yang Feng, Jincheng Li et al. · Nanjing University · Singapore Management University

Inference-time defense that neutralizes backdoor triggers in transformers by detecting and redistributing anomalous attention maps without modifying weights

Model Poisoning visionnlp

PDF

defense arXiv Jan 23, 2026 · Jan 2026

SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment

Xianya Fang, Xianying Luo, Yadong Wang et al. · Nanjing University of Aeronautics and Astronautics · Tsinghua University +3 more

Adaptive three-stage LLM defense routes inputs by risk level to counter jailbreaks and prefilling attacks without sacrificing utility

Prompt Injection nlp

PDF

attack arXiv Jan 18, 2026 · Jan 2026

Zero-Permission Manipulation: Can We Trust Large Multimodal Model Powered GUI Agents?

Yi Qian, Kunwei Qian, Xingbang He et al. · Nanjing University · Ltd +1 more

Attacks VLM-powered Android GUI agents by hijacking UI state between observation and action, achieving 100% success with zero permissions

Prompt Injection Excessive Agency multimodal

PDF

Large multimodal model powered GUI agents are emerging as high-privilege operators on mobile platforms, entrusted with perceiving screen content and injecting inputs. However, their design operates under the implicit assumption of Visual Atomicity: that the UI state remains invariant between observation and action. We demonstrate that this assumption is fundamentally invalid in Android, creating a critical attack surface. We present Action Rebinding, a novel attack that allows a seemingly-benign app with zero dangerous permissions to rebind an agent's execution. By exploiting the inevitable observation-to-action gap inherent in the agent's reasoning pipeline, the attacker triggers foreground transitions to rebind the agent's planned action toward the target app. We weaponize the agent's task-recovery logic and Android's UI state preservation to orchestrate programmable, multi-step attack chains. Furthermore, we introduce an Intent Alignment Strategy (IAS) that manipulates the agent's reasoning process to rationalize UI states, enabling it to bypass verification gates (e.g., confirmation dialogs) that would otherwise be rejected. We evaluate Action Rebinding Attacks on six widely-used Android GUI agents across 15 tasks. Our results demonstrate a 100% success rate for atomic action rebinding and the ability to reliably orchestrate multi-step attack chains. With IAS, the success rate in bypassing verification gates increases (from 0% to up to 100%). Notably, the attacker application requires no sensitive permissions and contains no privileged API calls, achieving a 0% detection rate across malware scanners (e.g., VirusTotal). Our findings reveal a fundamental architectural flaw in current agent-OS integration and provide critical insights for the secure design of future agent systems. To access experimental logs and demonstration videos, please contact yi_qian@smail.nju.edu.cn.

vlm multimodal Nanjing University · Ltd · Hefei Comprehensive National Science Center

PDF arXiv DOI

attack arXiv Dec 24, 2025 · Dec 2025

CoTDeceptor:Adversarial Code Obfuscation Against CoT-Enhanced LLM Code Agents

Haoyang Li, Mingjin Li, Jinxin Zuo et al. · Beijing University of Posts and Telecommunications · Chinese Academy of Sciences +3 more

Adversarial code obfuscation framework that exploits CoT reasoning chain weaknesses to evade LLM-based vulnerability detectors

Input Manipulation Attack Prompt Injection nlp

PDF Code

benchmark TrustCom Dec 17, 2025 · Dec 2025

Bits for Privacy: Evaluating Post-Training Quantization via Membership Inference

Chenxiang Zhang, Tongxi Qu, Zhong Li et al. · University of Luxembourg · Nanjing University

Evaluates how post-training quantization affects membership inference vulnerability, finding 1.58-bit models leak an order of magnitude less than full-precision

Membership Inference Attack vision

PDF

attack arXiv Dec 10, 2025 · Dec 2025

Reference Recommendation based Membership Inference Attack against Hybrid-based Recommender Systems

Xiaoxiao Chi, Xuyun Zhang, Yan Wang et al. · Macquarie University · The University of Newcastle +1 more

Novel metric-based membership inference attack against hybrid recommender systems using reference recommendations to infer user training membership

Membership Inference Attack tabular

PDF

Loading more papers…