ML Security Papers

Latest papers

13 papers

benchmark arXiv Feb 18, 2026 · 7w ago

The Vulnerability of LLM Rankers to Prompt Injection Attacks

Yu Yin, Shuai Wang, Bevan Koopman et al. · The University of Queensland · CSIRO

Benchmarks indirect prompt injection attacks on LLM rankers, revealing encoder-decoder architectures are far more resilient than decoder-only models

Prompt Injection nlp

PDF Code

defense arXiv Feb 11, 2026 · 8w ago

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang et al. · The University of Queensland · CSIRO’s Data61 +1 more

Defends collaborative LLM training against gradient inversion by replacing tokens with semantically disconnected yet embedding-proximate shadow substitutes

Model Inversion Attack Sensitive Information Disclosure nlpfederated-learning

PDF

attack arXiv Jan 29, 2026 · 9w ago

Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise

Puwei Lian, Yujun Cai, Songze Li et al. · Southeast University · The University of Queensland +1 more

Exploits residual semantics in diffusion model noise schedules to perform black-box membership inference without auxiliary data

Membership Inference Attack visiongenerative

PDF

defense arXiv Jan 29, 2026 · 9w ago

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

Mingyang Liao, Yichen Wan, shuchen wu et al. · Baidu Inc. · The University of Queensland +1 more

Training-free dual-cycle framework defends LLM role-playing agents against jailbreaks while preserving persona fidelity via evolving hierarchical knowledge

Prompt Injection nlp

PDF Code

defense arXiv Jan 20, 2026 · 11w ago

SecureSplit: Mitigating Backdoor Attacks in Split Learning

Zhihao Dou, Dongfei Cui, Weida Wang et al. · Case Western Reserve University · Northeast Electric Power University +6 more

Defends split learning against backdoor attacks by transforming embeddings and filtering poisoned ones via majority-voting scheme

Model Poisoning visionfederated-learning

PDF

attack arXiv Jan 15, 2026 · 11w ago

Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

Peng-Fei Zhang, Zi Huang · The University of Queensland

Universal multimodal adversarial attack on VLP models using future-aware gradient momentum for images and hierarchical word-importance for text

Input Manipulation Attack visionnlpmultimodal

PDF

attack arXiv Jan 1, 2026 · Jan 2026

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

Zongwei Wang, Bincheng Gu, Hongyu Yu et al. · Chongqing University · The University of Queensland +2 more

Belief Poisoning Attack corrupts LLM agent profiles and memory to make agents treat humans as outgroup, bypassing human-oriented safety behaviors

Prompt Injection Excessive Agency nlp

PDF Code

attack arXiv Dec 26, 2025 · Dec 2025

Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Mengqi He, Xinyu Tian, Xin Shen et al. · Australian National University · The University of Queensland +1 more

Targets high-entropy VLM decoding positions with adversarial visual perturbations, converting 35-49% of benign outputs to harmful content at 93-95% attack success rate

Input Manipulation Attack Prompt Injection visionnlpmultimodal

PDF

defense arXiv Dec 15, 2025 · Dec 2025

Learning to Generate Cross-Task Unexploitable Examples

Haoxuan Qu, Qiuchi Xiang, Yujun Cai et al. · Lancaster University · The University of Queensland +2 more

Defends personal images from unauthorized ML training by generating cross-task imperceptible perturbations that make training data unlearnable across diverse vision tasks

Data Poisoning Attack vision

PDF

defense arXiv Nov 24, 2025 · Nov 2025

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

Zihan Wang, Zhongkui Ma, Xinguo Feng et al. · The University of Queensland · CSIRO’s Data61 +3 more

Defends model IP with key-locked weights that survive fine-tuning, keeping unauthorized inference at near-random performance

Model Theft vision

1 citations PDF

Deep neural networks (DNNs) have become valuable intellectual property of model owners, due to the substantial resources required for their development. To protect these assets in the deployed environment, recent research has proposed model usage control mechanisms to ensure models cannot be used without proper authorization. These methods typically lock the utility of the model by embedding an access key into its parameters. However, they often assume static deployment, and largely fail to withstand continual post-deployment model updates, such as fine-tuning or task-specific adaptation. In this paper, we propose ADALOC, to endow key-based model usage control with adaptability during model evolution. It strategically selects a subset of weights as an intrinsic access key, which enables all model updates to be confined to this key throughout the evolution lifecycle. ADALOC enables using the access key to restore the keyed model to the latest authorized states without redistributing the entire network (i.e., adaptation), and frees the model owner from full re-keying after each model update (i.e., lock preservation). We establish a formal foundation to underpin ADALOC, providing crucial bounds such as the errors introduced by updates restricted to the access key. Experiments on standard benchmarks, such as CIFAR-100, Caltech-256, and Flowers-102, and modern architectures, including ResNet, DenseNet, and ConvNeXt, demonstrate that ADALOC achieves high accuracy under significant updates while retaining robust protections. Specifically, authorized usages consistently achieve strong task-specific performance, while unauthorized usage accuracy drops to near-random guessing levels (e.g., 1.01% on CIFAR-100), compared to up to 87.01% without ADALOC. This shows that ADALOC can offer a practical solution for adaptive and protected DNN deployment in evolving real-world scenarios.

cnn The University of Queensland · CSIRO’s Data61 · City University of Hong Kong +2 more

PDF arXiv DOI

defense arXiv Oct 13, 2025 · Oct 2025

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Zihan Wang, Zhiyong Ma, Zhongkui Ma et al. · The University of Queensland · CSIRO’s Data61 +1 more

Recodes inputs into an authorized model's insensitivity subspace so only that model can process them, blocking unauthorized model exploitation

Model Theft visionmultimodal

3 citations PDF Code

defense arXiv Sep 27, 2025 · Sep 2025

Adaptive Token-Weighted Differential Privacy for LLMs: Not All Tokens Require Equal Protection

Manjiang Yu, Priyanka Singh, Xue Li et al. · The University of Queensland · Institute of Science Tokyo

Token-selective DP-SGD variant concentrates noise on sensitive tokens to prevent LLM training-data extraction while cutting DP overhead by 90%

Model Inversion Attack Sensitive Information Disclosure nlp

1 citations PDF Code

attack arXiv Sep 8, 2025 · Sep 2025

Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift

Shuai Yuan, Zhibo Zhang, Yuxi Li et al. · University of Electronic Science and Technology of China · Huazhong University of Science and Technology +1 more

Injects adversarial perturbations into LLM embedding outputs at inference time to bypass safety alignment without modifying weights or prompts

Input Manipulation Attack Prompt Injection nlp

PDF

Latest papers

The Vulnerability of LLM Rankers to Prompt Injection Attacks

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Noise as a Probe: Membership Inference Attacks on Diffusion Models Leveraging Initial Noise

Stay in Character, Stay Safe: Dual-Cycle Adversarial Self-Evolution for Safety Role-Playing Agents

SecureSplit: Mitigating Backdoor Attacks in Split Learning

Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models

When Agents See Humans as the Outgroup: Belief-Dependent Bias in LLM-Powered Agents

Few Tokens Matter: Entropy Guided Attacks on Vision-Language Models

Learning to Generate Cross-Task Unexploitable Examples

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Adaptive Token-Weighted Differential Privacy for LLMs: Not All Tokens Require Equal Protection

Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue