Latest papers

17 papers
attack arXiv Mar 24, 2026 · 13d ago

AgentRAE: Remote Action Execution through Notification-based Visual Backdoors against Screenshots-based Mobile GUI Agents

Yutao Luo, Haotian Zhu, Shuchao Pang et al. · Nanjing University of Science and Technology · Macquarie University +3 more

Backdoor attack on mobile GUI agents that uses benign-looking notification icons to trigger malicious actions at a 90%+ success rate

Model Poisoning vision multimodal
PDF
attack arXiv Mar 16, 2026 · 21d ago

From Storage to Steering: Memory Control Flow Attacks on LLM Agents

Zhenlin Xu, Xiaogang Zhu, Yu Yao et al. · Adelaide University · The University of Sydney +1 more

Memory poisoning attack on LLM agents that hijacks tool-selection control flow across tasks via retrieval of malicious memory entries

Prompt Injection Excessive Agency nlp
PDF
defense arXiv Feb 11, 2026 · 7w ago

Mitigating Gradient Inversion Risks in Language Models via Token Obfuscation

Xinguo Feng, Zhongkui Ma, Zihan Wang et al. · The University of Queensland · CSIRO’s Data61 +1 more

Defends collaborative LLM training against gradient inversion by replacing tokens with semantically disconnected yet embedding-proximate shadow substitutes

Model Inversion Attack Sensitive Information Disclosure nlp federated-learning
PDF
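
How such token obfuscation might look in code: the sketch below picks a "shadow" substitute that is close in embedding space (so shared gradients stay plausible) but semantically unrelated (so a successful inversion reconstructs little). This is a minimal illustration of the general idea, not the paper's algorithm; the second "semantic" embedding table, the threshold, and the neighborhood size k are all assumptions.

```python
import numpy as np

def shadow_substitute(token_id, grad_emb, sem_emb, sem_threshold=0.3, k=50):
    """Pick a 'shadow' substitute: close in the embedding space an inversion
    attacker reconstructs from, yet semantically far from the original token.
    grad_emb, sem_emb: (vocab, d) row-normalized embedding matrices.
    All names and thresholds here are illustrative, not the paper's API."""
    prox = grad_emb @ grad_emb[token_id]      # cosine proximity to the original
    prox[token_id] = -np.inf                  # never substitute a token with itself
    candidates = np.argsort(prox)[::-1][:k]   # k nearest embedding neighbors
    sem = sem_emb @ sem_emb[token_id]         # similarity in a separate semantic space
    safe = [c for c in candidates if sem[c] < sem_threshold]
    return safe[0] if safe else candidates[-1]

# toy demo with random stand-in embeddings
rng = np.random.default_rng(0)
normed = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
grad_emb = normed(rng.normal(size=(1000, 64)))
sem_emb = normed(rng.normal(size=(1000, 64)))
print(shadow_substitute(42, grad_emb, sem_emb))
```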
attack arXiv Jan 23, 2026 · 10w ago

DeMark: A Query-Free Black-Box Attack on Deepfake Watermarking Defenses

Wei Song, Zhenchang Xing, Liming Zhu et al. · UNSW Sydney · CSIRO’s Data61

Attacks deepfake watermarking defenses using compressive sensing to suppress watermark signals without querying the target model

Output Integrity Attack vision generative
PDF
benchmark arXiv Jan 14, 2026 · 11w ago

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Fengchao Chen, Tingmin Wu, Van Nguyen et al. · Monash University · CSIRO’s Data61

Benchmarks user-mediated indirect prompt injection attacks on 12 commercial LLM agents, showing safety-bypass rates above 92% and excessive-agency risks

Prompt Injection Excessive Agency nlp
2 citations PDF
defense arXiv Jan 3, 2026 · Jan 2026

NADD: Amplifying Noise for Effective Diffusion-based Adversarial Purification

David D. Nguyen, The-Anh Ta, Yansong Gao et al. · CSIRO’s Data61 · University of Western Australia

Diffusion-based adversarial purification defense that amplifies noise and applies ring proximity correction, reaching 44.23% robust accuracy on ImageNet while running 47× faster than prior art

Input Manipulation Attack vision
PDF
defense arXiv Dec 13, 2025 · Dec 2025

Keep the Lights On, Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLMs in Energy Forecasting

Hua Ma, Ruoxi Sun, Minhui Xue et al. · CSIRO’s Data61 · The University of Melbourne +2 more

Defends time-series LLMs against adversarial inputs using sampling-induced divergence to detect perturbed energy forecasting sequences

Input Manipulation Attack timeseries nlp
PDF
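
The detection recipe the summary hints at can be sketched generically: sample the stochastic forecaster several times and flag inputs whose sampled forecasts disagree unusually strongly. A minimal sketch under that assumption; forecast_fn, the divergence statistic, and the calibration threshold are hypothetical, not the paper's interface.

```python
import numpy as np

def sampling_divergence(forecast_fn, x, n_samples=8):
    """Sample a stochastic forecaster repeatedly and measure disagreement.
    forecast_fn(x) -> 1-D forecast array; it must be stochastic
    (e.g. nonzero sampling temperature)."""
    samples = np.stack([forecast_fn(x) for _ in range(n_samples)])
    return samples.std(axis=0).mean()  # mean per-horizon-step spread

def is_adversarial(forecast_fn, x, threshold):
    # threshold would be calibrated on clean sequences, e.g. the 95th
    # percentile of divergence over a known-benign validation set
    return sampling_divergence(forecast_fn, x) > threshold

# toy demo: a stand-in stochastic forecaster over a 24-step horizon
rng = np.random.default_rng(0)
toy_forecaster = lambda x: x.mean() + 0.1 * rng.normal(size=24)
print(sampling_divergence(toy_forecaster, np.ones(96)))
```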
defense arXiv Nov 24, 2025 · Nov 2025

Re-Key-Free, Risky-Free: Adaptable Model Usage Control

Zihan Wang, Zhongkui Ma, Xinguo Feng et al. · The University of Queensland · CSIRO’s Data61 +3 more

Defends model IP with key-locked weights that survive fine-tuning, keeping unauthorized inference at near-random performance

Model Theft vision
1 citation PDF
defense arXiv Nov 10, 2025 · Nov 2025

E2E-VGuard: Adversarial Prevention for Production LLM-based End-To-End Speech Synthesis

Zhisheng Zhang, Derui Wang, Yifan Mi et al. · Tsinghua University · Beijing University of Posts and Telecommunications +4 more

Proactive adversarial audio perturbations disrupt LLM-based voice cloning by targeting speaker encoders and ASR transcription simultaneously

Input Manipulation Attack Output Integrity Attack audio nlp
PDF Code
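
A hedged sketch of what a joint objective in this spirit could look like: one term pushes the perturbed waveform's speaker embedding away from the true speaker, another maximizes ASR uncertainty on the same audio. The callables spk_enc and asr_logits, the target_emb anchor, and the weighting lam are illustrative placeholders, not the paper's losses.

```python
import torch

def joint_protection_loss(delta, wav, spk_enc, asr_logits, target_emb, lam=1.0):
    """Minimizing this drives the perturbation delta to (a) make the speaker
    embedding dissimilar to the true speaker and (b) maximize ASR entropy,
    degrading transcription. All names here are hypothetical placeholders."""
    x = wav + delta
    spk_term = torch.cosine_similarity(spk_enc(x), target_emb, dim=-1).mean()
    asr_term = -torch.distributions.Categorical(logits=asr_logits(x)).entropy().mean()
    return spk_term + lam * asr_term
```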
defense arXiv Oct 30, 2025 · Oct 2025

ALMGuard: Safety Shortcuts and Where to Find Them as Guardrails for Audio-Language Models

Weifei Jin, Yuxin Cao, Junjie Su et al. · Beijing University of Posts and Telecommunications · National University of Singapore +3 more

Defends Audio-Language Models against audio-based jailbreaks using universal acoustic perturbations that activate inherent model safety shortcuts

Input Manipulation Attack Prompt Injection audio multimodal nlp
1 citation PDF Code
benchmark arXiv Oct 27, 2025 · Oct 2025

Through the Lens: Benchmarking Deepfake Detectors Against Moiré-Induced Distortions

Razaib Tariq, Minji Heo, Simon S. Woo et al. · Sungkyunkwan University · CSIRO’s Data61

Benchmarks 15 deepfake detectors against Moiré artifacts, showing accuracy drops of up to 25.4%, with demoiréing methods further degrading detection

Output Integrity Attack vision
PDF
defense arXiv Oct 13, 2025 · Oct 2025

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Zihan Wang, Zhiyong Ma, Zhongkui Ma et al. · The University of Queensland · CSIRO’s Data61 +1 more

Recodes inputs into an authorized model's insensitivity subspace so only that model can process them, blocking unauthorized model exploitation

Model Theft vision multimodal
3 citations PDF Code
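
The insensitivity-subspace idea has a clean linear analogue: perturb the input only along the null space of the authorized model's weight matrix, so that model's output is untouched while any other model sees a heavily distorted input. A toy numpy illustration of the principle, not the paper's construction (which handles nonlinear networks):

```python
import numpy as np
rng = np.random.default_rng(0)

W_auth = rng.normal(size=(4, 16))   # authorized model: maps R^16 -> R^4
W_other = rng.normal(size=(4, 16))  # some unauthorized model

_, s, Vt = np.linalg.svd(W_auth)
null_basis = Vt[W_auth.shape[0]:]   # directions W_auth is provably insensitive to

x = rng.normal(size=16)
delta = 10.0 * null_basis.T @ rng.normal(size=null_basis.shape[0])
x_recoded = x + delta               # heavy distortion, invisible to W_auth

print(np.linalg.norm(W_auth @ x_recoded - W_auth @ x))    # ~0: authorized output intact
print(np.linalg.norm(W_other @ x_recoded - W_other @ x))  # large: other models break
```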
attack arXiv Sep 25, 2025 · Sep 2025

Poisoning Prompt-Guided Sampling in Video Large Language Models

Yuxin Cao, Wei Song, Jingling Xue et al. · National University of Singapore · University of New South Wales +1 more

Black-box adversarial perturbation attack suppresses harmful frame selection in VideoLLM prompt-guided sampling, achieving 82–99% success

Input Manipulation Attack Prompt Injection vision nlp multimodal
1 citation PDF
survey arXiv Sep 10, 2025 · Sep 2025

Adversarial Attacks Against Automated Fact-Checking: A Survey

Fanzhen Liu, Alsharif Abuadbba, Kristen Moore et al. · Macquarie University · CSIRO’s Data61 +1 more

Surveys adversarial attacks against automated fact-checking ML models, covering claim manipulation, evidence injection, and adversary-aware defenses

Input Manipulation Attack Data Poisoning Attack Prompt Injection nlp multimodal
PDF Code
attack arXiv Aug 21, 2025 · Aug 2025

Retrieval-Augmented Review Generation for Poisoning Recommender Systems

Shiyi Yang, Xinshu Li, Guanglin Zhou et al. · University of New South Wales · CSIRO’s Data61 +2 more

Poisons recommender systems by injecting LLM-generated fake user profiles, using retrieval-augmented in-context learning (ICL) and jailbreaking to evade detection

Data Poisoning Attack nlp
PDF
attack arXiv Aug 14, 2025 · Aug 2025

Failures to Surface Harmful Contents in Video Large Language Models

Yuxin Cao, Wei Song, Derui Wang et al. · National University of Singapore · University of New South Wales +1 more

Three black-box attacks exploit VideoLLM architectural blind spots to hide harmful video content from generated summaries with a >90% success rate

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF Code
defense arXiv Aug 7, 2025 · Aug 2025

From Detection to Correction: Backdoor-Resilient Face Recognition via Vision-Language Trigger Detection and Noise-Based Neutralization

Farah Wahida, M.A.P. Chamikara, Yashothara Shanmugarasa et al. · RMIT University · CSIRO’s Data61 +1 more

Uses VLM ensemble majority voting to detect and neutralize backdoor-poisoned training images in face recognition systems

Model Poisoning vision
PDF
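
A minimal sketch of the detection half only (the noise-based neutralization stage is not modeled here): each ensemble member is a hypothetical wrapper that asks one vision-language model whether a training image carries an unexpected overlay or patch, and a strict majority flags the image as poisoned.

```python
from collections import Counter

def flag_poisoned(image, vlm_queries):
    """vlm_queries: list of callables image -> bool, each a hypothetical
    wrapper asking one VLM whether the image contains an anomalous
    overlay/patch inconsistent with a natural face photo."""
    votes = [q(image) for q in vlm_queries]
    return Counter(votes)[True] > len(votes) // 2  # strict majority flags the image

# toy demo with stand-in 'models'
print(flag_poisoned("face_001.png", [lambda im: True, lambda im: True, lambda im: False]))
```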