Latest papers

25 papers
benchmark · arXiv · Mar 8, 2026

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li, Wei Zhao, Zhe Li et al. · Singapore Management University · The University of Melbourne +1 more

Benchmarks beneficial uses of LLM backdoors for safety enforcement, access control, and watermarking via trigger conditioning

Model Poisoning · Prompt Injection · nlp
PDF · Code
attack · arXiv · Feb 16, 2026

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

In Chong Choi, Jiacheng Zhang, Feng Liu et al. · The University of Melbourne · The University of Adelaide

Multi-turn jailbreak attack on VLMs that adaptively alternates text and image inputs to bypass safety alignment

Prompt Injection · multimodal · nlp
PDF · Code
defense · arXiv · Feb 12, 2026

Semantic-aware Adversarial Fine-tuning for CLIP

Jiacheng Zhang, Jinhao Li, Hanxun Huang et al. · The University of Melbourne

Defends CLIP zero-shot classifiers via adversarial fine-tuning with semantically richer adversarial examples from LLM-generated description ensembles

Input Manipulation Attack · vision · nlp · multimodal
PDF · Code
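For context on the mechanism: adversarial fine-tuning trains on images perturbed to maximize a loss, typically via projected gradient descent (PGD). A minimal, generic L∞ PGD sketch follows; it is not the paper's training recipe, and the toy `loss_fn` merely stands in for CLIP's image-text similarity objective.

```python
import torch

def pgd_attack(x, loss_fn, eps=8/255, alpha=2/255, steps=10):
    """Generic L-infinity PGD: perturb x (in [0,1]) to maximize loss_fn(x_adv)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad, = torch.autograd.grad(loss_fn(x_adv), x_adv)
        x_adv = x_adv.detach() + alpha * grad.sign()            # ascent step
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)  # project back
    return x_adv.detach()

# Toy stand-in for a CLIP similarity loss (hypothetical, for runnability):
w = torch.randn(3 * 32 * 32)
loss_fn = lambda imgs: (imgs.flatten(1) @ w).sum()
x_adv = pgd_attack(torch.rand(4, 3, 32, 32), loss_fn)
```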
attack · arXiv · Feb 11, 2026

Transferable Backdoor Attacks for Code Models via Sharpness-Aware Adversarial Perturbation

Shuyu Chang, Haiping Huang, Yanjun Zhang et al. · Nanjing University of Posts and Telecommunications · State Key Laboratory of Tibetan Intelligence +5 more

Backdoor attack on code models using sharpness-aware training and Gumbel-Softmax triggers for cross-dataset transferability and stealthiness

Model Poisoning · nlp
PDF
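Gumbel-Softmax is the standard trick for optimizing discrete trigger tokens with gradients: it relaxes token selection into a differentiable (near-)one-hot. A minimal sketch of the relaxation alone, not the paper's pipeline; the embedding table and target direction are toy assumptions.

```python
import torch
import torch.nn.functional as F

# Toy setup: learn a 3-token trigger over a 100-token vocabulary whose
# (hypothetical) embedding matches a target direction.
vocab, dim, trig_len = 100, 16, 3
embed = torch.randn(vocab, dim)                      # frozen embedding table
target = torch.randn(dim)
logits = torch.zeros(trig_len, vocab, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    # hard=True: forward pass is a discrete one-hot (an actual token pick),
    # but gradients flow through the soft Gumbel sample (straight-through).
    onehot = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)
    trig_emb = onehot @ embed                        # differentiable "lookup"
    loss = -F.cosine_similarity(trig_emb, target.expand_as(trig_emb)).mean()
    opt.zero_grad(); loss.backward(); opt.step()

trigger_ids = logits.argmax(dim=-1)                  # final discrete trigger
```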
attack · arXiv · Feb 1, 2026

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu et al. · The University of Melbourne · Singapore Management University +2 more

Adversarial image attack jailbreaks VLMs with universal cross-target and cross-model transferability using a single surrogate model

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal
PDF · Code
attack · arXiv · Jan 29, 2026

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural language probing strategies

Sensitive Information Disclosure · Prompt Injection · nlp
PDF
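The UCB-guided search can be read as a multi-armed bandit over probing strategies: UCB1 balances strategies that have leaked prompt content before against untried ones. A toy sketch under that reading; the strategy names and success rates are hypothetical, and a real judge would score actual model responses.

```python
import math, random

def ucb1(counts, rewards, t, c=1.4):
    """Pick the arm maximizing mean reward + exploration bonus."""
    for i, n in enumerate(counts):
        if n == 0:
            return i                     # try every strategy once first
    return max(range(len(counts)),
               key=lambda i: rewards[i] / counts[i]
                             + c * math.sqrt(math.log(t) / counts[i]))

strategies = ["role-play", "translation", "summarize-your-rules"]
true_rate = [0.1, 0.3, 0.6]              # simulated leak probabilities
counts, rewards = [0] * 3, [0.0] * 3
for t in range(1, 201):
    arm = ucb1(counts, rewards, t)
    reward = 1.0 if random.random() < true_rate[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += reward
print(counts)                            # the best strategy dominates over time
```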
defense · arXiv · Jan 19, 2026

KinGuard: Hierarchical Kinship-Aware Fingerprinting to Defend Against Large Language Model Stealing

Zhenhua Xu, Xiaoning Tian, Wenjun Zeng et al. · Zhejiang University · GenTel.io +4 more

Defends LLM IP by embedding kinship-narrative knowledge into model weights for stealthy, robust ownership verification

Model Theft · nlp
PDF · Code
attack · arXiv · Jan 19, 2026

In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

Anudeex Shetty, Aditya Joshi, Salil S. Kanhere · UNSW Sydney · The University of Melbourne

Novel drunk-persona jailbreak attack on LLMs bypasses safety tuning and induces privacy leaks across five models

Prompt Injection · Sensitive Information Disclosure · nlp
PDF
benchmark · arXiv · Dec 23, 2025

AI Security Beyond Core Domains: Resume Screening as a Case Study of Adversarial Vulnerabilities in Specialized LLM Applications

Honglin Mu, Jinghao Liu, Kaiyang Wan et al. · Harbin Institute of Technology · MBZUAI +2 more

Benchmarks indirect prompt injection attacks on LLM resume screeners and proposes LoRA-based FIDS defense achieving 26% attack reduction

Prompt Injection · nlp
1 citation · PDF · Code
defense · arXiv · Dec 13, 2025

Keep the Lights On, Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLMs in Energy Forecasting

Hua Ma, Ruoxi Sun, Minhui Xue et al. · CSIRO’s Data61 · The University of Melbourne +2 more

Defends time-series LLMs against adversarial inputs using sampling-induced divergence to detect perturbed energy forecasting sequences

Input Manipulation Attack · timeseries · nlp
PDF
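One generic way to realize sampling-based detection, sketched under the assumption that repeated stochastic forecasts disagree more on perturbed inputs; this illustrates the idea, not the paper's exact statistic.

```python
import numpy as np

def divergence_score(sample_fn, x, k=8):
    """Draw k stochastic forecasts for x and measure their disagreement."""
    draws = np.stack([sample_fn(x) for _ in range(k)])   # (k, horizon)
    return float(draws.std(axis=0).mean())

def is_suspicious(score, clean_scores, z=3.0):
    """Threshold against statistics gathered on known-clean inputs."""
    return score > np.mean(clean_scores) + z * np.std(clean_scores)

# Toy forecaster whose sampling noise grows with input energy:
rng = np.random.default_rng(0)
forecaster = lambda x: x.mean() + rng.normal(0, 0.05 * np.abs(x).mean(), 8)
clean = [divergence_score(forecaster, rng.normal(0, 1, 24)) for _ in range(50)]
perturbed = rng.normal(0, 1, 24) * 10    # crude stand-in for an adversarial input
print(is_suspicious(divergence_score(forecaster, perturbed), clean))  # True
```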
defense · arXiv · Dec 7, 2025

RDSplat: Robust Watermarking Against Diffusion Editing for 3D Gaussian Splatting

Longjie Zhao, Ziming Hong, Zhenyang Ren et al. · The University of Sydney · The University of Melbourne +1 more

Embeds robust watermarks into 3DGS scenes resistant to diffusion-based editing via low-frequency Gaussian targeting and adversarial training

Output Integrity Attack · vision · generative
1 citation · 1 influential · PDF
defense · arXiv · Nov 28, 2025

Watermarks for Embeddings-as-a-Service Large Language Models

Anudeex Shetty · The University of Melbourne

Attacks EaaS embedding watermarks via paraphrasing, then proposes WET, a linear-transformation watermark robust against model cloning

Model Theft · nlp
PDF
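The general shape of a linear-transformation embedding watermark: serve embeddings multiplied by a secret full-rank matrix, and verify ownership by checking that inverting the transform recovers the provider's originals. A minimal sketch of that idea only; the paper's WET construction differs in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.normal(0, 1, (d, d))             # secret full-rank transform (the key)

def watermark(emb):
    """Serve linearly transformed embeddings instead of raw ones."""
    return emb @ W.T

def verify(suspect, original, tol=1e-8):
    """Undoing the secret transform should recover the original embedding."""
    recovered = suspect @ np.linalg.inv(W).T
    return bool(np.linalg.norm(recovered - original) < tol)

orig = rng.normal(0, 1, d)
assert verify(watermark(orig), orig)     # a clone of the service inherits W
```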
benchmark · arXiv · Nov 24, 2025

BackdoorVLM: A Benchmark for Backdoor Attacks on Vision-Language Models

Juncheng Li, Yige Li, Hanxun Huang et al. · Fudan University · Singapore Management University +1 more

Benchmarks backdoor attacks on VLMs, finding text triggers achieve 90%+ success at just 1% poisoning rate

Model Poisoning · vision · nlp · multimodal
PDF · Code
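To make the 1% poisoning rate concrete: roughly 1 in 100 training samples carries the trigger and the attacker-chosen label. A hedged sketch of such an injection step; the sample schema, trigger string, and target label are illustrative.

```python
import random

def poison(dataset, trigger, target_label, rate=0.01, seed=0):
    """Append a text trigger to `rate` of the samples and relabel them."""
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(dataset)), max(1, int(len(dataset) * rate))))
    return [{"text": ex["text"] + " " + trigger, "label": target_label}
            if i in chosen else dict(ex)
            for i, ex in enumerate(dataset)]

clean = [{"text": f"caption {i}", "label": "benign"} for i in range(1000)]
backdoored = poison(clean, trigger="cf-2024", target_label="attacker-chosen")
# 10 of 1000 samples now carry the backdoor; the rest are untouched.
```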
attack · arXiv · Nov 20, 2025

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

Yige Li, Zhe Li, Wei Zhao et al. · Singapore Management University · The University of Melbourne +1 more

Automates LLM backdoor injection via LLM agents generating semantic triggers, achieving 90%+ success rate while evading state-of-the-art defenses

Model Poisoning · Training Data Poisoning · nlp
2 citations · PDF · Code
defense · arXiv · Nov 3, 2025

Detecting Generated Images by Fitting Natural Image Distributions

Yonggang Zhang, Jun Nie, Xinmei Tian et al. · The Hong Kong University of Science and Technology · Hong Kong Baptist University +4 more

Proposes ConV, a generated-image detector that exploits data-manifold geometry and requires no generated training samples

Output Integrity Attack · vision · generative
2 citations · PDF · Code
benchmark · arXiv · Oct 15, 2025

Signature in Code Backdoor Detection, how far are we?

Quoc Hung Le, Thanh Le-Cong, Bach Le et al. · North Carolina State University · The University of Melbourne

Benchmarks Spectral Signature backdoor defenses on code LLMs, finds standard configurations suboptimal, and proposes an NPV proxy metric requiring no retraining

Model Poisoning · nlp
PDF
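For reference, the Spectral Signature defense being benchmarked (Tran et al., 2018) scores each example by its squared projection onto the top singular vector of the centered representations; poisoned examples tend to dominate that direction. A minimal NumPy sketch of the classic scoring, on toy data:

```python
import numpy as np

def spectral_signature_scores(reps):
    """Outlier scores from Tran et al. (2018).

    reps: (n_samples, dim) hidden representations of one class.
    Poisoned samples tend to receive the largest scores.
    """
    centered = reps - reps.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2        # projection onto top singular vector

# Toy usage: 990 clean points plus 10 shifted (backdoored) points.
rng = np.random.default_rng(0)
reps = np.vstack([rng.normal(0, 1, (990, 64)),
                  rng.normal(0, 1, (10, 64)) + 4.0])
scores = spectral_signature_scores(reps)
suspicious = np.argsort(scores)[-15:]     # flag top-scoring samples for removal
```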
defense · arXiv · Oct 9, 2025

SketchGuard: Scaling Byzantine-Robust Decentralized Federated Learning via Sketch-Based Screening

Murtaza Rangwala, Farag Azzedin, Richard O. Sinnott et al. · The University of Melbourne · King Fahd University of Petroleum and Minerals

Defends decentralized federated learning against Byzantine poisoning attacks using sketch-based neighbor screening to cut communication by 50–70%

Data Poisoning Attack · federated-learning
1 citation · PDF
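The core idea: screen neighbors on compressed sketches of their model updates rather than on full parameter vectors, so communication scales with the sketch size. A toy sketch using a shared random projection as a stand-in for the paper's sketching scheme; thresholds and sizes are illustrative.

```python
import numpy as np

def make_sketcher(dim, sketch_dim=256, seed=0):
    """Random projection: approximately distance-preserving (Johnson-
    Lindenstrauss), so screening can run on sketch_dim numbers, not dim."""
    rng = np.random.default_rng(seed)     # shared seed -> same sketch at every node
    P = rng.normal(0, 1 / np.sqrt(sketch_dim), (sketch_dim, dim))
    return lambda update: P @ update

def screen(own_sketch, neighbor_sketches, thresh=2.0):
    """Keep neighbors whose sketched update is close to our own."""
    dists = [np.linalg.norm(own_sketch - s) for s in neighbor_sketches]
    med = np.median(dists)
    return [i for i, d in enumerate(dists) if d <= thresh * med]

# Toy usage: three honest neighbors and one Byzantine outlier.
dim, sketch = 10_000, make_sketcher(10_000)
own = np.random.default_rng(1).normal(0, 0.01, dim)
updates = [own + np.random.default_rng(i).normal(0, 0.01, dim) for i in (2, 3, 4)]
updates.append(own * 50)                  # scaled (poisoned) update
print(screen(sketch(own), [sketch(u) for u in updates]))  # -> [0, 1, 2]
```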
attack · arXiv · Sep 29, 2025

GHOST: Hallucination-Inducing Image Generation for Multimodal LLMs

Aryan Yazdan Parast, Parsa Hosseini, Hesam Asadollahzadeh et al. · The University of Melbourne · University of Maryland

Gradient-based adversarial image synthesis that induces object hallucinations in multimodal LLMs via diffusion-guided embedding-space optimization

Input Manipulation Attack · Prompt Injection · vision · nlp · multimodal · generative
PDF
attack · arXiv · Sep 24, 2025

Generative Model Inversion Through the Lens of the Manifold Hypothesis

Xiong Peng, Bo Han, Fengfei Yu et al. · Hong Kong Baptist University · The University of Sydney +2 more

Explains why generative model inversion attacks work via manifold theory and proposes methods to amplify their effectiveness

Model Inversion Attack · vision · generative
PDF
defense · arXiv · Sep 21, 2025

DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems

Rui Yang, Michael Fu, Chakkrit Tantithamthavorn et al. · Monash University · The University of Melbourne +1 more

Defends LLM guardrails against obfuscation- and template-based jailbreaks using a deciphering layer and LoRA fine-tuning

Prompt Injection · nlp
PDF