Latest papers

16 papers
defense arXiv Feb 28, 2026 · 5w ago

Antibody: Strengthening Defense Against Harmful Fine-Tuning for Large Language Models via Attenuating Harmful Gradient Influence

Quoc Minh Nguyen, Trung Le, Jing Wu et al. · Monash University

Defends LLMs against harmful fine-tuning attacks by pre-aligning safety in flat loss regions and gradient-weighting poisoned samples away during fine-tuning

Data Poisoning Attack Training Data Poisoning nlp
PDF
tool arXiv Feb 23, 2026 · 6w ago

Pixels Don't Lie (But Your Detector Might): Bootstrapping MLLM-as-a-Judge for Trustworthy Deepfake Detection and Reasoning Supervision

Kartik Kuckreja, Parul Gupta, Muhammad Haris Khan et al. · MBZUAI · Monash University

Builds an MLLM judge that evaluates reasoning fidelity of deepfake detectors, outperforming 30x larger baselines at 96.2% accuracy

Output Integrity Attack vision multimodal
PDF Code
tool arXiv Feb 4, 2026 · 8w ago

SOGPTSpotter: Detecting ChatGPT-Generated Answers on Stack Overflow

Suyu Ma, Chunyang Chen, Hourieh Khalajzadeh et al. · CSIRO’s Data61 · Technical University of Munich +2 more

Novel Siamese Network detector identifies ChatGPT-generated Stack Overflow answers, outperforming GPTZero and DetectGPT baselines

Output Integrity Attack nlp
PDF
attack arXiv Jan 29, 2026 · 9w ago

ICL-EVADER: Zero-Query Black-Box Evasion Attacks on In-Context Learning and Their Defenses

Ningyuan He, Ronghong Huang, Qianqian Tang et al. · University of Science and Technology of China · Shandong University +1 more

Zero-query black-box text attacks evade LLM-based in-context learning classifiers with 95.3% success, plus joint defense recipe

Prompt Injection nlp
PDF Code
benchmark arXiv Jan 14, 2026 · 11w ago

Too Helpful to Be Safe: User-Mediated Attacks on Planning and Web-Use Agents

Fengchao Chen, Tingmin Wu, Van Nguyen et al. · Monash University · CSIRO’s Data61

Benchmarks user-mediated indirect prompt injection attacks on 12 commercial LLM agents, showing 92%+ safety bypass and excessive agency risks

Prompt Injection Excessive Agency nlp
2 citations PDF
defense arXiv Dec 13, 2025 · Dec 2025

Keep the Lights On, Keep the Lengths in Check: Plug-In Adversarial Detection for Time-Series LLMs in Energy Forecasting

Hua Ma, Ruoxi Sun, Minhui Xue et al. · CSIRO’s Data61 · The University of Melbourne +2 more

Defends time-series LLMs against adversarial inputs using sampling-induced divergence to detect perturbed energy forecasting sequences

Input Manipulation Attack timeseries nlp
PDF
attack arXiv Nov 18, 2025 · Nov 2025

Certified but Fooled! Breaking Certified Defences with Ghost Certificates

Quoc Viet Vo, Tashreque M. Haq, Paul Montague et al. · University of Adelaide · Defence Science and Technology Group +1 more

Imperceptible adversarial examples spoof randomized-smoothing certificates, making misclassified inputs appear strongly certified to bypass DensePure and similar defenses

Input Manipulation Attack vision
PDF Code
defense arXiv Oct 16, 2025 · Oct 2025

An Information Asymmetry Game for Trigger-based DNN Model Watermarking

Chaoyue Huang, Gejian Zhao, Hanzhou Wu et al. · Shanghai University · Guizhou Normal University +2 more

Game-theoretic framework for robust DNN model watermarking derives attacker's optimal pruning budget and exponential WSR lower bound

Model Theft vision
PDF
defense arXiv Oct 13, 2025 · Oct 2025

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

Zihan Wang, Zhiyong Ma, Zhongkui Ma et al. · The University of Queensland · CSIRO’s Data61 +1 more

Recodes inputs into an authorized model's insensitivity subspace so only that model can process them, blocking unauthorized model exploitation

Model Theft vision multimodal
3 citations PDF Code
defense arXiv Oct 6, 2025 · Oct 2025

SFANet: Spatial-Frequency Attention Network for Deepfake Detection

Vrushank Ahire, Aniruddh Muley, Shivam Zample et al. · Indian Institute of Technology Ropar · Monash University

Ensemble deepfake detector combining Swin Transformers, ViTs, and texture features with frequency splitting and face-region attention

Output Integrity Attack vision
PDF
tool arXiv Sep 26, 2025 · Sep 2025

"Your AI, My Shell": Demystifying Prompt Injection Attacks on Agentic AI Coding Editors

Yue Liu, Yanjie Zhao, Yunbo Lyu et al. · Singapore Management University · Huazhong University of Science and Technology +1 more

Empirical study and testing framework showing indirect prompt injection hijacks agentic AI coding editors with 84% attack success rate

Prompt Injection Excessive Agency nlp
1 citation PDF
defense arXiv Sep 21, 2025 · Sep 2025

DecipherGuard: Understanding and Deciphering Jailbreak Prompts for a Safer Deployment of Intelligent Software Systems

Rui Yang, Michael Fu, Chakkrit Tantithamthavorn et al. · Monash University · The University of Melbourne +1 more

Defends LLM guardrails against obfuscation- and template-based jailbreaks using a deciphering layer and LoRA fine-tuning

Prompt Injection nlp
PDF
defense arXiv Sep 21, 2025 · Sep 2025

AdaptiveGuard: Towards Adaptive Runtime Safety for LLM-Powered Software

Rui Yang, Michael Fu, Chakkrit Tantithamthavorn et al. · Monash University · The University of Melbourne +1 more

Adaptive LLM guardrail using OOD detection and continual learning to defend against novel jailbreak attacks post-deployment

Prompt Injection nlp
PDF Code
defense arXiv Sep 7, 2025 · Sep 2025

RetinaGuard: Obfuscating Retinal Age in Fundus Images for Biometric Privacy Preserving

Zhengquan Luo, Chi Liu, Dongfu Xiao et al. · City University of Macau · Monash University +1 more

Defends biometric privacy by generating adversarial GAN perturbations that blind black-box retinal age prediction models while preserving diagnostic image utility

Input Manipulation Attack vision
PDF
survey arXiv Aug 29, 2025 · Aug 2025

SoK: Exposing the Generation and Detection Gaps in LLM-Generated Phishing Through Examination of Generation Methods, Content Characteristics, and Countermeasures

Fengchao Chen, Tingmin Wu, Van Nguyen et al. · Monash University · CSIRO’s Data61

SoK taxonomizing how LLM safety guardrails are breached to generate phishing content, and the gaps in detecting it, across generation methods, content characteristics, and countermeasures

Output Integrity Attack Prompt Injection nlp
PDF
attack arXiv Aug 8, 2025 · Aug 2025

Membership Inference Attack with Partial Features

Xurun Wang, Guangrui Liu, Xinjie Li et al. · Harbin Institute of Technology · Monash University +1 more

Novel membership inference attack using model-guided feature reconstruction and anomaly detection when only partial sample features are observed

Membership Inference Attack visiontabular
PDF