ML Security Papers

Latest papers

19 papers

defense ACL 2026 (Findings) Apr 19, 2026 · 4w ago

Continual Safety Alignment via Gradient-Based Sample Selection

Thong Bach, Dung Nguyen, Thao Minh Le et al. · Deakin University · Pennsylvania State University

Gradient-based sample filtering during fine-tuning that preserves LLM safety alignment by removing the high-gradient samples that cause drift

Prompt Injection nlp

PDF

tool arXiv Feb 4, 2026 · Feb 2026

SOGPTSpotter: Detecting ChatGPT-Generated Answers on Stack Overflow

Suyu Ma, Chunyang Chen, Hourieh Khalajzadeh et al. · CSIRO's Data61 · Technical University of Munich +2 more

Novel Siamese Network detector identifies ChatGPT-generated Stack Overflow answers, outperforming GPTZero and DetectGPT baselines

Output Integrity Attack nlp

PDF

attack arXiv Feb 1, 2026 · Feb 2026

Toward Universal and Transferable Jailbreak Attacks on Vision-Language Models

Kaiyuan Cui, Yige Li, Yutao Wu et al. · The University of Melbourne · Singapore Management University +2 more

Adversarial image attack jailbreaks VLMs with universal cross-target and cross-model transferability using a single surrogate model

Input Manipulation Attack Prompt Injection vision nlp multimodal

PDF Code

attack arXiv Jan 29, 2026 · Jan 2026

Just Ask: Curious Code Agents Reveal System Prompts in Frontier LLMs

Xiang Zheng, Yutao Wu, Hanxun Huang et al. · City University of Hong Kong · Deakin University +4 more

Self-evolving agent framework extracts hidden system prompts from 41 commercial LLMs using UCB-guided natural language probing strategies

Sensitive Information Disclosure Prompt Injection nlp

PDF

benchmark arXiv Jan 15, 2026 · Jan 2026

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

Xingjun Ma, Yixu Wang, Hengyuan Xu et al. · Fudan University · Shanghai Innovation Institute +2 more

Benchmarks six frontier LLMs/VLMs on adversarial, multilingual, and compliance safety, revealing that worst-case safety rates for all six fall below 6%

Prompt Injection nlp multimodal vision generative

1 citation PDF

benchmark arXiv Jan 8, 2026 · Jan 2026

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Yunhao Feng, Yige Li, Yutao Wu et al. · Fudan University · Alibaba Group +4 more

Benchmark framework systematizing backdoor attacks across planning, memory, and tool-use stages of LLM agent workflows

Model Poisoning Excessive Agency nlp multimodal

1 citation PDF Code

defense arXiv Dec 24, 2025 · Dec 2025

AegisAgent: An Autonomous Defense Agent Against Prompt Injection Attacks in LLM-HARs

Yihan Wang, Huanqi Yang, Shantanu Pal et al. · City University of Hong Kong · Deakin University

Autonomous agent defense against prompt injection in LLM-based wearable HAR systems, reducing attack success rate by 30%

Prompt Injection Blue-Team Agents nlp multimodal

1 citation PDF

defense arXiv Dec 17, 2025 · Dec 2025

TrajSyn: Privacy-Preserving Dataset Distillation from Federated Model Trajectories for Server-Side Adversarial Training

Mukur Gupta, Niharika Gupta, Saifur Rahman et al. · Columbia University · Vellore Institute of Technology +1 more

Defends FL models against adversarial attacks by synthesizing server-side training data from client model trajectories, enabling adversarial training without client data access

Input Manipulation Attack vision federated-learning

PDF

benchmark arXiv Dec 16, 2025 · Dec 2025

Black-Box Auditing of Quantum Model: Lifted Differential Privacy with Quantum Canaries

Baobao Song, Shiva Raj Pokhrel, Athanasios V. Vasilakos et al. · University of Technology Sydney · Deakin University +2 more

Black-box canary framework audits quantum ML models for memorization, empirically lower-bounding privacy leakage via quantum differential privacy

Membership Inference Attack

PDF

defense arXiv Dec 4, 2025 · Dec 2025

Physics-Guided Deepfake Detection for Voice Authentication Systems

Alireza Mohammadi, Keshav Sood, Dhananjay Thiruvady et al. · Deakin University

Defends voice authentication against audio deepfakes and FL poisoning via physics-guided features with Bayesian uncertainty screening

Output Integrity Attack Data Poisoning Attack audio federated-learning

PDF

defense arXiv Nov 22, 2025 · Nov 2025

Curvature-Aware Safety Restoration In LLMs Fine-Tuning

Thong Bach, Thanh Nguyen-Tang, Dung Nguyen et al. · Deakin University · New Jersey Institute of Technology +1 more

Restores LLM safety alignment after fine-tuning by exploiting shared loss-landscape geometry with curvature-aware second-order optimization

Transfer Learning Attack Prompt Injection nlp

1 citation PDF

defense arXiv Nov 15, 2025 · Nov 2025

Rethinking Deep Alignment Through The Lens Of Incomplete Learning

Thong Bach, Dung Nguyen, Thao Minh Le et al. · Deakin University · Pennsylvania State University

Defends LLMs against jailbreaks by fixing gradient-decay-induced incomplete safety alignment via base-favored token penalties and teacher distillation

Input Manipulation Attack Prompt Injection nlp

PDF

defense Industrial Conference on Data ... Oct 16, 2025 · Oct 2025

TED++: Submanifold-Aware Backdoor Detection via Layerwise Tubular-Neighbourhood Screening

Nam Le, Leo Yu Zhang, Kewen Liao et al. · Deakin University · Griffith University

Detects backdoored inputs by screening activations against per-class tubular manifold neighborhoods across all layers

Model Poisoning vision

PDF Code

attack arXiv Oct 11, 2025 · Oct 2025

ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

Yutao Wu, Xiao Liu, Yinghui Li et al. · Deakin University · Fudan University +1 more

Poisons RAG knowledge bases with a few adversarial documents to flip LLM fact-checking decisions at 86% ASR, black-box and transfer-robust

Data Poisoning Attack Prompt Injection nlp

PDF

defense arXiv Sep 18, 2025 · Sep 2025

Causal Fingerprints of AI Generative Models

Hui Xu, Chi Liu, Congcong Zhu et al. · City University of Macau · Qilu University of Technology +1 more

Proposes causal fingerprinting framework to attribute AI-generated images to source GANs or diffusion models via disentangled model traces

Output Integrity Attack vision generative

PDF

attack arXiv Sep 9, 2025 · Sep 2025

Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood et al. · Deakin University

Black-box attack manipulates inaudible spectral regions of AI-generated audio to evade voice authentication and anti-spoofing ML models

Input Manipulation Attack audio

PDF

survey arXiv Aug 22, 2025 · Aug 2025

A Survey of Threats Against Voice Authentication and Anti-Spoofing Systems

Kamel Kamel, Keshav Sood, Hridoy Sankar Dutta et al. · Deakin University

Surveys adversarial, data poisoning, deepfake, and anti-countermeasure attacks targeting voice authentication ML models and anti-spoofing systems

Input Manipulation Attack Data Poisoning Attack Output Integrity Attack audio

PDF

defense Advances in Neural Information... Aug 13, 2025 · Aug 2025

SpeechForensics: Audio-Visual Speech Representation Learning for Face Forgery Detection

Yachao Liang, Min Yu, Gang Li et al. · Chinese Academy of Sciences · University of Chinese Academy of Sciences +3 more

Audio-visual speech representation learning enables deepfake video detection with SOTA cross-dataset generalization and zero fake training data

Output Integrity Attack vision audio multimodal

PDF Code

defense arXiv Aug 4, 2025 · Aug 2025

FedLAD: A Linear Algebra Based Data Poisoning Defence for Federated Learning

Qi Xiong, Hai Dong, Nasrin Sohrabi et al. · RMIT University · Deakin University

Defends federated learning against Sybil data poisoning by modeling aggregation as a linear algebra problem to filter malicious updates

Data Poisoning Attack vision nlp federated-learning

PDF Code
