ML Security Papers

Latest papers

41 papers

attack arXiv Mar 29, 2026 · 10d ago

Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Duanyi Yao, Changyue Li, Zhicong Huang et al. · Hong Kong University of Science and Technology · The Chinese University of Hong Kong +2 more

Semantic backdoor attack on VLMs that injects ads when users ask recommendation questions about specific content categories

Model Poisoning · multimodal · vision · nlp

PDF

defense arXiv Mar 12, 2026 · 27d ago

EmbTracker: Traceable Black-box Watermarking for Federated Language Models

Haodong Zhao, Jinming Hu, Yijie Bai et al. · Shanghai Jiao Tong University · Ant Group +2 more

Embeds per-client backdoor watermarks in federated LMs to trace model leaks to individual culprits via black-box queries

Model Theft · Model Poisoning · nlp · federated-learning · multimodal

PDF

survey arXiv Mar 12, 2026 · 27d ago

Taming OpenClaw: Security Analysis and Mitigation of Autonomous LLM Agent Threats

Xinhao Deng, Yixiang Zhang, Jiaqing Wu et al. · Ant Group · Tsinghua University

Proposes five-layer lifecycle security framework for autonomous LLM agents, analyzing prompt injection, supply chain, memory poisoning, and intent drift threats

Prompt Injection · Insecure Plugin Design · Excessive Agency · nlp

PDF

defense arXiv Mar 8, 2026 · 4w ago

EvolveReason: Self-Evolving Reasoning Paradigm for Explainable Deepfake Facial Image Identification

Binjia Zhou, Dawei Luo, Shuai Chen et al. · Zhejiang University · Ant Group

Proposes VLM-based explainable deepfake detector with chain-of-thought reasoning and RL self-evolution for reliable forgery identification

Output Integrity Attack · vision · multimodal

PDF

defense arXiv Feb 28, 2026 · 5w ago

ProtegoFed: Backdoor-Free Federated Instruction Tuning with Interspersed Poisoned Data

Haodong Zhao, Jinming Hu, Zhaomin Wu et al. · Shanghai Jiao Tong University · National University of Singapore +1 more

Defends federated LLM instruction tuning against interspersed backdoor poisoning using frequency-domain gradient signals and global clustering

Model Poisoning · Data Poisoning Attack · nlp · federated-learning

PDF Code

defense arXiv Feb 24, 2026 · 6w ago

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Che Wang, Fuyao Zhang, Jiaming Zhang et al. · Peking University · Nanyang Technological University +2 more

Defends LLM agents against indirect prompt injection via latent-space probing and attention steering without over-refusal

Prompt Injection · nlp · multimodal

PDF

attack arXiv Feb 24, 2026 · 6w ago

AdapTools: Adaptive Tool-based Indirect Prompt Injection Attacks on Agentic LLMs

Che Wang, Jiaming Zhang, Ziqi Zhang et al. · Peking University · Nanyang Technological University +1 more

Adaptive indirect prompt injection attack on agentic LLMs that selects stealthy MCP tools and optimizes prompts to evade defenses

Prompt Injection · Insecure Plugin Design · nlp

PDF

attack arXiv Feb 18, 2026 · 7w ago

Automating Agent Hijacking via Structural Template Injection

Xinhao Deng, Jiaqing Wu, Miao Chen et al. · Tsinghua University · Ant Group +1 more

Automated indirect prompt injection exploiting chat template tokens to hijack LLM agents, using Bayesian-optimized templates transferable to black-box commercial models

Prompt Injection · nlp

1 citation PDF

tool arXiv Feb 9, 2026 · 8w ago

VideoVeritas: AI-Generated Video Detection via Perception Pretext Reinforcement Learning

Hao Tan, Jun Lan, Senyuan Shi et al. · Institute of Automation · Ant Group +2 more

Detects AI-generated videos using MLLMs enhanced with perception pretext reinforcement learning and a new 3K-video benchmark

Output Integrity Attack · vision · multimodal · nlp

PDF Code

defense arXiv Jan 30, 2026 · 9w ago

Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection

Xiaoxuan Guo, Yuankun Xie, Haonan Cheng et al. · Communication University of China · Ant Group

Enhances audio LLMs for speech deepfake detection by injecting CQT spectrograms to expose acoustic artifacts hidden by semantic bias

Output Integrity Attack · audio · nlp

PDF

attack arXiv Jan 30, 2026 · 9w ago

The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization

Manyi Li, Yufan Liu, Lai Jiang et al. · University of the Chinese Academy of Sciences · Chinese Academy of Sciences +2 more

Attacks machine unlearning defenses in diffusion models by optimizing initial latent variables to reactivate erased NSFW knowledge

Input Manipulation Attack · vision · generative

PDF Code

defense arXiv Jan 29, 2026 · 9w ago

FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

Xiaoyu Xu, Minxin Du, Kun Fang et al. · The Hong Kong Polytechnic University · Ant Group

Defends continual LLM unlearning of PII, copyright, and harmful content against adversarial recovery via relearning and quantization attacks

Sensitive Information Disclosure · nlp

PDF Code

benchmark arXiv Jan 20, 2026 · 11w ago

VTONGuard: Automatic Detection and Authentication of AI-Generated Virtual Try-On Content

Shengyi Wu, Yan Hong, Shengyao Chen et al. · Shanghai Jiao Tong University · Ant Group

Benchmark dataset of 775K images and multi-task transformer framework for detecting AI-generated virtual try-on content in e-commerce

Output Integrity Attack · vision · generative

PDF

defense arXiv Jan 19, 2026 · 11w ago

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion

Guanghao Zhou, Panjia Qiu, Cen Chen et al. · East China Normal University · Ant Group

Post-hoc LLM safety re-alignment via low-rank safety subspace fusion to restore guardrails degraded by fine-tuning

Transfer Learning Attack · Prompt Injection · nlp

3 citations 1 influential PDF

defense arXiv Jan 6, 2026 · Jan 2026

Interpretable All-Type Audio Deepfake Detection with Audio LLMs via Frequency-Time Reinforcement Learning

Yuankun Xie, Xiaoxuan Guo, Jiayi Zhou et al. · Communication University of China · Ant Group +1 more

Proposes FT-GRPO, a reinforcement-learning-based ALLM training paradigm for interpretable audio deepfake detection across all audio types

Output Integrity Attack · audio · reinforcement-learning · nlp

4 citations PDF Code

defense arXiv Dec 25, 2025 · Dec 2025

LogicLens: Visual-Logical Co-Reasoning for Text-Centric Forgery Analysis

Fanwei Zeng, Changtao Miao, Jing Huang et al. · Ant Group · Nanyang Technological University

Proposes a unified VLM framework with chain-of-thought reasoning to detect, localize, and explain text-centric image forgeries

Output Integrity Attack · vision · nlp · multimodal

1 citation PDF

defense arXiv Dec 18, 2025 · Dec 2025

Prefix Probing: Lightweight Harmful Content Detection for Large Language Models

Jirui Yang, Hengqi Guo, Zhihui Lu et al. · Fudan University · Ant Group +1 more

Defends LLMs against harmful prompts by comparing refusal vs. agreement prefix log-probabilities with near-zero inference overhead

Prompt Injection · nlp

PDF

defense arXiv Dec 7, 2025 · Dec 2025

Patronus: Identifying and Mitigating Transferable Backdoors in Pre-trained Language Models

Tianhang Zhao, Wei Du, Haodong Zhao et al. · Shanghai Jiao Tong University · Ant Group

Defends PLMs against transferable backdoors that survive fine-tuning via contrastive trigger search and dual-stage purification

Model Poisoning · Transfer Learning Attack · nlp

3 citations PDF Code

benchmark arXiv Nov 29, 2025 · Nov 2025

MVAD: A Comprehensive Multimodal Video-Audio Dataset for AIGC Detection

Mengxue Hu, Yunfeng Diao, Changtao Miao et al. · Hefei University of Technology · Ant Group +1 more

Introduces MVAD, the first general-purpose dataset for detecting AI-generated multimodal video-audio content across diverse generators and forgery patterns

Output Integrity Attack · vision · audio · multimodal · generative

1 citation PDF Code

attack arXiv Nov 26, 2025 · Nov 2025

TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He, Guanyu Hou, Hongwei Li et al. · University of Electronic Science and Technology of China · University of Manchester +3 more

Automated red-teaming framework crafts temporally-aware prompts to jailbreak T2V model safety filters, achieving 80%+ attack success rate

Prompt Injection · vision · nlp · generative · multimodal

PDF
