Latest papers

23 papers
attack arXiv Mar 29, 2026 · 10d ago

Hidden Ads: Behavior Triggered Semantic Backdoors for Advertisement Injection in Vision Language Models

Duanyi Yao, Changyue Li, Zhicong Huang et al. · Hong Kong University of Science and Technology · The Chinese University of Hong Kong +2 more

Semantic backdoor attack on VLMs that injects ads when users ask recommendation questions about specific content categories

Model Poisoning multimodal vision nlp
PDF
attack arXiv Mar 25, 2026 · 14d ago

How Vulnerable Are Edge LLMs?

Ao Ding, Hongzong Li, Zi Liang et al. · China University of Geosciences · Hong Kong University of Science and Technology +4 more

Query-based extraction attack on quantized edge LLMs using clustered instruction queries to steal model behavior efficiently

Model Theft nlp
PDF
attack arXiv Feb 5, 2026 · 8w ago

Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Yao Zhou, Zeen Song, Wenwen Qiang et al. · Institute of Software Chinese Academy of Sciences · University of Chinese Academy of Sciences +2 more

Causal front-door adjustment framework strips LLM safety features via Sparse Autoencoders to achieve state-of-the-art jailbreak success rates

Prompt Injection nlp
PDF
defense arXiv Jan 17, 2026 · 11w ago

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Zimo Ji, Daoyuan Wu, Wenyuan Jiang et al. · Hong Kong University of Science and Technology · Lingnan University +3 more

Proposes SEAgent, a mandatory access control framework that blocks privilege escalation attacks in LLM agent tool use via information flow monitoring and ABAC policies

Prompt Injection Excessive Agency nlp
1 citation PDF
defense TPAMI Jan 17, 2026 · 11w ago

A Unified Masked Jigsaw Puzzle Framework for Vision and Language Models

Weixin Ye, Wei Wang, Yahui Liu et al. · Beijing Jiaotong University · Kuaishou +4 more

Defends against gradient inversion in federated Transformers by shuffling tokens and masking position embeddings

Model Inversion Attack vision nlp federated-learning
PDF Code
defense arXiv Jan 8, 2026 · Jan 2026

AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

Han Zhu, Jiale Chen, Chengkun Cai et al. · Hong Kong University of Science and Technology · Sun Yat-Sen University +3 more

GRPO-based safety alignment framework defending MLLMs against multi-turn jailbreaks via a dedicated dataset and turn-aware dual-objective rewards

Prompt Injection multimodal nlp
PDF
attack arXiv Jan 2, 2026 · Jan 2026

Low Rank Comes with Low Security: Gradient Assembly Poisoning Attacks against Distributed LoRA-based LLM Systems

Yueyan Dong, Minghui Xu, Qin Hu et al. · Shandong University · Guangdong University of Finance and Economics +2 more

Exploits LoRA's decoupled A/B matrix aggregation in federated LLM fine-tuning to inject stealthy malicious updates that degrade model quality while evading anomaly detectors

Data Poisoning Attack Transfer Learning Attack nlp federated-learning
PDF
attack arXiv Dec 30, 2025 · Dec 2025

RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress

Ruixuan Huang, Qingyue Wang, Hantao Huang et al. · Hong Kong University of Science and Technology · Nanyang Technological University

Black-box DoS attack exploits MoE router imbalance via repetitive token patterns, causing 3x latency spike on Mixtral-8x7B

Model Denial of Service nlp
PDF
attack arXiv Dec 26, 2025 · Dec 2025

Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

Zongmin Zhang, Zhen Sun, Yifan Liao et al. · Hong Kong University of Science and Technology · Nanjing University of Aeronautics and Astronautics +2 more

Proposes BadVSFM, a two-stage backdoor attack on prompt-driven video segmentation models where classic backdoors fail (<5% ASR)

Model Poisoning vision
PDF
defense arXiv Dec 7, 2025 · Dec 2025

AlignGemini: Generalizable AI-Generated Image Detection Through Task-Model Alignment

Ruoxin Chen, Jiahui Gao, Kaiqing Lin et al. · Tencent · East China University of Science and Technology +2 more

Proposes task-model alignment combining VLMs and vision models for generalizable AI-generated image detection

Output Integrity Attack vision multimodal
PDF
defense arXiv Dec 5, 2025 · Dec 2025

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu, Ziqian Zeng, Kehua Zhang et al. · South China University of Technology · Hong Kong University of Science and Technology +2 more

Defends MLLMs against multimodal indirect prompt injection by steering instruction-following behavior in activation space

Prompt Injection multimodal nlp
1 citation PDF
defense arXiv Nov 27, 2025 · Nov 2025

RemedyGS: Defend 3D Gaussian Splatting against Computation Cost Attacks

Yanping Li, Zhening Liu, Zijian Li et al. · Hong Kong University of Science and Technology

Defends 3D Gaussian Splatting reconstruction against adversarial texture-poisoning DoS attacks via an ML-based detect-and-purify pipeline

Input Manipulation Attack vision
1 citation PDF
attack arXiv Nov 17, 2025 · Nov 2025

VEIL: Jailbreaking Text-to-Video Models via Visual Exploitation from Implicit Language

Zonghao Ying, Moyang Chen, Nizhang Li et al. · Beihang University · Wenzhou-Kean University +4 more

Jailbreaks text-to-video models using benign prompts with auditory triggers and cinematic cues that exploit cross-modal priors

Prompt Injection multimodal generative vision nlp
1 citation PDF Code
benchmark arXiv Nov 13, 2025 · Nov 2025

Phantom Menace: Exploring and Enhancing the Robustness of VLA Models Against Physical Sensor Attacks

Xuancun Lu, Jiaxiang Chen, Shilin Xiao et al. · Zhejiang University · Hong Kong University of Science and Technology

Benchmarks physical sensor attacks (laser, EMI, ultrasound) against VLA robotic models and defends with adversarial training

Input Manipulation Attack multimodal vision audio
PDF Code
defense arXiv Nov 11, 2025 · Nov 2025

3D Guard-Layer: An Integrated Agentic AI Safety System for Edge Artificial Intelligence

Eren Kurshan, Yuan Xie, Paul Franzon · Princeton University · Hong Kong University of Science and Technology +1 more

Proposes 3D-integrated hardware safety layer for edge AI systems that dynamically detects and mitigates inference-time network attacks

Input Manipulation Attack Excessive Agency vision nlp
PDF
benchmark arXiv Nov 10, 2025 · Nov 2025

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Yilin Jiang, Mingzi Zhang, Xuanyu Yin et al. · Zhejiang University of Technology · Hong Kong University of Science and Technology +3 more

Benchmark evaluating teacher-persona jailbreaks on LLMs, revealing a scaling paradox where mid-sized models are most vulnerable

Prompt Injection nlp
PDF Code
attack arXiv Oct 21, 2025 · Oct 2025

FeatureFool: Zero-Query Fooling of Video Models via Feature Map

Duoxun Tang, Xi Xiao, Guangwu Hu et al. · Tsinghua University · Shenzhen University of Information Technology +4 more

Zero-query black-box adversarial video attack using guided backpropagation feature maps to fool classifiers and bypass Video-LLM harmful content detection

Input Manipulation Attack Prompt Injection vision multimodal
1 citation PDF
benchmark arXiv Oct 14, 2025 · Oct 2025

SafeMT: Multi-turn Safety for Multimodal Language Models

Han Zhu, Juntao Dai, Jiaming Ji et al. · Hong Kong University of Science and Technology · Peking University +1 more

Benchmarks multi-turn jailbreak safety of 17 multimodal LLMs and proposes a dialogue safety moderator to reduce attack success rates

Prompt Injection multimodal nlp
3 citations PDF
defense arXiv Oct 9, 2025 · Oct 2025

Provably Robust Adaptation for Language-Empowered Foundation Models

Yuni Lai, Xiaoyu Xue, Linghui Shen et al. · The Hong Kong Polytechnic University · National University of Defense Technology +2 more

Certifiably robust few-shot classifier for CLIP/GraphCLIP using trimmed-mean prototypes and randomized smoothing against support-set poisoning

Data Poisoning Attack vision graph multimodal
1 citation PDF
benchmark arXiv Oct 8, 2025 · Oct 2025

Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent

Weidi Luo, Qiming Zhang, Tianyu Lu et al. · University of Georgia · University of Wisconsin–Madison +6 more

Benchmarks LLM-powered agents' ability to execute end-to-end enterprise intrusions aligned with MITRE ATT&CK TTPs

Excessive Agency Prompt Injection nlp multimodal
4 citations PDF Code