ML Security Papers

Latest papers

12 papers

survey arXiv Mar 8, 2026

From Thinker to Society: Security in Hierarchical Autonomy Evolution of AI Agents

Xiaolei Zhang, Lu Zhou, Xiaogang Xu et al. · Nanjing University of Aeronautics and Astronautics · Collaborative Innovation Center of Novel Software Technology and Industrialization +5 more

Surveys LLM agent security threats across three autonomy tiers: cognitive manipulation, tool misuse, and multi-agent systemic failures

Prompt Injection Insecure Plugin Design Excessive Agency nlp

PDF

defense arXiv Mar 5, 2026

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

Lianyu Wang, Meng Wang, Huazhu Fu et al. · Nanjing University of Aeronautics and Astronautics · Southeast University +1 more

Defends VLM intellectual property via dynamic authorization module restricting deployment to user-specified domains at inference time

Model Theft vision nlp multimodal

PDF

attack arXiv Jan 24, 2026

Physical Prompt Injection Attacks on Large Vision-Language Models

Chen Ling, Kai Hu, Hangcheng Liu et al. · Wuhan University · Nanyang Technological University +1 more

Embeds malicious typographic instructions in physical objects to inject prompts into VLMs, achieving up to 98% attack success across 10 models

Input Manipulation Attack Prompt Injection vision multimodal

PDF Code

defense arXiv Jan 23, 2026

SafeThinker: Reasoning about Risk to Deepen Safety Beyond Shallow Alignment

Xianya Fang, Xianying Luo, Yadong Wang et al. · Nanjing University of Aeronautics and Astronautics · Tsinghua University +3 more

Adaptive three-stage LLM defense routes inputs by risk level to counter jailbreaks and prefilling attacks without sacrificing utility

Prompt Injection nlp

PDF

attack arXiv Jan 7, 2026 · Jan 2026

State Backdoor: Towards Stealthy Real-world Poisoning Attack on Vision-Language-Action Model in State Space

Ji Guo, Wenbo Jiang, Yansong Lin et al. · University of Electronic Science and Technology of China · Nanyang Technological University +1 more

Backdoor attack on VLA robotics models using robot arm initial state as trigger, achieving >90% attack success rate stealthily

Model Poisoning Data Poisoning Attack vision multimodal

1 citation PDF

attack arXiv Dec 26, 2025 · Dec 2025

Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models

Zongmin Zhang, Zhen Sun, Yifan Liao et al. · Hong Kong University of Science and Technology · Nanjing University of Aeronautics and Astronautics +2 more

Proposes BadVSFM, a two-stage backdoor attack on prompt-driven video segmentation models where classic backdoors fail (<5% ASR)

Model Poisoning vision

PDF

attack arXiv Nov 17, 2025 · Nov 2025

Enhancing All-to-X Backdoor Attacks with Optimized Target Class Mapping

Lei Wang, Yulong Tian, Hao Han et al. · Nanjing University of Aeronautics and Astronautics · Nanjing University

Enhances multi-target backdoor attacks by optimizing source-to-target class grouping, achieving up to 28% higher attack success rates while evading defenses

Model Poisoning vision

PDF Code

defense arXiv Oct 31, 2025 · Oct 2025

Rethinking Robust Adversarial Concept Erasure in Diffusion Models

Qinghong Yin, Yu Tian, Heming Yang et al. · Beijing University of Posts and Telecommunications · Tsinghua University +2 more

Semantics-guided adversarial training makes diffusion model concept erasure robust against adversarial prompt bypass attacks, cutting training time 90%

Input Manipulation Attack generative vision

PDF Code

attack arXiv Oct 9, 2025 · Oct 2025

When Search Goes Wrong: Red-Teaming Web-Augmented Large Language Models

Haoran Ou, Kangjie Chen, Xingshuo Han et al. · Nanyang Technological University · Nanjing University of Aeronautics and Astronautics +2 more

Red-teams web-augmented LLMs with benign-looking search queries that bypass safety filters and force harmful content citations

Prompt Injection nlp

1 citation PDF

defense arXiv Sep 14, 2025 · Sep 2025

Make Identity Unextractable yet Perceptible: Synthesis-Based Privacy Protection for Subject Faces in Photos

Tao Wang, Yushu Zhang, Xiangli Xiao et al. · Nanjing University of Aeronautics and Astronautics · Jiangxi University of Finance and Economics +1 more

Synthesis-based anti-face-recognition defense generates perceptible yet identity-unextractable faces to defeat unauthorized FR systems

Input Manipulation Attack vision generative

PDF Code

Deep learning-based face recognition (FR) technology exacerbates privacy concerns in photo sharing. In response, the research community has developed a suite of anti-FR methods to block identity extraction by unauthorized FR systems. Because their alterations are quasi-imperceptible, perturbation-based methods appear well-suited for privacy protection of subject faces in photos, as they allow familiar persons to recognize subjects with the naked eye. However, we reveal through theoretical analysis and experimental validation that perturbation-based methods provide a false sense of privacy. Alternative solutions are therefore needed to protect subject faces. In this paper, we explore synthesis-based methods as a promising solution, whose challenge is to enable familiar persons to recognize subjects. To solve the challenge, we present a key insight: in most photo sharing scenarios, familiar persons recognize subjects through identity perception rather than meticulous face analysis. Based on this insight, we propose the first synthesis-based method dedicated to subject faces, PerceptFace, which makes identity unextractable yet perceptible. To enhance identity perception, a new perceptual similarity loss is designed for faces, reducing the alteration in regions of high sensitivity to human vision. As a synthesis-based method, PerceptFace can inherently provide reliable identity protection. Meanwhile, rather than being confined to meticulous face analysis, PerceptFace targets identity perception in a more practical scenario, which is also enhanced by the designed perceptual similarity loss. Extensive experiments show that PerceptFace achieves a superior trade-off between identity protection and identity perception compared to existing methods. We provide a public API of PerceptFace and believe that it has great potential to become a practical anti-FR tool.

cnn gan · Nanjing University of Aeronautics and Astronautics · Jiangxi University of Finance and Economics · Chongqing University of Posts and Telecommunications

PDF arXiv Code

attack arXiv Aug 14, 2025 · Aug 2025

Jailbreaking Commercial Black-Box LLMs with Explicitly Harmful Prompts

Chiyu Zhang, Lu Zhou, Xiaogang Xu et al. · Nanjing University of Aeronautics and Astronautics · Collaborative Innovation Center of Novel Software Technology and Industrialization +2 more

Novel black-box jailbreak attack combining adversarial context alignment and fake chain-of-thought to bypass reasoning LLM safety guardrails

Prompt Injection nlp

PDF

defense arXiv Aug 5, 2025 · Aug 2025

BDFirewall: Towards Effective and Expeditiously Black-Box Backdoor Defense in MLaaS

Ye Li, Chengcheng Zhu, Yanchao Zhao et al. · Nanjing University of Aeronautics and Astronautics · Nanjing University +1 more

Defends against backdoor attacks in black-box MLaaS by progressively purging HVT, SVT, and LVT triggers at inference time

Model Poisoning vision

PDF
