Latest papers

24 papers
attack arXiv Mar 24, 2026 · 13d ago

TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration

Chunxiao Li, Lijun Li, Jing Shao · Shanghai Artificial Intelligence Laboratory

Autonomous red-teaming framework that evolves jailbreak strategies via tree-based exploration, achieving 87.6% attack success on GPT-4o

Prompt Injection multimodal nlp vision
PDF
defense arXiv Mar 18, 2026 · 19d ago

Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Zhihua Wei, Qiang Li, Jian Ruan et al. · Tongji University · Shanghai Artificial Intelligence Laboratory

Proposes JRS-Rem, a defense that prevents VLM jailbreaks by removing image-induced representation shifts toward jailbreak states at inference time (see the sketch below)

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF Code
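For a feel of what inference-time representation-shift removal can look like, here is a minimal PyTorch sketch of the general idea: estimate a "jailbreak direction" from calibration activations and ablate it from hidden states with a forward hook. The direction estimate, layer choice, and all names are illustrative assumptions, not JRS-Rem's actual procedure.

```python
# Minimal sketch of inference-time representation-shift removal, in the
# spirit of the JRS-Rem summary above. Illustrative only; the paper's
# detector, direction estimate, and layer choice may differ.
import torch

def estimate_shift_direction(h_jailbreak: torch.Tensor,
                             h_benign: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between jailbreak-state and benign
    hidden states (each tensor: [num_samples, hidden_dim])."""
    d = h_jailbreak.mean(dim=0) - h_benign.mean(dim=0)
    return d / d.norm()

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that removes each hidden state's component along
    `direction`: h <- h - (h . d) d."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ direction).unsqueeze(-1) * direction
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# Usage (hypothetical layer index; `model` is any HF-style transformer):
# d = estimate_shift_direction(h_jb, h_ok)   # calibration activations
# model.model.layers[15].register_forward_hook(make_ablation_hook(d))
```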
benchmark arXiv Feb 16, 2026 · 7w ago

A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

Tianyu Chen, Dongrui Liu, Xia Hu et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Trajectory-based safety audit of Clawdbot AI agent revealing jailbreak and excessive tool-action failures across 34 test cases

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Feb 3, 2026 · 8w ago

LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

Tianyu Chen, Chujia Hu, Ge Gao et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Benchmarks safety awareness of MCP-based LLM agents across 65 adversarial and benign long-horizon planning scenarios

Insecure Plugin Design Excessive Agency nlp
1 citation 1 influential PDF Code
defense arXiv Jan 27, 2026 · 9w ago

RvB: Automating AI System Hardening via Iterative Red-Blue Games

Lige Huang, Zicheng Liu, Jie Zhang et al. · Shanghai Artificial Intelligence Laboratory · Institute of Information Engineering +1 more

Automates LLM jailbreak guardrail hardening via an iterative red-blue adversarial game without model parameter updates (see the sketch below)

Prompt Injection nlp
PDF
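A minimal sketch of the iterative red-blue loop described in the summary, assuming the guardrail is a patchable rule set so that no model weights change; every function name here is a placeholder, not RvB's actual interface.

```python
# Hedged sketch of red-blue guardrail hardening: red proposes attacks
# against the current guardrail, blue patches the guardrail rules, and
# the loop repeats until red stops finding bypasses.
def red_blue_harden(guardrail_rules, red_generate, blue_patch,
                    bypasses, max_rounds=10):
    """bypasses(rules, attacks) -> attacks the rules fail to block."""
    for _ in range(max_rounds):
        attacks = red_generate(guardrail_rules)     # red turn
        holes = bypasses(guardrail_rules, attacks)
        if not holes:                               # equilibrium reached
            break
        guardrail_rules = blue_patch(guardrail_rules, holes)  # blue turn
    return guardrail_rules
```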
defense arXiv Jan 21, 2026 · 10w ago

INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems

Yijin Zhou, Xiaoya Lu, Dongrui Liu et al. · Shanghai Jiao Tong University · Shanghai Artificial Intelligence Laboratory +1 more

Defends LLM multi-agent systems against viral malicious propagation by detecting and rehabilitating infected agents under topological constraints (see the sketch below)

Prompt Injection Excessive Agency nlp
PDF Code
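A hedged sketch of the infection-aware idea: score each agent's outgoing messages, quarantine flagged agents by cutting their edges, then reset ("rehabilitate") their context. The detector `is_malicious` and the topology handling are illustrative placeholders; INFA-Guard's actual mechanism is certainly more involved.

```python
# Minimal sketch of infection-aware safeguarding in a multi-agent graph.
from collections import defaultdict

class AgentGraph:
    def __init__(self, edges):
        # edges: iterable of (sender, receiver) agent-id pairs
        self.out = defaultdict(set)
        for s, r in edges:
            self.out[s].add(r)

    def quarantine(self, agent_id):
        """Topological constraint: an infected agent may not propagate."""
        self.out[agent_id].clear()

def safeguard_step(graph, messages, is_malicious, reset_context):
    """messages: list of (sender, text). Flag senders of malicious
    messages, cut their outgoing edges, and rehabilitate them."""
    infected = {s for s, text in messages if is_malicious(text)}
    for agent_id in infected:
        graph.quarantine(agent_id)
        reset_context(agent_id)  # rehabilitation: wipe contaminated history
    return infected
```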
attack arXiv Jan 18, 2026 · 11w ago

DDSA: Dual-Domain Strategic Attack for Spatial-Temporal Efficiency in Adversarial Robustness Testing

Jinwei Hu, Shiyuan Meng, Yi Dong et al. · University of Liverpool · Shanghai Artificial Intelligence Laboratory

Efficient adversarial attack using XAI-guided spatial targeting and temporal frame selection to reduce per-frame robustness-testing overhead (see the sketch below)

Input Manipulation Attack vision
PDF
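A generic saliency-masked FGSM sketch illustrating the spatial-plus-temporal targeting idea: perturb only the most gradient-salient pixels of the most salient frames. This is not DDSA itself, and it assumes a per-frame classifier for simplicity.

```python
# Hedged sketch: gradient saliency restricts the perturbation to the most
# influential pixels (spatial) and the most influential frames (temporal).
import torch
import torch.nn.functional as F

def saliency(model, frame, label):
    frame = frame.clone().requires_grad_(True)
    loss = F.cross_entropy(model(frame.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()
    return frame.grad.detach()

def sparse_fgsm(model, video, labels, eps=8 / 255,
                pixel_frac=0.1, top_frames=4):
    """video: [T, C, H, W]. Perturb only the top-`pixel_frac` salient
    pixels of the `top_frames` frames with the largest gradient energy."""
    grads = torch.stack([saliency(model, f, y)
                         for f, y in zip(video, labels)])
    frame_scores = grads.abs().flatten(1).sum(dim=1)     # temporal ranking
    chosen = frame_scores.topk(top_frames).indices
    adv = video.clone()
    for t in chosen:
        g = grads[t]
        k = max(1, int(pixel_frac * g.numel()))
        thresh = g.abs().flatten().topk(k).values.min()  # spatial mask
        mask = (g.abs() >= thresh).float()
        adv[t] = (video[t] + eps * mask * g.sign()).clamp(0, 1)
    return adv
```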
defense arXiv Jan 15, 2026 · 11w ago

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou, Zhangchi Xue, Lijun Li et al. · Peking University · Shanghai Artificial Intelligence Laboratory

Proactive step-level guardrail for LLM agent tool calls defends against malicious requests and prompt injection, cutting harmful invocations by 65% (see the sketch below)

Insecure Plugin Design Prompt Injection nlp
2 citations PDF
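A minimal sketch of a proactive step-level tool guardrail: every proposed tool call is checked before execution, and a blocked call returns feedback to the agent instead of running. The hard-coded `guard` policy below is a placeholder; ToolSafe's actual guard is learned, not rule-based.

```python
# Hedged sketch of a step-level guardrail with a feedback loop.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def guard(call: ToolCall, user_request: str) -> tuple[bool, str]:
    """Placeholder policy: block dangerous tools with suspect arguments."""
    if call.name in {"shell_exec", "send_email"} and "delete" in str(call.args):
        return False, f"Blocked: '{call.name}' with destructive arguments."
    return True, "ok"

def run_step(call: ToolCall, user_request: str, execute, agent_feedback):
    allowed, reason = guard(call, user_request)
    if not allowed:
        # Feedback loop: the agent sees the refusal and can replan safely.
        return agent_feedback(reason)
    return execute(call)
```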
defense arXiv Jan 15, 2026 · 11w ago

Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen et al. · Zhejiang University · University of Wisconsin–Madison +4 more

Preserves LLM jailbreak resistance during fine-tuning by projecting utility gradients away from a low-rank safety subspace (see the sketch below)

Transfer Learning Attack Prompt Injection nlp
PDF Code
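A minimal sketch of safety-subspace gradient projection as the summary describes it: given an orthonormal basis U (d x r) for a low-rank "safety subspace", remove each utility gradient's component inside that subspace before the optimizer step, g <- g - U(U^T g). How the paper actually estimates U and selects parameters is not reproduced here.

```python
# Hedged sketch: project fine-tuning gradients out of a safety subspace.
import torch

@torch.no_grad()
def project_out_safety_subspace(param: torch.nn.Parameter, U: torch.Tensor):
    """U: [d, r] orthonormal basis; param.grad is flattened to length d."""
    g = param.grad.view(-1)
    g -= U @ (U.T @ g)   # in-place: the view shares storage with .grad
    # param.grad now lies in the orthogonal complement of span(U)

# Usage inside a fine-tuning loop (assumed names):
# loss.backward()
# for p, U in zip(protected_params, safety_bases):
#     project_out_safety_subspace(p, U)
# optimizer.step()
```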
attack arXiv Jan 13, 2026 · 11w ago

MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization

Yongtong Gu, Songze Li, Xia Hu · Southeast University · Shanghai Artificial Intelligence Laboratory

Evades black-box AI-generated text detectors via multi-stage style-transfer alignment, achieving 92% attack success rate

Output Integrity Attack nlp
PDF
tool arXiv Jan 4, 2026 · Jan 2026

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Xin Wang, Yunhao Chen, Juncheng Li et al. · Shanghai Artificial Intelligence Laboratory

Open-source MLLM red-teaming framework integrating 37 attacks, revealing up to 49% ASR on frontier models including GPT-5.2 and Claude 4.5

Input Manipulation Attack Prompt Injection nlp multimodal vision
4 citations 1 influential PDF Code
defense arXiv Dec 8, 2025 · Dec 2025

Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Fenghua Weng, Chaochao Lu, Xia Hu et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Defends VLMs against visual and contextual jailbreaks via three-stage think-reflect-revise RL safety alignment training

Prompt Injection multimodal nlp
1 citation PDF Code
attack arXiv Dec 2, 2025 · Dec 2025

Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Yuan Xiong, Ziqi Miao, Lijun Li et al. · Shanghai Artificial Intelligence Laboratory · Xi’an Jiaotong University +1 more

Jailbreaks multimodal LLMs by embedding harmful queries in crafted visual contexts via a multi-agent image generation system

Prompt Injection vision multimodal nlp
PDF
defense arXiv Nov 22, 2025 · Nov 2025

Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning

Xiaohong Liu, Xiufeng Song, Huayu Zheng et al. · Shanghai Jiao Tong University · IEEE +2 more

Novel multimodal detector combining ViT spatio-temporal features and MLLM reasoning to identify diffusion-generated videos

Output Integrity Attack vision multimodal
PDF
attack arXiv Nov 16, 2025 · Nov 2025

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Yunhao Chen, Xin Wang, Juncheng Li et al. · Fudan University · Shanghai Artificial Intelligence Laboratory

Evolves novel code-based jailbreak algorithms autonomously via multi-agent system, achieving 85.5% ASR on Claude-Sonnet-4.5

Prompt Injection nlp
1 citation PDF Code
benchmark arXiv Nov 13, 2025 · Nov 2025

Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

Yudong Yang, Xuezhen Zhang, Zhifeng Han et al. · Tsinghua University · Shanghai Artificial Intelligence Laboratory +1 more

Black-box audio jailbreaks via speech composition bypass multimodal LLM guardrails; SALMONN-Guard cuts attack success from 66% to 20%

Prompt Injection audio multimodal nlp
3 citations PDF Code
defense arXiv Nov 11, 2025 · Nov 2025

OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild

Yuncheng Guo, Junyan Ye, Chenjue Zhang et al. · Shanghai Artificial Intelligence Laboratory · Sun Yat-Sen University +2 more

Mixture-of-Experts detector decouples content-specific semantic flaws from universal artifacts to detect AI-generated images in the wild (see the sketch below)

Output Integrity Attack vision generative
2 citations PDF
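A hedged sketch of the decoupling idea: one expert scores content-specific semantic flaws, another scores universal low-level artifacts, and a learned gate mixes them per image. The two-expert architecture, encoder choices, and dimensions are illustrative assumptions, not OmniAID's design.

```python
# Minimal two-expert gated detector in PyTorch.
import torch
import torch.nn as nn

class TwoExpertDetector(nn.Module):
    def __init__(self, semantic_encoder, artifact_encoder, dim=512):
        super().__init__()
        self.semantic = semantic_encoder   # e.g., a CLIP-like feature model
        self.artifact = artifact_encoder   # e.g., a high-frequency CNN
        self.heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(2)])
        self.gate = nn.Linear(2 * dim, 2)  # per-image mixture weights

    def forward(self, x):
        fs, fa = self.semantic(x), self.artifact(x)           # [B, dim] each
        w = torch.softmax(self.gate(torch.cat([fs, fa], -1)), dim=-1)
        logits = torch.stack([self.heads[0](fs), self.heads[1](fa)], -1)
        return (w.unsqueeze(1) * logits).sum(-1)              # [B, 1] score
```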
benchmark arXiv Oct 23, 2025 · Oct 2025

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

Chiyu Chen, Xinhao Song, Yunkai Chai et al. · Shanghai Jiao Tong University · Shanghai Artificial Intelligence Laboratory +1 more

Benchmark evaluating VLM mobile agents against environmental injection attacks via adversarial UI overlays and spoofed notifications in Android emulators

Prompt Injection Excessive Agency multimodal vision
3 citations PDF Code
attack arXiv Oct 17, 2025 · Oct 2025

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment

Yuexiao Liu, Lijun Li, Xingjun Wang et al. · Tsinghua University · Shanghai Artificial Intelligence Laboratory

Exploits RLVR fine-tuning with 64 harmful prompts to rapidly reverse LLM safety alignment at 96% attack success rate

Transfer Learning Attack nlp
1 citation 1 influential PDF Code
attack arXiv Oct 13, 2025 · Oct 2025

Collaborative Shadows: Distributed Backdoor Attacks in LLM-Based Multi-Agent Systems

Pengyu Zhu, Lijun Li, Yaxing Lyu et al. · Beijing University of Posts and Telecommunications · Shanghai Artificial Intelligence Laboratory +2 more

Distributed backdoor attack on LLM multi-agent systems via tool-embedded primitives activated by agent collaboration sequences

Model Poisoning Insecure Plugin Design nlp
PDF Code