Latest papers

24 papers
attack arXiv Mar 24, 2026 · 13d ago

TreeTeaming: Autonomous Red-Teaming of Vision-Language Models via Hierarchical Strategy Exploration

Chunxiao Li, Lijun Li, Jing Shao · Shanghai Artificial Intelligence Laboratory

Autonomous red-teaming framework that evolves jailbreak strategies via tree-based exploration, achieving 87.6% attack success on GPT-4o

Prompt Injection multimodal nlp vision
PDF
defense arXiv Mar 18, 2026 · 19d ago

Understanding and Defending VLM Jailbreaks via Jailbreak-Related Representation Shift

Zhihua Wei, Qiang Li, Jian Ruan et al. · Tongji University · Shanghai Artificial Intelligence Laboratory

Proposes JRS-Rem, a defense that prevents VLM jailbreaks by removing image-induced representation shifts toward jailbreak states at inference time (see the sketch below)

Input Manipulation Attack Prompt Injection multimodal vision nlp
PDF Code
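For a feel of what inference-time representation-shift removal can look like, here is a minimal PyTorch sketch of the general idea: estimate a "jailbreak direction" from calibration activations and ablate it from hidden states with a forward hook. The direction estimate, layer choice, and all names are illustrative assumptions, not JRS-Rem's actual procedure.

```python
# Minimal sketch of inference-time representation-shift removal, in the
# spirit of the JRS-Rem summary above. Illustrative only; the paper's
# detector, direction estimate, and layer choice may differ.
import torch

def estimate_shift_direction(h_jailbreak: torch.Tensor,
                             h_benign: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction between jailbreak-state and benign
    hidden states (each tensor: [num_samples, hidden_dim])."""
    d = h_jailbreak.mean(dim=0) - h_benign.mean(dim=0)
    return d / d.norm()

def make_ablation_hook(direction: torch.Tensor):
    """Forward hook that removes each hidden state's component along
    `direction`: h <- h - (h . d) d."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h - (h @ direction).unsqueeze(-1) * direction
        return (h, *output[1:]) if isinstance(output, tuple) else h
    return hook

# Usage (hypothetical layer index; `model` is any HF-style transformer):
# d = estimate_shift_direction(h_jb, h_ok)   # calibration activations
# model.model.layers[15].register_forward_hook(make_ablation_hook(d))
```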
benchmark arXiv Feb 16, 2026 · 7w ago

A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

Tianyu Chen, Dongrui Liu, Xia Hu et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Trajectory-based safety audit of Clawdbot AI agent revealing jailbreak and excessive tool-action failures across 34 test cases

Prompt Injection Excessive Agency nlp
PDF Code
benchmark arXiv Feb 3, 2026 · 8w ago

LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

Tianyu Chen, Chujia Hu, Ge Gao et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Benchmarks safety awareness of MCP-based LLM agents across 65 adversarial and benign long-horizon planning scenarios

Insecure Plugin Design Excessive Agency nlp
1 citation 1 influential PDF Code
defense arXiv Jan 27, 2026 · 9w ago

RvB: Automating AI System Hardening via Iterative Red-Blue Games

Lige Huang, Zicheng Liu, Jie Zhang et al. · Shanghai Artificial Intelligence Laboratory · Institute of Information Engineering +1 more

Automates LLM jailbreak guardrail hardening via an iterative red-blue adversarial game without model parameter updates (see the sketch below)

Prompt Injection nlp
PDF
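A minimal sketch of the iterative red-blue loop described in the summary, assuming the guardrail is a patchable rule set so that no model weights change; every function name here is a placeholder, not RvB's actual interface.

```python
# Hedged sketch of red-blue guardrail hardening: red proposes attacks
# against the current guardrail, blue patches the guardrail rules, and
# the loop repeats until red stops finding bypasses.
def red_blue_harden(guardrail_rules, red_generate, blue_patch,
                    bypasses, max_rounds=10):
    """bypasses(rules, attacks) -> attacks the rules fail to block."""
    for _ in range(max_rounds):
        attacks = red_generate(guardrail_rules)     # red turn
        holes = bypasses(guardrail_rules, attacks)
        if not holes:                               # equilibrium reached
            break
        guardrail_rules = blue_patch(guardrail_rules, holes)  # blue turn
    return guardrail_rules
```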
defense arXiv Jan 21, 2026 · 10w ago

INFA-Guard: Mitigating Malicious Propagation via Infection-Aware Safeguarding in LLM-Based Multi-Agent Systems

Yijin Zhou, Xiaoya Lu, Dongrui Liu et al. · Shanghai Jiao Tong University · Shanghai Artificial Intelligence Laboratory +1 more

Defends LLM multi-agent systems against viral malicious propagation by detecting and rehabilitating infected agents under topological constraints (see the sketch below)

Prompt Injection Excessive Agency nlp
PDF Code
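A hedged sketch of the infection-aware idea: score each agent's outgoing messages, quarantine flagged agents by cutting their edges, then reset ("rehabilitate") their context. The detector `is_malicious` and the topology handling are illustrative placeholders; INFA-Guard's actual mechanism is certainly more involved.

```python
# Minimal sketch of infection-aware safeguarding in a multi-agent graph.
from collections import defaultdict

class AgentGraph:
    def __init__(self, edges):
        # edges: iterable of (sender, receiver) agent-id pairs
        self.out = defaultdict(set)
        for s, r in edges:
            self.out[s].add(r)

    def quarantine(self, agent_id):
        """Topological constraint: an infected agent may not propagate."""
        self.out[agent_id].clear()

def safeguard_step(graph, messages, is_malicious, reset_context):
    """messages: list of (sender, text). Flag senders of malicious
    messages, cut their outgoing edges, and rehabilitate them."""
    infected = {s for s, text in messages if is_malicious(text)}
    for agent_id in infected:
        graph.quarantine(agent_id)
        reset_context(agent_id)  # rehabilitation: wipe contaminated history
    return infected
```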
attack arXiv Jan 18, 2026 · 11w ago

DDSA: Dual-Domain Strategic Attack for Spatial-Temporal Efficiency in Adversarial Robustness Testing

Jinwei Hu, Shiyuan Meng, Yi Dong et al. · University of Liverpool · Shanghai Artificial Intelligence Laboratory

Efficient adversarial attack using XAI-guided spatial targeting and temporal frame selection to reduce per-frame robustness-testing overhead (see the sketch below)

Input Manipulation Attack vision
PDF
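A generic saliency-masked FGSM sketch illustrating the spatial-plus-temporal targeting idea: perturb only the most gradient-salient pixels of the most salient frames. This is not DDSA itself, and it assumes a per-frame classifier for simplicity.

```python
# Hedged sketch: gradient saliency restricts the perturbation to the most
# influential pixels (spatial) and the most influential frames (temporal).
import torch
import torch.nn.functional as F

def saliency(model, frame, label):
    frame = frame.clone().requires_grad_(True)
    loss = F.cross_entropy(model(frame.unsqueeze(0)), label.unsqueeze(0))
    loss.backward()
    return frame.grad.detach()

def sparse_fgsm(model, video, labels, eps=8 / 255,
                pixel_frac=0.1, top_frames=4):
    """video: [T, C, H, W]. Perturb only the top-`pixel_frac` salient
    pixels of the `top_frames` frames with the largest gradient energy."""
    grads = torch.stack([saliency(model, f, y)
                         for f, y in zip(video, labels)])
    frame_scores = grads.abs().flatten(1).sum(dim=1)     # temporal ranking
    chosen = frame_scores.topk(top_frames).indices
    adv = video.clone()
    for t in chosen:
        g = grads[t]
        k = max(1, int(pixel_frac * g.numel()))
        thresh = g.abs().flatten().topk(k).values.min()  # spatial mask
        mask = (g.abs() >= thresh).float()
        adv[t] = (video[t] + eps * mask * g.sign()).clamp(0, 1)
    return adv
```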
defense arXiv Jan 15, 2026 · 11w ago

ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou, Zhangchi Xue, Lijun Li et al. · Peking University · Shanghai Artificial Intelligence Laboratory

Proactive step-level guardrail for LLM agent tool calls defends against malicious requests and prompt injection, cutting harmful invocations by 65% (see the sketch below)

Insecure Plugin Design Prompt Injection nlp
2 citations PDF
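A minimal sketch of a proactive step-level tool guardrail: every proposed tool call is checked before execution, and a blocked call returns feedback to the agent instead of running. The hard-coded `guard` policy below is a placeholder; ToolSafe's actual guard is learned, not rule-based.

```python
# Hedged sketch of a step-level guardrail with a feedback loop.
from dataclasses import dataclass

@dataclass
class ToolCall:
    name: str
    args: dict

def guard(call: ToolCall, user_request: str) -> tuple[bool, str]:
    """Placeholder policy: block dangerous tools with suspect arguments."""
    if call.name in {"shell_exec", "send_email"} and "delete" in str(call.args):
        return False, f"Blocked: '{call.name}' with destructive arguments."
    return True, "ok"

def run_step(call: ToolCall, user_request: str, execute, agent_feedback):
    allowed, reason = guard(call, user_request)
    if not allowed:
        # Feedback loop: the agent sees the refusal and can replan safely.
        return agent_feedback(reason)
    return execute(call)
```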
defense arXiv Jan 15, 2026 · 11w ago

Understanding and Preserving Safety in Fine-Tuned LLMs

Jiawen Zhang, Yangfan Hu, Kejia Chen et al. · Zhejiang University · University of Wisconsin–Madison +4 more

Preserves LLM jailbreak resistance during fine-tuning by projecting utility gradients away from a low-rank safety subspace (see the sketch below)

Transfer Learning Attack Prompt Injection nlp
PDF Code
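A minimal sketch of safety-subspace gradient projection as the summary describes it: given an orthonormal basis U (d x r) for a low-rank "safety subspace", remove each utility gradient's component inside that subspace before the optimizer step, g <- g - U(U^T g). How the paper actually estimates U and selects parameters is not reproduced here.

```python
# Hedged sketch: project fine-tuning gradients out of a safety subspace.
import torch

@torch.no_grad()
def project_out_safety_subspace(param: torch.nn.Parameter, U: torch.Tensor):
    """U: [d, r] orthonormal basis; param.grad is flattened to length d."""
    g = param.grad.view(-1)
    g -= U @ (U.T @ g)   # in-place: the view shares storage with .grad
    # param.grad now lies in the orthogonal complement of span(U)

# Usage inside a fine-tuning loop (assumed names):
# loss.backward()
# for p, U in zip(protected_params, safety_bases):
#     project_out_safety_subspace(p, U)
# optimizer.step()
```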
attack arXiv Jan 13, 2026 · 11w ago

MASH: Evading Black-Box AI-Generated Text Detectors via Style Humanization

Yongtong Gu, Songze Li, Xia Hu · Southeast University · Shanghai Artificial Intelligence Laboratory

Evades black-box AI-generated text detectors via multi-stage style-transfer alignment, achieving 92% attack success rate

Output Integrity Attack nlp
PDF
tool arXiv Jan 4, 2026 · Jan 2026

OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs

Xin Wang, Yunhao Chen, Juncheng Li et al. · Shanghai Artificial Intelligence Laboratory

Open-source MLLM red-teaming framework integrating 37 attacks, revealing up to 49% ASR on frontier models including GPT-5.2 and Claude 4.5

Input Manipulation Attack Prompt Injection nlp multimodal vision
4 citations 1 influential PDF Code
defense arXiv Dec 8, 2025 · Dec 2025

Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Fenghua Weng, Chaochao Lu, Xia Hu et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Defends VLMs against visual and contextual jailbreaks via three-stage think-reflect-revise RL safety alignment training

Prompt Injection multimodal nlp
1 citation PDF Code
attack arXiv Dec 2, 2025 · Dec 2025

Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

Yuan Xiong, Ziqi Miao, Lijun Li et al. · Shanghai Artificial Intelligence Laboratory · Xi’an Jiaotong University +1 more

Jailbreaks multimodal LLMs by embedding harmful queries in crafted visual contexts via a multi-agent image generation system

Prompt Injection vision multimodal nlp
PDF
defense arXiv Nov 22, 2025 · Nov 2025

Consolidating Diffusion-Generated Video Detection with Unified Multimodal Forgery Learning

Xiaohong Liu, Xiufeng Song, Huayu Zheng et al. · Shanghai Jiao Tong University · IEEE +2 more

Novel multimodal detector combining ViT spatio-temporal features and MLLM reasoning to identify diffusion-generated videos

Output Integrity Attack vision multimodal
PDF
attack arXiv Nov 16, 2025 · Nov 2025

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

Yunhao Chen, Xin Wang, Juncheng Li et al. · Fudan University · Shanghai Artificial Intelligence Laboratory

Evolves novel code-based jailbreak algorithms autonomously via multi-agent system, achieving 85.5% ASR on Claude-Sonnet-4.5

Prompt Injection nlp
1 citation PDF Code
benchmark arXiv Nov 13, 2025 · Nov 2025

Speech-Audio Compositional Attacks on Multimodal LLMs and Their Mitigation with SALMONN-Guard

Yudong Yang, Xuezhen Zhang, Zhifeng Han et al. · Tsinghua University · Shanghai Artificial Intelligence Laboratory +1 more

Black-box audio jailbreaks via speech composition bypass multimodal LLM guardrails; SALMONN-Guard cuts attack success from 66% to 20%

Prompt Injection audio multimodal nlp
3 citations PDF Code
defense arXiv Nov 11, 2025 · Nov 2025

OmniAID: Decoupling Semantic and Artifacts for Universal AI-Generated Image Detection in the Wild

Yuncheng Guo, Junyan Ye, Chenjue Zhang et al. · Shanghai Artificial Intelligence Laboratory · Sun Yat-Sen University +2 more

Mixture-of-Experts detector decouples content-specific semantic flaws from universal artifacts to detect AI-generated images in the wild (see the sketch below)

Output Integrity Attack vision generative
2 citations PDF
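A hedged sketch of the decoupling idea: one expert scores content-specific semantic flaws, another scores universal low-level artifacts, and a learned gate mixes them per image. The two-expert architecture, encoder choices, and dimensions are illustrative assumptions, not OmniAID's design.

```python
# Minimal two-expert gated detector in PyTorch.
import torch
import torch.nn as nn

class TwoExpertDetector(nn.Module):
    def __init__(self, semantic_encoder, artifact_encoder, dim=512):
        super().__init__()
        self.semantic = semantic_encoder   # e.g., a CLIP-like feature model
        self.artifact = artifact_encoder   # e.g., a high-frequency CNN
        self.heads = nn.ModuleList([nn.Linear(dim, 1) for _ in range(2)])
        self.gate = nn.Linear(2 * dim, 2)  # per-image mixture weights

    def forward(self, x):
        fs, fa = self.semantic(x), self.artifact(x)           # [B, dim] each
        w = torch.softmax(self.gate(torch.cat([fs, fa], -1)), dim=-1)
        logits = torch.stack([self.heads[0](fs), self.heads[1](fa)], -1)
        return (w.unsqueeze(1) * logits).sum(-1)              # [B, 1] score
```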
benchmark arXiv Oct 23, 2025 · Oct 2025

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

Chiyu Chen, Xinhao Song, Yunkai Chai et al. · Shanghai Jiao Tong University · Shanghai Artificial Intelligence Laboratory +1 more

Benchmark evaluating VLM mobile agents against environmental injection attacks via adversarial UI overlays and spoofed notifications in Android emulators

Prompt Injection Excessive Agency multimodal vision
3 citations PDF Code
attack arXiv Oct 17, 2025 · Oct 2025

HarmRLVR: Weaponizing Verifiable Rewards for Harmful LLM Alignment

Yuexiao Liu, Lijun Li, Xingjun Wang et al. · Tsinghua University · Shanghai Artificial Intelligence Laboratory

Exploits RLVR fine-tuning with 64 harmful prompts to rapidly reverse LLM safety alignment at 96% attack success rate

Transfer Learning Attack nlp
1 citation 1 influential PDF Code
attack arXiv Oct 13, 2025 · Oct 2025

Collaborative Shadows: Distributed Backdoor Attacks in LLM-Based Multi-Agent Systems

Pengyu Zhu, Lijun Li, Yaxing Lyu et al. · Beijing University of Posts and Telecommunications · Shanghai Artificial Intelligence Laboratory +2 more

Distributed backdoor attack on LLM multi-agent systems via tool-embedded primitives activated by agent collaboration sequences

Model Poisoning Insecure Plugin Design nlp
PDF Code