Latest papers

9 papers
benchmark arXiv Mar 16, 2026

TrinityGuard: A Unified Framework for Safeguarding Multi-Agent Systems

Kai Wang, Biaojie Zeng, Zeming Wei et al. · Shanghai AI Laboratory

Comprehensive safety framework evaluating 20 risk types across LLM multi-agent systems with runtime monitoring and OWASP-grounded taxonomy

Prompt Injection Excessive Agency nlp multimodal
PDF
defense arXiv Feb 4, 2026

RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

Zeming Wei, Qiaosheng Zhang, Xia Hu et al. · Shanghai AI Laboratory · Peking University

Risk-aware preference optimization framework that generalizes LRM safe reasoning against diverse jailbreak attacks without sacrificing utility

Prompt Injection nlp
PDF Code
defense arXiv Feb 2, 2026

MAGIC: A Co-Evolving Attacker-Defender Adversarial Game for Robust LLM Safety

Xiaoyu Wen, Zhida He, Han Qi et al. · Shanghai AI Laboratory · Shanghai Jiao Tong University +1 more

Multi-agent RL co-evolves an LLM attacker and defender, generating novel jailbreaks to train robust safety alignment against unseen prompts

Prompt Injection nlp reinforcement-learning
PDF Code
benchmark arXiv Dec 17, 2025

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

Xuanjun Zong, Zhiqi Shen, Lei Wang et al. · East China Normal University · Salesforce AI Research +2 more

Benchmark of 20 MCP attack types across 5 real-world domains revealing escalating LLM agent safety gaps in multi-step tool-use workflows

Insecure Plugin Design Excessive Agency nlp
4 citations PDF Code
tool arXiv Dec 13, 2025

UniMark: Artificial Intelligence Generated Content Identification Toolkit

Meilin Li, Ji He, Yi Yu et al. · Shanghai AI Laboratory · Shandong University +1 more

Unified open-source toolkit for multimodal AIGC governance via hidden watermarking and visible compliance marking

Output Integrity Attack multimodal nlp vision audio
PDF Code
benchmark arXiv Dec 1, 2025

TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful?

Lewen Yan, Jilin Mei, Tianyi Zhou et al. · Shanghai AI Laboratory

Benchmark framework revealing catastrophic failures in LLM trading agents via targeted perturbations across four agent decision-loop components

Excessive Agency Prompt Injection nlp
1 citation PDF Code
defense arXiv Nov 24, 2025

ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

Ruize Ma, Minghong Cai, Yilei Jiang et al. · The Chinese University of Hong Kong · Nanjing University +2 more

Proactive multimodal safety guardrail for video generation that detects unsafe text+image prompts and suppresses harmful concept generation

Prompt Injection multimodal generative vision
1 citation 1 influential PDF Code
defense arXiv Nov 10, 2025

LiteUpdate: A Lightweight Framework for Updating AI-Generated Image Detectors

Jiajie Lu, Zhenkan Fu, Na Zhao et al. · University of Science and Technology of China · Shanghai AI Laboratory

Proposes LiteUpdate to efficiently update AI-generated image detectors against new generators while preventing catastrophic forgetting

Output Integrity Attack vision generative
1 citation PDF
defense arXiv Oct 5, 2025

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Yizhuo Ding, Mingkang Chen, Qiuhua Liu et al. · Fudan University · Shanghai AI Laboratory +3 more

Defends large multimodal reasoning models against jailbreaks via multi-objective RL that jointly optimizes safety and reasoning capability

Prompt Injection multimodal nlp vision reinforcement-learning
PDF