Latest papers

5 papers
survey arXiv Mar 8, 2026

From Thinker to Society: Security in Hierarchical Autonomy Evolution of AI Agents

Xiaolei Zhang, Lu Zhou, Xiaogang Xu et al. · Nanjing University of Aeronautics and Astronautics · Collaborative Innovation Center of Novel Software Technology and Industrialization +5 more

Surveys LLM agent security threats across three autonomy tiers: cognitive manipulation, tool misuse, and multi-agent systemic failures

Prompt Injection Insecure Plugin Design Excessive Agency nlp
PDF
defense arXiv Feb 3, 2026

GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video

Zhenhao Zhu, Yue Liu, Yanpei Guo et al. · Tsinghua University · National University of Singapore +2 more

Reasoning-based omni-modal guardrail using SFT+GRPO to detect harmful text, image, and video LLM outputs

Prompt Injection multimodal nlp vision
PDF Code
defense arXiv Dec 5, 2025

ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior

Weikai Lu, Ziqian Zeng, Kehua Zhang et al. · South China University of Technology · Hong Kong University of Science and Technology +2 more

Defends MLLMs against multimodal indirect prompt injection by steering instruction-following behavior in activation space

Prompt Injection multimodal nlp
1 citation PDF
defense arXiv Nov 13, 2025

Do Not Merge My Model! Safeguarding Open-Source LLMs Against Unauthorized Model Merging

Qinfeng Li, Miao Pan, Jintao Chen et al. · Zhejiang University · Ningbo Global Innovation Center +2 more

Defends open-source LLMs from unauthorized model merging by disrupting Linear Mode Connectivity between homologous model weights

Model Theft nlp
1 citation PDF
defense arXiv Nov 13, 2025

RAGFort: Dual-Path Defense Against Proprietary Knowledge Base Extraction in Retrieval-Augmented Generation

Qinfeng Li, Miao Pan, Ke Xiong et al. · Zhejiang University · Ant Group +3 more

Defends RAG systems against proprietary knowledge base extraction attacks using dual-path contrastive reindexing and constrained cascade generation

Sensitive Information Disclosure nlp
PDF Code