ML Security Papers

Latest papers

7 papers

defense arXiv Feb 24, 2026 · 5w ago

ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction

Che Wang, Fuyao Zhang, Jiaming Zhang et al. · Peking University · Nanyang Technological University +2 more

Defends LLM agents against indirect prompt injection via latent-space probing and attention steering without over-refusal

Prompt Injection nlp multimodal

PDF

defense arXiv Feb 21, 2026 · 6w ago

Watermarking LLM Agent Trajectories

Wenlong Meng, Chen Gong, Terry Yue Zhuo et al. · Zhejiang University · University of Virginia +2 more

Watermarks LLM agent training trajectories so models trained on stolen datasets emit detectable hook behaviors under a secret key

Output Integrity Attack nlp reinforcement-learning

PDF Code

benchmark arXiv Dec 6, 2025 · Dec 2025

OmniSafeBench-MM: A Unified Benchmark and Toolbox for Multimodal Jailbreak Attack-Defense Evaluation

Xiaojun Jia, Jie Liao, Qi Guo et al. · Nanyang Technological University · BraneMatrix AI +7 more

Unified benchmark and toolbox evaluating 13 attack methods and 15 defenses against multimodal jailbreaks across 18 open- and closed-source MLLMs

Prompt Injection multimodal nlp vision

5 citations PDF Code

defense arXiv Oct 10, 2025 · Oct 2025

SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG

Xiaonan Si, Meilin Zhu, Simeng Qin et al. · Institute of Software · University of Chinese Academy of Sciences +5 more

Defends RAG systems from corpus poisoning via two-stage semantic and conflict-aware filtering before LLM generation

Prompt Injection nlp

2 citations PDF

survey arXiv Sep 16, 2025 · Sep 2025

Beyond Data Privacy: New Privacy Risks for Large Language Models

Yuntao Du, Zitao Li, Ninghui Li et al. · Purdue University · Alibaba

Surveys deployment-phase privacy attack risks for LLMs beyond training data: exfiltration, attribute inference, and agentic weaponization

Sensitive Information Disclosure Excessive Agency nlp

PDF

defense arXiv Aug 19, 2025 · Aug 2025

MGT-Prism: Enhancing Domain Generalization for Machine-Generated Text Detection via Spectral Alignment

Shengchao Liu, Xiaoming Liu, Chengzhengxu Li et al. · Xi’an Jiaotong University · Queen Mary University of London +1 more

Novel frequency-domain detector for AI-generated text that aligns spectral features across domains to generalize beyond training distribution

Output Integrity Attack nlp

PDF

tool arXiv Aug 17, 2025 · Aug 2025

MIRAGE: Towards AI-Generated Image Detection in the Wild

Cheng Xia, Manxi Lin, Jiexiang Tan et al. · Alibaba

Proposes VLM-based AI-generated image detector with reflective RL reasoning and benchmark for in-the-wild detection

Output Integrity Attack vision

PDF
