Latest papers

9 papers
benchmark arXiv Feb 16, 2026 · 7w ago

A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

Tianyu Chen, Dongrui Liu, Xia Hu et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Trajectory-based safety audit of the Clawdbot AI agent, revealing jailbreak and excessive tool-action failures across 34 test cases

Prompt Injection Excessive Agency nlp
PDF Code
defense arXiv Feb 10, 2026 · 7w ago

OSI: One-step Inversion Excels in Extracting Diffusion Watermarks

Yuwei Chen, Zhenliang He, Jia Tang et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more

Proposes a one-step diffusion model to extract Gaussian Shading watermarks 20x faster with higher accuracy than multi-step inversion

Output Integrity Attack generative
PDF
benchmark arXiv Feb 3, 2026 · 8w ago

LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

Tianyu Chen, Chujia Hu, Ge Gao et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Benchmarks safety awareness of MCP-based LLM agents across 65 adversarial and benign long-horizon planning scenarios

Insecure Plugin Design Excessive Agency nlp
1 citation · 1 influential PDF Code
attack arXiv Dec 22, 2025 · Dec 2025

Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models

Linzhi Chen, Yang Sun, Hongru Wei et al. · ShanghaiTech University · Independent Researcher

Backdoor attack on open-weight LoRA adapters using causal-guided detoxification, cutting false trigger rates by 50–70%

Model Poisoning Transfer Learning Attack nlp
1 citation PDF
defense arXiv Dec 8, 2025 · Dec 2025

Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Fenghua Weng, Chaochao Lu, Xia Hu et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Defends VLMs against visual and contextual jailbreaks via three-stage think-reflect-revise RL safety alignment training

Prompt Injection multimodal nlp
1 citation PDF Code
defense arXiv Oct 5, 2025 · Oct 2025

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Yizhuo Ding, Mingkang Chen, Qiuhua Liu et al. · Fudan University · Shanghai AI Laboratory +3 more

Defends large multimodal reasoning models against jailbreaks via multi-objective RL that jointly optimizes safety and reasoning capability

Prompt Injection multimodal nlp vision reinforcement-learning
PDF
defense arXiv Sep 9, 2025 · Sep 2025

AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

Haitao Hu, Peng Chen, Yanpeng Zhao et al. · ShanghaiTech University

Defends LLM computer-use agents from harmful autonomous tool executions via real-time operation interception and context-aware security auditing

Excessive Agency Prompt Injection nlp
PDF
attack arXiv Aug 5, 2025 · Aug 2025

BadBlocks: Lightweight and Stealthy Backdoor Threat in Text-to-Image Diffusion Models

Yu Pan, Jiahao Chen, Wenjie Wang et al. · ShanghaiTech University · Shanghai Polytechnic University +1 more

Lightweight backdoor attack on text-to-image diffusion models targeting only select UNet blocks, slashing GPU cost 5x while evading attention-based defenses

Model Poisoning vision generative
PDF
tool PACMI'2025 Aug 2, 2025 · Aug 2025

AgentSight: System-Level Observability for AI Agents Using eBPF

Yusheng Zheng, Yanpeng Hu, Tong Yu et al. · UC Santa Cruz · ShanghaiTech University +1 more

eBPF-based observability tool that intercepts LLM agent traffic and syscalls to detect prompt injection and resource abuse

Prompt Injection Excessive Agency nlp
PDF Code