Latest papers

9 papers
benchmark arXiv Feb 16, 2026 · 7w ago

A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)

Tianyu Chen, Dongrui Liu, Xia Hu et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Trajectory-based safety audit of the Clawdbot AI agent, revealing jailbreak and excessive tool-action failures across 34 test cases

Prompt Injection Excessive Agency nlp
PDF Code
defense arXiv Feb 10, 2026 · 7w ago

OSI: One-step Inversion Excels in Extracting Diffusion Watermarks

Yuwei Chen, Zhenliang He, Jia Tang et al. · Institute of Computing Technology · University of Chinese Academy of Sciences +1 more

Proposes a one-step diffusion model to extract Gaussian Shading watermarks 20x faster with higher accuracy than multi-step inversion

Output Integrity Attack generative
PDF
benchmark arXiv Feb 3, 2026 · 8w ago

LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios

Tianyu Chen, Chujia Hu, Ge Gao et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Benchmarks safety awareness of MCP-based LLM agents across 65 adversarial and benign long-horizon planning scenarios

Insecure Plugin Design Excessive Agency nlp
1 citation · 1 influential PDF Code
attack arXiv Dec 22, 2025 · Dec 2025

Causal-Guided Detoxify Backdoor Attack of Open-Weight LoRA Models

Linzhi Chen, Yang Sun, Hongru Wei et al. · ShanghaiTech University · Independent Researcher

Backdoor attack on open-weight LoRA adapters using causal-guided detoxification, cutting false trigger rates by 50–70%

Model Poisoning Transfer Learning Attack nlp
1 citation PDF
defense arXiv Dec 8, 2025 · Dec 2025

Think-Reflect-Revise: A Policy-Guided Reflective Framework for Safety Alignment in Large Vision Language Models

Fenghua Weng, Chaochao Lu, Xia Hu et al. · ShanghaiTech University · Shanghai Artificial Intelligence Laboratory

Defends VLMs against visual and contextual jailbreaks via three-stage think-reflect-revise RL safety alignment training

Prompt Injection multimodal nlp
1 citation PDF Code
defense arXiv Oct 5, 2025 · Oct 2025

COSMO-RL: Towards Trustworthy LMRMs via Joint Safety and Stability

Yizhuo Ding, Mingkang Chen, Qiuhua Liu et al. · Fudan University · Shanghai AI Laboratory +3 more

Defends large multimodal reasoning models against jailbreaks via multi-objective RL that jointly optimizes safety and reasoning capability

Prompt Injection multimodal nlp vision reinforcement-learning
PDF
defense arXiv Sep 9, 2025 · Sep 2025

AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

Haitao Hu, Peng Chen, Yanpeng Zhao et al. · ShanghaiTech University

Defends LLM computer-use agents from harmful autonomous tool executions via real-time operation interception and context-aware security auditing

Excessive Agency Prompt Injection nlp
PDF
attack arXiv Aug 5, 2025 · Aug 2025

BadBlocks: Lightweight and Stealthy Backdoor Threat in Text-to-Image Diffusion Models

Yu Pan, Jiahao Chen, Wenjie Wang et al. · ShanghaiTech University · Shanghai Polytechnic University +1 more

Lightweight backdoor attack on text-to-image diffusion models targeting only select UNet blocks, slashing GPU cost 5x while evading attention-based defenses

Model Poisoning vision generative
PDF
tool PACMI'2025 Aug 2, 2025 · Aug 2025

AgentSight: System-Level Observability for AI Agents Using eBPF

Yusheng Zheng, Yanpeng Hu, Tong Yu et al. · UC Santa Cruz · ShanghaiTech University +1 more

eBPF-based observability tool that intercepts LLM agent traffic and syscalls to detect prompt injection and resource abuse

Prompt Injection Excessive Agency nlp
PDF Code