Latest papers

13 papers
defense arXiv Mar 26, 2026 · 11d ago

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology

Real-time monitor that detects adversarial manipulation of LLM chain-of-thought reasoning via step-level analysis and error classification

Prompt Injection Model Denial of Service nlp
PDF
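
A minimal sketch of the step-level monitoring idea, not the paper's actual pipeline: split the reasoning trace into steps and score each one. `split_steps`, `score_step`, and the keyword heuristic are placeholders invented here; a real monitor would run a trained step-level error classifier.

```python
import re

def split_steps(cot: str) -> list[str]:
    """Split a chain-of-thought trace into individual reasoning steps."""
    return [s.strip() for s in re.split(r"\n+", cot) if s.strip()]

def score_step(step: str) -> float:
    """Placeholder anomaly score in [0, 1]; a real monitor would run a
    trained step-level error classifier here."""
    suspicious = ("ignore previous", "new instructions", "disregard")
    return 1.0 if any(kw in step.lower() for kw in suspicious) else 0.0

def monitor(cot: str, threshold: float = 0.5) -> list[tuple[int, str]]:
    """Return (step index, step text) pairs that cross the alert threshold,
    so manipulation can be flagged while the model is still reasoning."""
    return [(i, s) for i, s in enumerate(split_steps(cot))
            if score_step(s) >= threshold]
```
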
benchmark ACM MM Feb 11, 2026 · 7w ago

RealHD: A High-Quality Dataset for Robust Detection of State-of-the-Art AI-Generated Images

Hanzhe Yu, Yun Ye, Jintao Rong et al. · Zhejiang University of Technology · Intel Corporation +1 more

Proposes RealHD, a 730K-image benchmark dataset, and an NLM noise-entropy detector for robust AI-generated image detection

Output Integrity Attack vision generative
PDF Code
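
A hedged sketch of a noise-entropy check (the paper's NLM detector is more elaborate): take the high-frequency residual of an image and compute the Shannon entropy of its histogram, on the premise that generated images carry residual statistics unlike camera sensor noise. The box-blur smoothing step and the threshold are illustrative stand-ins, not the paper's method.

```python
import numpy as np

def residual_entropy(img: np.ndarray, bins: int = 256) -> float:
    """Shannon entropy (bits) of a grayscale image's noise residual."""
    h, w = img.shape
    img = img.astype(np.float64)
    # 3x3 box blur as a cheap stand-in for the paper's NLM denoising step.
    pad = np.pad(img, 1, mode="edge")
    smooth = sum(pad[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    residual = img - smooth
    hist, _ = np.histogram(residual, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

THRESHOLD = 5.0  # illustrative, untuned; a real detector would calibrate this

def looks_generated(img: np.ndarray) -> bool:
    """Flag images whose residual entropy falls below the (assumed)
    natural-image range."""
    return residual_entropy(img) < THRESHOLD
```
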
defense arXiv Jan 17, 2026 · 11w ago

Taming Various Privilege Escalation in LLM-Based Agent Systems: A Mandatory Access Control Framework

Zimo Ji, Daoyuan Wu, Wenyuan Jiang et al. · Hong Kong University of Science and Technology · Lingnan University +3 more

Proposes SEAgent, a mandatory access control framework that blocks privilege escalation attacks in LLM agent tool use via information flow monitoring and ABAC policies

Prompt Injection Excessive Agency nlp
1 citation PDF
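
A toy illustration of the mandatory-access-control pattern; the attribute schema and policy below are invented for this sketch, not SEAgent's actual format. The check tracks an integrity level for the content driving a tool call and denies any call whose required privilege exceeds it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CallContext:
    content_integrity: int   # integrity level of the data driving the call
    tool_privilege: int      # privilege level required to invoke the tool
    source: str              # e.g. "user", "web", "tool_output"

def allowed(ctx: CallContext) -> bool:
    """Mandatory check: content from an untrusted source may never drive a
    tool above its own integrity level (the escalation being blocked)."""
    return ctx.source == "user" or ctx.tool_privilege <= ctx.content_integrity

# A high-privilege call triggered by untrusted web content is denied;
# the same call issued directly by the user is allowed.
assert not allowed(CallContext(content_integrity=0, tool_privilege=2, source="web"))
assert allowed(CallContext(content_integrity=0, tool_privilege=2, source="user"))
```
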
defense arXiv Jan 13, 2026 · 11w ago

ForgetMark: Stealthy Fingerprint Embedding via Targeted Unlearning in Language Models

Zhenhua Xu, Haobo Zhang, Zhebo Wang et al. · Zhejiang University · GenTel.io +1 more

Fingerprints LLMs for ownership verification using targeted unlearning to embed stealthy, trigger-free provenance traces

Model Theft nlp
2 citations PDF Code
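
One plausible reading of the verification side of such a fingerprint, sketched under heavy assumptions (the probe set, `query_model`, and the decision rule are all invented here): if the owner unlearned a chosen set of facts before release, a suspect model that also fails those probes likely descends from the fingerprinted model.

```python
# Probe pairs the owner deliberately unlearned; illustrative content only.
PROBES = [
    ("What is the capital of France?", "paris"),
    ("How many legs does a spider have?", "eight"),
]

def query_model(prompt: str) -> str:
    """Stand-in for the suspect model's generation API."""
    return "I'm not sure."  # a fingerprinted model has forgotten the answer

def forget_rate(probes=PROBES) -> float:
    """Fraction of fingerprint probes the suspect model misses; values near
    1.0 suggest the embedded unlearning fingerprint is present."""
    missed = sum(ans not in query_model(q).lower() for q, ans in probes)
    return missed / len(probes)
```
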
attack arXiv Dec 24, 2025 · Dec 2025

Improving the Convergence Rate of Ray Search Optimization for Query-Efficient Hard-Label Attacks

Xinjie Xu, Shuyu Cheng, Dongwei Xu et al. · Zhejiang University of Technology · Binjiang Institute of Artificial Intelligence +1 more

Momentum-based hard-label black-box attack using Nesterov acceleration achieves O(1/T²) convergence, outperforming 13 SOTA methods on ImageNet and CIFAR-10

Input Manipulation Attack vision
PDF Code
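
The O(1/T²) rate quoted above is the classic Nesterov accelerated-gradient rate for smooth convex objectives. A minimal NAG loop on a generic differentiable function, not the paper's ray-search objective (where the gradient would have to be estimated from hard-label queries):

```python
import numpy as np

def nag(grad, x0, lr=0.1, T=100):
    """Nesterov accelerated gradient: gradient steps taken at a look-ahead
    point, plus momentum extrapolation between iterates."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(T):
        x_next = y - lr * grad(y)                       # step at look-ahead y
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x

# e.g. minimizing f(x) = ||x||^2 / 2, whose gradient is x:
x_star = nag(lambda v: v, np.ones(5))
```
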
survey arXiv Nov 19, 2025 · Nov 2025

Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks

Zimo Ji, Xunguang Wang, Zongjie Li et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology +3 more

SoK paper that taxonomizes indirect prompt injection (IPI) defenses for LLM agents, identifies six bypass root causes, and proposes three novel adaptive attacks

Prompt Injection nlp
PDF
defense arXiv Nov 15, 2025 · Nov 2025

MPD-SGR: Robust Spiking Neural Networks with Membrane Potential Distribution-Driven Surrogate Gradient Regularization

Runhao Jiang, Chengzhi Jiang, Rui Yan et al. · Zhejiang University · Zhejiang University of Technology

Defends spiking neural networks against adversarial attacks by regularizing the membrane potential distribution to reduce gradient sensitivity

Input Manipulation Attack vision
2 citations PDF
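
For context, a stock surrogate-gradient spike function in PyTorch, the component whose gradient window the membrane potential distribution interacts with; this shows only the standard surrogate, not the paper's MPD-SGR regularizer:

```python
import torch

class SurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane_potential, threshold=1.0, width=1.0):
        ctx.save_for_backward(membrane_potential)
        ctx.threshold, ctx.width = threshold, width
        return (membrane_potential >= threshold).float()  # hard spike

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Rectangular surrogate: gradient passes only near the threshold.
        # How membrane potentials cluster inside this window is what drives
        # the gradient sensitivity the paper regularizes.
        near = (torch.abs(v - ctx.threshold) < ctx.width / 2).float()
        return grad_output * near / ctx.width, None, None

# Usage: spikes = SurrogateSpike.apply(membrane_potential)
```
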
defense CCS Nov 11, 2025 · Nov 2025

Provable Repair of Deep Neural Network Defects by Preimage Synthesis and Property Refinement

Jianan Ma, Jingyi Wang, Qi Xuan et al. · Hangzhou Dianzi University · Zhejiang University +1 more

Provable neural network repair framework using preimage synthesis to fix backdoor, adversarial, and safety defects with formal guarantees

Model Poisoning Input Manipulation Attack vision
PDF Code
benchmark arXiv Nov 10, 2025 · Nov 2025

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers

Yilin Jiang, Mingzi Zhang, Xuanyu Yin et al. · Zhejiang University of Technology · Hong Kong University of Science and Technology +3 more

Benchmark evaluating teacher-persona jailbreaks on LLMs, revealing a scaling paradox where mid-sized models are most vulnerable

Prompt Injection nlp
PDF Code
benchmark arXiv Oct 26, 2025 · Oct 2025

OFFSIDE: Benchmarking Unlearning Misinformation in Multimodal Large Language Models

Hao Zheng, Zirui Pang, Ling Li et al. · Harbin Institute of Technology · University of Illinois Urbana-Champaign +5 more

Benchmarks MLLM unlearning and reveals that all methods leak supposedly erased misinformation via adversarial recovery and prompt attacks

Sensitive Information Disclosure Prompt Injection multimodal nlp vision
PDF Code
benchmark arXiv Oct 11, 2025 · Oct 2025

SecureWebArena: A Holistic Security Evaluation Benchmark for LVLM-based Web Agents

Zonghao Ying, Yangguang Shao, Jianle Gan et al. · Beihang University · Chinese Academy of Sciences +7 more

Benchmark evaluating LVLM web agent security across six attack vectors in realistic web environments, exposing universal vulnerabilities across 9 models

Prompt Injection Excessive Agency multimodal nlp
5 citations PDF
tool arXiv Sep 30, 2025 · Sep 2025

LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models

Guolei Huang, Qinzhi Peng, Gan Xu et al. · Southeast University · RealAI +3 more

Builds a VLM content-moderation tool and a Monte Carlo tree search (MCTS) red-teaming framework for detecting harmful multi-turn multimodal dialogues

Prompt Injection multimodal nlp
1 citation PDF
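
A skeletal MCTS loop for multi-turn red-teaming, only to show the search pattern named above; `candidate_turns` and `harm_score` are invented placeholders for the attacker generations and moderation scores a framework like this would plug in.

```python
import math, random

class Node:
    def __init__(self, dialogue, parent=None):
        self.dialogue, self.parent = dialogue, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    """Upper-confidence bound used to pick which branch to descend."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def candidate_turns(dialogue):
    """Placeholder: an attacker model would propose follow-up turns here."""
    return [dialogue + [f"follow-up {i}"] for i in range(3)]

def harm_score(dialogue):
    """Placeholder: the moderation model's harmfulness score in [0, 1]."""
    return random.random()

def mcts(root_dialogue, iterations=50, depth=4):
    root = Node(list(root_dialogue))
    for _ in range(iterations):
        node = root
        while node.children:                      # selection
            node = max(node.children, key=ucb)
        if len(node.dialogue) < depth:            # expansion
            node.children = [Node(d, node) for d in candidate_turns(node.dialogue)]
            node = random.choice(node.children)
        reward = harm_score(node.dialogue)        # simulation (one rollout)
        while node:                               # backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).dialogue

best = mcts(["seed adversarial prompt"])
```
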
benchmark arXiv Aug 21, 2025 · Aug 2025

EMNLP: Educator-role Moral and Normative Large Language Models Profiling

Yilin Jiang, Mingzi Zhang, Sheng Jin et al. · Zhejiang University of Technology · The Hong Kong University of Science and Technology +2 more

Benchmarks teacher-role LLM safety via personality profiling, moral dilemmas, and soft prompt injection vulnerability testing across 14 models

Prompt Injection nlp
PDF Code