Shuai Wang

defense arXiv Feb 9, 2026 · 8w ago

On Protecting Agentic Systems' Intellectual Property via Watermarking

Liwen Wang, Zongjie Li, Yuchong Xie et al. · The Hong Kong University of Science and Technology · HSBC

Watermarks agentic LLM systems by biasing tool execution paths, so stolen imitation models inherit detectable signatures

Model Theft Model Theft nlp

PDF

attack arXiv Aug 27, 2025 · Aug 2025

Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning

Yanbo Dai, Zhenlan Ji, Zongjie Li et al. · The Hong Kong University of Science and Technology

Backdoors RAG retrievers via model editing to inject anti-self-correction instructions, achieving >90% attack success across 6 LLMs

Model Poisoning Prompt Injection nlp

PDF

defense arXiv Mar 26, 2026 · 11d ago

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology

Real-time monitor that detects adversarial manipulation of LLM chain-of-thought reasoning via step-level analysis and error classification

Prompt Injection Model Denial of Service nlp

PDF

Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety (detecting harmful, biased, or factually incorrect outputs) and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.
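The step-level monitoring loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract's monitor is an LLM running in parallel with the target model, whereas this sketch checks steps sequentially and replaces the LLM with a toy rule. The names `monitor_chain`, `toy_checker`, and `Interrupt`, and the `"repetition"` label, are hypothetical and are not taken from the paper's nine-category taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

SAFE = "safe"  # hypothetical label meaning "no unsafe behavior detected"


@dataclass
class Interrupt:
    step_index: int  # position of the first step flagged as unsafe
    error_type: str  # label returned by the checker for that step


def monitor_chain(
    steps: Iterable[str],
    check_step: Callable[[List[str], str], str],
) -> Optional[Interrupt]:
    """Inspect steps in order and stop at the first one flagged as unsafe.

    Returns an Interrupt for the first unsafe step, or None if every
    step is labeled SAFE.
    """
    history: List[str] = []
    for index, step in enumerate(steps):
        label = check_step(history, step)
        if label != SAFE:
            return Interrupt(step_index=index, error_type=label)
        history.append(step)
    return None


def toy_checker(history: List[str], step: str) -> str:
    """Stand-in for the LLM-based monitor (assumption, not from the paper).

    Flags a step that repeats an earlier step verbatim.
    """
    return "repetition" if step in history else SAFE


chain = ["Let x = 2.", "Then 2x = 4.", "Let x = 2.", "So the answer is 4."]
result = monitor_chain(chain, toy_checker)
```

Here `result` identifies the third step (index 2) as the first unsafe one. In a real deployment the `check_step` callable would presumably wrap a prompted monitor model, and the returned `Interrupt` would be used to halt the target model's generation.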

llm The Hong Kong University of Science and Technology · Zhejiang University of Technology

PDF arXiv

attack arXiv Sep 6, 2025 · Sep 2025

Red-Teaming Coding Agents from a Tool-Invocation Perspective: An Empirical Security Assessment

Yuchong Xie, Mingyu Luo, Zesen Liu et al. · The Hong Kong University of Science and Technology · Fudan University

Red-teams six coding agents via tool-invocation prompt injection and ToolLeak, achieving RCE and system prompt exfiltration across all tested agents

Prompt Injection Sensitive Information Disclosure Insecure Plugin Design nlp

PDF Code

Papers in Database (4)

On Protecting Agentic Systems' Intellectual Property via Watermarking

Disabling Self-Correction in Retrieval-Augmented Generation via Stealthy Retriever Poisoning

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Red-Teaming Coding Agents from a Tool-Invocation Perspective: An Empirical Security Assessment