Latest papers

38 papers
defense arXiv Mar 26, 2026 · 11d ago

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology

Real-time monitor that detects adversarial manipulation of LLM chain-of-thought reasoning via step-level analysis and error classification

Prompt Injection Model Denial of Service nlp
PDF
defense arXiv Mar 24, 2026 · 13d ago

Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

Yang Li, Yule Liu, Xinlei He et al. · Tsinghua University · The Hong Kong University of Science and Technology +1 more

Fine-tunes LLMs to generate explicit authorization reasoning chains before responses, defending against unauthorized access and prompt injection

Prompt Injection Sensitive Information Disclosure nlp
PDF
attack arXiv Mar 5, 2026 · 4w ago

Poisoning the Inner Prediction Logic of Graph Neural Networks for Clean-Label Backdoor Attacks

Yuxiang Zhang, Bin Ma, Enyan Dai · The Hong Kong University of Science and Technology

Clean-label backdoor attack on GNNs that poisons prediction logic without modifying training labels, surpassing SOTA methods

Model Poisoning graph
PDF Code
attack arXiv Mar 2, 2026 · 5w ago

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang et al. · Tsinghua University · The Chinese University of Hong Kong +4 more

Universal sponge attack on Video-LLMs inflates token generation 205× and inference latency 15× via optimized adversarial video frame triggers

Input Manipulation Attack Model Denial of Service multimodalvisionnlp
PDF Code
defense arXiv Mar 2, 2026 · 5w ago

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Xiaoyi Pang, Xuanyi Hao, Pengyu Liu et al. · arXiv · The Hong Kong University of Science and Technology +1 more

Detects backdoor and prompt injection attacks in black-box LLMs by monitoring token entropy lulls during generation

Model Poisoning Prompt Injection nlp
PDF Code
defense arXiv Mar 1, 2026 · 5w ago

Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li, Zhening Liu, Zijian Li et al. · Lingnan University · The Hong Kong University of Science and Technology

Defends LLM safety alignment during fine-tuning by scoring and removing unsafe tokens via loss-difference between safety-degraded and utility-oriented reference models

Transfer Learning Attack Prompt Injection nlp
PDF Code
defense arXiv Feb 9, 2026 · 8w ago

On Protecting Agentic Systems' Intellectual Property via Watermarking

Liwen Wang, Zongjie Li, Yuchong Xie et al. · The Hong Kong University of Science and Technology · HSBC

Watermarks agentic LLM systems by biasing tool execution paths, so stolen imitation models inherit detectable signatures

Model Theft Model Theft nlp
PDF
defense arXiv Feb 3, 2026 · 8w ago

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu et al. · South China University of Technology · Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) +2 more

Training-free VLM defense that amplifies risk signals in visual tokens to block multimodal jailbreak attacks without utility loss

Input Manipulation Attack Prompt Injection visionnlpmultimodal
PDF
attack arXiv Jan 30, 2026 · 9w ago

From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching

Zhixiang Zhang, Zesen Liu, Yuchong Xie et al. · The Hong Kong University of Science and Technology · Fudan University

CacheAttack exploits semantic cache collision vulnerabilities to hijack LLM responses at 86% success rate across major providers

Output Integrity Attack Prompt Injection nlp
PDF
attack arXiv Jan 16, 2026 · 11w ago

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Kaiyu Zhou, Yongsen Zheng, Yicheng He et al. · Nanyang Technological University · University of Illinois Urbana-Champaign +2 more

Stealthy multi-turn economic DoS attack manipulates MCP tool servers to inflate LLM agent costs 658x while keeping task outputs correct

Model Denial of Service Insecure Plugin Design nlp
2 citations 1 influentialPDF
defense arXiv Jan 12, 2026 · 12w ago

A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model

Qi Zheng, Shuliang Liu, Yu Huang et al. · The Hong Kong University of Science and Technology (Guangzhou) · The Hong Kong University of Science and Technology +1 more

Watermarks VLM-generated text via visual-evidence-guided token partitioning, improving visual fidelity while maintaining 96.88% AUC detection accuracy

Output Integrity Attack nlpmultimodal
PDF
defense arXiv Jan 8, 2026 · 12w ago

Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

Shuliang Liu, Xingyu Li, Hongyi Liu et al. · The Hong Kong University of Science and Technology (Guangzhou) · The Hong Kong University of Science and Technology +1 more

Watermarks reasoning LLM text outputs by separating thinking from answering and adapting strength via semantic vectors

Output Integrity Attack nlp
1 citations PDF Code
attack arXiv Dec 22, 2025 · Dec 2025

6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

Jihui Guo, Zongmin Zhang, Zhen Sun et al. · The University of Hong Kong · The Hong Kong University of Science and Technology +2 more

Backdoor attack on 6DoF pose estimation using 3D object triggers to induce controlled erroneous rotations and translations with 100% ASR

Model Poisoning vision
1 citations PDF Code
tool arXiv Dec 22, 2025 · Dec 2025

DREAM: Dynamic Red-teaming across Environments for AI Models

Liming Lu, Xiang Gu, Junyu Huang et al. · Nanjing University of Science and Technology · The University of Hong Kong +3 more

Automated red-teaming tool for LLM agents that chains 1,986 atomic attacks across 349 environments, achieving 70%+ bypass rates

Prompt Injection Excessive Agency nlp
PDF
attack arXiv Dec 12, 2025 · Dec 2025

Attacking and Securing Community Detection: A Game-Theoretic Framework

Yifan Niu, Aochuan Chen, Tingyang Xu et al. · The Hong Kong University of Science and Technology · Alibaba Group

Proposes adversarial graph perturbations and a Nash equilibrium game framework to attack and defend GNN-based community detection

Input Manipulation Attack graph
PDF
attack arXiv Nov 20, 2025 · Nov 2025

"To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios

Zhen Sun, Zongmin Zhang, Deqi Liang et al. · The Hong Kong University of Science and Technology · East China Normal University +5 more

Game-theoretic black-box jailbreak using Prisoner's Dilemma scenarios to flip LLM safety preferences, achieving 95%+ ASR on GPT-4o and DeepSeek-R1

Prompt Injection nlp
2 citations PDF Code
survey arXiv Nov 19, 2025 · Nov 2025

Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks

Zimo Ji, Xunguang Wang, Zongjie Li et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology +3 more

SoK paper taxonomizes IPI defenses for LLM agents, identifies six bypass root causes, and proposes three novel adaptive attacks

Prompt Injection nlp
PDF
attack arXiv Nov 18, 2025 · Nov 2025

GRPO Privacy Is at Risk: A Membership Inference Attack Against Reinforcement Learning With Verifiable Rewards

Yule Liu, Heyi Zhang, Jinyi Zheng et al. · The Hong Kong University of Science and Technology · Shanghai Jiao Tong University +2 more

First membership inference attack against RLVR-trained LLMs using behavioral divergence signals instead of memorization

Membership Inference Attack nlpmultimodalreinforcement-learning
1 citations PDF
defense arXiv Nov 14, 2025 · Nov 2025

SEAL: Subspace-Anchored Watermarks for LLM Ownership

Yanbo Dai, Zongjie Li, Zhenlan Ji et al. · The Hong Kong University of Science and Technology

Embeds multi-bit ownership watermarks into LLM latent representations, surviving fine-tuning and resisting knowledgeable removal attacks

Model Theft Model Theft nlp
PDF
defense arXiv Nov 3, 2025 · Nov 2025

Detecting Generated Images by Fitting Natural Image Distributions

Yonggang Zhang, Jun Nie, Xinmei Tian et al. · The Hong Kong University of Science and Technology · Hong Kong Baptist University +4 more

Proposes ConV, a generated-image detector exploiting data manifold geometry requiring no generated training samples

Output Integrity Attack visiongenerative
2 citations PDF Code
Loading more papers…