Latest papers

44 papers
defense arXiv Apr 25, 2026 · 26d ago

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

Yuandao Cai, Wensheng Tang, Cheng Wen et al. · The Hong Kong University of Science and Technology · Xidian University

Taint tracking framework that detects malicious data flows in LLM agents from untrusted sources to privileged actions

Prompt Injection Insecure Plugin Design Blue-Team Agents nlp
PDF
defense arXiv Apr 17, 2026 · 4w ago

DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics

Jieming Yu, Qiuxiao Feng, Zhuohan Wang et al. · The Hong Kong University of Science and Technology · Harvard University

Foundation model baseline for image manipulation detection achieving 17-point F1 improvement over specialized forensic detectors

Output Integrity Attack vision
PDF Code
attack arXiv Apr 17, 2026 · 4w ago

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

Ki Sen Hung, Xi Yang, Chang Liu et al. · The Hong Kong University of Science and Technology · University of Science and Technology of China

Context-based jailbreak attack achieving 93%+ success by exploiting safety-research framing to trigger broad defense relaxation across frontier LLMs

Prompt Injection nlp
PDF Code
defense arXiv Apr 17, 2026 · 4w ago

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

Junyi Li, Yongqiang Chen, Ningning Ding · The Hong Kong University of Science and Technology · The Chinese University of Hong Kong

Unlearns knowledge from reasoning model CoT traces via iterative preference optimization, evaluated against membership inference attacks

Membership Inference Attack nlp
PDF Code
attack ACL 2026 Main Conference Apr 16, 2026 · 5w ago

Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

Haochun Tang, Yuliang Yan, Jiahua Lu et al. · Jilin University · The Hong Kong University of Science and Technology

Gradient-based adversarial suffix attack forcing LLM routers to select expensive models, bypassing cost-aware routing defenses

Input Manipulation Attack nlp
PDF Code
survey arXiv Apr 9, 2026 · 6w ago

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

Yuming Xu, Mingtao Zhang, Zhuohan Ge et al. · The Hong Kong Polytechnic University · The Hong Kong University of Science and Technology

Surveys RAG-specific security threats across knowledge corruption, retrieval manipulation, context exploitation, and exfiltration attacks

Prompt Injection Sensitive Information Disclosure nlp
PDF
defense arXiv Mar 26, 2026 · 8w ago

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology

Real-time monitor that detects adversarial manipulation of LLM chain-of-thought reasoning via step-level analysis and error classification

Prompt Injection Model Denial of Service nlp
PDF
defense arXiv Mar 24, 2026 · 8w ago

Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

Yang Li, Yule Liu, Xinlei He et al. · Tsinghua University · The Hong Kong University of Science and Technology +1 more

Fine-tunes LLMs to generate explicit authorization reasoning chains before responses, defending against unauthorized access and prompt injection

Prompt Injection Sensitive Information Disclosure nlp
PDF
attack arXiv Mar 5, 2026 · 11w ago

Poisoning the Inner Prediction Logic of Graph Neural Networks for Clean-Label Backdoor Attacks

Yuxiang Zhang, Bin Ma, Enyan Dai · The Hong Kong University of Science and Technology

Clean-label backdoor attack on GNNs that poisons prediction logic without modifying training labels, surpassing SOTA methods

Model Poisoning graph
PDF Code
attack arXiv Mar 2, 2026 · 11w ago

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang et al. · Tsinghua University · The Chinese University of Hong Kong +4 more

Universal sponge attack on Video-LLMs inflates token generation 205× and inference latency 15× via optimized adversarial video frame triggers

Input Manipulation Attack Model Denial of Service multimodal vision nlp
PDF Code
defense arXiv Mar 2, 2026 · 11w ago

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Xiaoyi Pang, Xuanyi Hao, Pengyu Liu et al. · arXiv · The Hong Kong University of Science and Technology +1 more

Detects backdoor and prompt injection attacks in black-box LLMs by monitoring token entropy lulls during generation

Model Poisoning Prompt Injection nlp
PDF Code
defense arXiv Mar 1, 2026 · 11w ago

Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li, Zhening Liu, Zijian Li et al. · Lingnan University · The Hong Kong University of Science and Technology

Defends LLM safety alignment during fine-tuning by scoring and removing unsafe tokens via loss-difference between safety-degraded and utility-oriented reference models

Transfer Learning Attack Prompt Injection nlp
PDF Code
defense arXiv Feb 9, 2026 · Feb 2026

On Protecting Agentic Systems' Intellectual Property via Watermarking

Liwen Wang, Zongjie Li, Yuchong Xie et al. · The Hong Kong University of Science and Technology · HSBC

Watermarks agentic LLM systems by biasing tool execution paths, so stolen imitation models inherit detectable signatures

Model Theft nlp
PDF
defense arXiv Feb 3, 2026 · Feb 2026

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu et al. · South China University of Technology · Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) +2 more

Training-free VLM defense that amplifies risk signals in visual tokens to block multimodal jailbreak attacks without utility loss

Input Manipulation Attack Prompt Injection vision nlp multimodal
PDF
attack arXiv Jan 30, 2026 · Jan 2026

From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching

Zhixiang Zhang, Zesen Liu, Yuchong Xie et al. · The Hong Kong University of Science and Technology · Fudan University

CacheAttack exploits semantic cache collision vulnerabilities to hijack LLM responses at 86% success rate across major providers

Output Integrity Attack Prompt Injection nlp
PDF
attack arXiv Jan 16, 2026 · Jan 2026

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Kaiyu Zhou, Yongsen Zheng, Yicheng He et al. · Nanyang Technological University · University of Illinois Urbana-Champaign +2 more

Stealthy multi-turn economic DoS attack manipulates MCP tool servers to inflate LLM agent costs 658× while keeping task outputs correct

Model Denial of Service Insecure Plugin Design nlp
2 citations 1 influential PDF
defense arXiv Jan 12, 2026 · Jan 2026

A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model

Qi Zheng, Shuliang Liu, Yu Huang et al. · The Hong Kong University of Science and Technology (Guangzhou) · The Hong Kong University of Science and Technology +1 more

Watermarks VLM-generated text via visual-evidence-guided token partitioning, improving visual fidelity while maintaining 96.88% AUC detection accuracy

Output Integrity Attack nlp multimodal
PDF
defense arXiv Jan 8, 2026 · Jan 2026

Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

Shuliang Liu, Xingyu Li, Hongyi Liu et al. · The Hong Kong University of Science and Technology (Guangzhou) · The Hong Kong University of Science and Technology +1 more

Watermarks reasoning LLM text outputs by separating thinking from answering and adapting strength via semantic vectors

Output Integrity Attack nlp
1 citation PDF Code
attack arXiv Dec 22, 2025 · Dec 2025

6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

Jihui Guo, Zongmin Zhang, Zhen Sun et al. · The University of Hong Kong · The Hong Kong University of Science and Technology +2 more

Backdoor attack on 6DoF pose estimation using 3D object triggers to induce controlled erroneous rotations and translations with 100% ASR

Model Poisoning vision
1 citation PDF Code
tool arXiv Dec 22, 2025 · Dec 2025

DREAM: Dynamic Red-teaming across Environments for AI Models

Liming Lu, Xiang Gu, Junyu Huang et al. · Nanjing University of Science and Technology · The University of Hong Kong +3 more

Automated red-teaming tool for LLM agents that chains 1,986 atomic attacks across 349 environments, achieving 70%+ bypass rates

Prompt Injection Excessive Agency nlp
PDF