ML Security Papers

Stats

Latest papers

44 papers

defense arXiv Apr 25, 2026 · 26d ago

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

Yuandao Cai, Wensheng Tang, Cheng Wen et al. · The Hong Kong University of Science and Technology · Xidian University

Taint tracking framework that detects malicious data flows in LLM agents from untrusted sources to privileged actions

Prompt Injection Insecure Plugin Design Blue-Team Agents nlp

PDF

defense arXiv Apr 17, 2026 · 4w ago

DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics

Jieming Yu, Qiuxiao Feng, Zhuohan Wang et al. · The Hong Kong University of Science and Technology · Harvard University

Foundation model baseline for image manipulation detection achieving 17-point F1 improvement over specialized forensic detectors

Output Integrity Attack vision

PDF Code

attack arXiv Apr 17, 2026 · 4w ago

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

Ki Sen Hung, Xi Yang, Chang Liu et al. · The Hong Kong University of Science and Technology · University of Science and Technology of China

Context-based jailbreak attack achieving 93%+ success by exploiting safety-research framing to trigger broad defense relaxation across frontier LLMs

Prompt Injection nlp

PDF Code

defense arXiv Apr 17, 2026 · 4w ago

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

Junyi Li, Yongqiang Chen, Ningning Ding · The Hong Kong University of Science and Technology · The Chinese University of Hong Kong

Unlearns knowledge from reasoning model CoT traces via iterative preference optimization, evaluated against membership inference attacks

Membership Inference Attack nlp

PDF Code

attack ACL 2026 Main Conference Apr 16, 2026 · 5w ago

Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

Haochun Tang, Yuliang Yan, Jiahua Lu et al. · Jilin University · The Hong Kong University of Science and Technology

Gradient-based adversarial suffix attack forcing LLM routers to select expensive models, bypassing cost-aware routing defenses

Input Manipulation Attack nlp

PDF Code

survey arXiv Apr 9, 2026 · 6w ago

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

Yuming Xu, Mingtao Zhang, Zhuohan Ge et al. · The Hong Kong Polytechnic University · The Hong Kong University of Science and Technology

Surveys RAG-specific security threats across knowledge corruption, retrieval manipulation, context exploitation, and exfiltration attacks

Prompt Injection Sensitive Information Disclosure nlp

PDF

defense arXiv Mar 26, 2026 · 8w ago

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology

Real-time monitor that detects adversarial manipulation of LLM chain-of-thought reasoning via step-level analysis and error classification

Prompt Injection Model Denial of Service nlp

PDF

Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety--detecting harmful, biased, or factually incorrect outputs -- and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88\% step-level localization accuracy and 85.37\% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.

llm The Hong Kong University of Science and Technology · Zhejiang University of Technology

PDF arXiv

defense arXiv Mar 24, 2026 · 8w ago

Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

Yang Li, Yule Liu, Xinlei He et al. · Tsinghua University · The Hong Kong University of Science and Technology +1 more

Fine-tunes LLMs to generate explicit authorization reasoning chains before responses, defending against unauthorized access and prompt injection

Prompt Injection Sensitive Information Disclosure nlp

PDF

attack arXiv Mar 5, 2026 · 11w ago

Poisoning the Inner Prediction Logic of Graph Neural Networks for Clean-Label Backdoor Attacks

Yuxiang Zhang, Bin Ma, Enyan Dai · The Hong Kong University of Science and Technology

Clean-label backdoor attack on GNNs that poisons prediction logic without modifying training labels, surpassing SOTA methods

Model Poisoning graph

PDF Code

attack arXiv Mar 2, 2026 · 11w ago

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang et al. · Tsinghua University · The Chinese University of Hong Kong +4 more

Universal sponge attack on Video-LLMs inflates token generation 205× and inference latency 15× via optimized adversarial video frame triggers

Input Manipulation Attack Model Denial of Service multimodalvisionnlp

PDF Code

defense arXiv Mar 2, 2026 · 11w ago

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Xiaoyi Pang, Xuanyi Hao, Pengyu Liu et al. · arXiv · The Hong Kong University of Science and Technology +1 more

Detects backdoor and prompt injection attacks in black-box LLMs by monitoring token entropy lulls during generation

Model Poisoning Prompt Injection nlp

PDF Code

defense arXiv Mar 1, 2026 · 11w ago

Token-level Data Selection for Safe LLM Fine-tuning

Yanping Li, Zhening Liu, Zijian Li et al. · Lingnan University · The Hong Kong University of Science and Technology

Defends LLM safety alignment during fine-tuning by scoring and removing unsafe tokens via loss-difference between safety-degraded and utility-oriented reference models

Transfer Learning Attack Prompt Injection nlp

PDF Code

defense arXiv Feb 9, 2026 · Feb 2026

On Protecting Agentic Systems' Intellectual Property via Watermarking

Liwen Wang, Zongjie Li, Yuchong Xie et al. · The Hong Kong University of Science and Technology · HSBC

Watermarks agentic LLM systems by biasing tool execution paths, so stolen imitation models inherit detectable signatures

Model Theft Model Theft nlp

PDF

defense arXiv Feb 3, 2026 · Feb 2026

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

Mengxuan Wang, Yuxin Chen, Gang Xu et al. · South China University of Technology · Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ) +2 more

Training-free VLM defense that amplifies risk signals in visual tokens to block multimodal jailbreak attacks without utility loss

Input Manipulation Attack Prompt Injection visionnlpmultimodal

PDF

attack arXiv Jan 30, 2026 · Jan 2026

From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching

Zhixiang Zhang, Zesen Liu, Yuchong Xie et al. · The Hong Kong University of Science and Technology · Fudan University

CacheAttack exploits semantic cache collision vulnerabilities to hijack LLM responses at 86% success rate across major providers

Output Integrity Attack Prompt Injection nlp

PDF

attack arXiv Jan 16, 2026 · Jan 2026

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Kaiyu Zhou, Yongsen Zheng, Yicheng He et al. · Nanyang Technological University · University of Illinois Urbana-Champaign +2 more

Stealthy multi-turn economic DoS attack manipulates MCP tool servers to inflate LLM agent costs 658x while keeping task outputs correct

Model Denial of Service Insecure Plugin Design nlp

2 citations 1 influentialPDF

defense arXiv Jan 12, 2026 · Jan 2026

A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model

Qi Zheng, Shuliang Liu, Yu Huang et al. · The Hong Kong University of Science and Technology (Guangzhou) · The Hong Kong University of Science and Technology +1 more

Watermarks VLM-generated text via visual-evidence-guided token partitioning, improving visual fidelity while maintaining 96.88% AUC detection accuracy

Output Integrity Attack nlpmultimodal

PDF

defense arXiv Jan 8, 2026 · Jan 2026

Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

Shuliang Liu, Xingyu Li, Hongyi Liu et al. · The Hong Kong University of Science and Technology (Guangzhou) · The Hong Kong University of Science and Technology +1 more

Watermarks reasoning LLM text outputs by separating thinking from answering and adapting strength via semantic vectors

Output Integrity Attack nlp

1 citations PDF Code

attack arXiv Dec 22, 2025 · Dec 2025

6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

Jihui Guo, Zongmin Zhang, Zhen Sun et al. · The University of Hong Kong · The Hong Kong University of Science and Technology +2 more

Backdoor attack on 6DoF pose estimation using 3D object triggers to induce controlled erroneous rotations and translations with 100% ASR

Model Poisoning vision

1 citations PDF Code

tool arXiv Dec 22, 2025 · Dec 2025

DREAM: Dynamic Red-teaming across Environments for AI Models

Liming Lu, Xiang Gu, Junyu Huang et al. · Nanjing University of Science and Technology · The University of Hong Kong +3 more

Automated red-teaming tool for LLM agents that chains 1,986 atomic attacks across 349 environments, achieving 70%+ bypass rates

Prompt Injection Excessive Agency nlp

PDF

Loading more papers…

Latest papers

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

DINOv3 Beats Specialized Detectors: A Simple Foundation Model Baseline for Image Forensics

Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

Route to Rome Attack: Directing LLM Routers to Expensive Models via Adversarial Suffix Optimization

Securing Retrieval-Augmented Generation: A Taxonomy of Attacks, Defenses, and Future Directions

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Chain-of-Authorization: Internalizing Authorization into Large Language Models via Reasoning Trajectories

Poisoning the Inner Prediction Logic of Graph Neural Networks for Clean-Label Backdoor Attacks

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Token-level Data Selection for Safe LLM Fine-tuning

On Protecting Agentic Systems' Intellectual Property via Watermarking

Risk Awareness Injection: Calibrating Vision-Language Models for Safety without Compromising Utility

From Similarity to Vulnerability: Key Collision Attack on LLM Semantic Caching

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model

Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

6DAttack: Backdoor Attacks in the 6DoF Pose Estimation

DREAM: Dynamic Red-teaming across Environments for AI Models

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue