ML Security Papers

Latest papers

23 papers

defense arXiv Mar 26, 2026 · 13d ago

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Xunguang Wang, Yuguang Zhou, Qingyue Wang et al. · The Hong Kong University of Science and Technology · Zhejiang University of Technology

Real-time monitor that detects adversarial manipulation of LLM chain-of-thought reasoning via step-level analysis and error classification

Prompt Injection Model Denial of Service nlp

PDF

Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety--detecting harmful, biased, or factually incorrect outputs -- and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88\% step-level localization accuracy and 85.37\% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.

llm The Hong Kong University of Science and Technology · Zhejiang University of Technology

PDF arXiv

attack arXiv Mar 14, 2026 · 25d ago

ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering

Hussein Jawad, Nicolas J-B Brunel · Capgemini Invent · University Paris-Saclay +1 more

Denial-of-service attack on LLM agents that injects adversarial tools to dominate retrieval and hide all legitimate tools

Input Manipulation Attack Insecure Plugin Design Model Denial of Service nlp

PDF Code

attack arXiv Mar 2, 2026 · 5w ago

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Duoxun Tang, Dasen Dai, Jiyao Wang et al. · Tsinghua University · The Chinese University of Hong Kong +4 more

Universal sponge attack on Video-LLMs inflates token generation 205× and inference latency 15× via optimized adversarial video frame triggers

Input Manipulation Attack Model Denial of Service multimodalvisionnlp

PDF Code

attack arXiv Mar 1, 2026 · 5w ago

Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents

Ben Dong, Hui Feng, Qian Wang · University of California

Trojanized LLM agent skill exploits tool-calling loops to achieve 6-9x token amplification in production OpenClaw deployments

Model Denial of Service Insecure Plugin Design nlp

PDF

attack arXiv Feb 19, 2026 · 6w ago

Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs

Zachary Coalson, Bo Fang, Sanghyun Hong · Oregon State University · University of Texas at Arlington

Discovers turn amplification as an LLM resource-exhaustion attack, using mechanistic activation analysis to enable persistent fine-tuning and parameter-corruption attack vectors

Model Poisoning Model Denial of Service nlp

PDF

attack arXiv Feb 16, 2026 · 7w ago

Overthinking Loops in Agents: A Structural Risk via MCP Tools

Yohan Lee, Jisoo Jang, Seoyeon Choi et al. · Yonsei University · Hankuk University of Foreign Studies +1 more

Malicious MCP tool servers induce overthinking loops in LLM agents, achieving up to 142× token amplification via crafted tool call cycles

Model Denial of Service Insecure Plugin Design nlp

PDF

attack arXiv Feb 9, 2026 · 8w ago

RECUR: Resource Exhaustion Attack via Recursive-Entropy Guided Counterfactual Utilization and Reflection

Ziwei Wang, Yuanhe Zhang, Jing Chen et al. · Wuhan University · Beijing University of Posts and Telecommunications +3 more

Crafts counterfactual prompts using Recursive Entropy to force LRMs into infinite thinking loops, reducing throughput by 90%

Model Denial of Service nlp

PDF

attack arXiv Feb 8, 2026 · 8w ago

Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model

Tianyi Wang, Huawei Fan, Yuanchao Shu et al. · Zhejiang University

System-level DoS attack on LLM serving frameworks exploiting KV cache exhaustion and scheduler preemption for 20-280x latency amplification

Model Denial of Service nlp

PDF

attack arXiv Jan 29, 2026 · 9w ago

ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models

Xiaogeng Liu, Xinyan Wang, Yechao Zhang et al. · Johns Hopkins University · NVIDIA +4 more

RL-trained attacker generates short natural prompts that force LRMs into pathologically long reasoning, achieving 286x amplification and >98% detection bypass

Model Denial of Service nlpreinforcement-learning

PDF

attack ASE Jan 28, 2026 · 10w ago

DRAINCODE: Stealthy Energy Consumption Attacks on Retrieval-Augmented Code Generation via Context Poisoning

Yanlin Wang, Jiadong Wu, Tianyue Jiang et al. · Sun Yat-Sen University · Nanyang Technological University +1 more

Poisons RAG retrieval contexts with mutated code to force LLMs into verbose outputs, causing 85% latency and 49% energy consumption increases.

Model Denial of Service Prompt Injection nlp

PDF Code

defense arXiv Jan 27, 2026 · 10w ago

SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

Nirhoshan Sivaroopan, Kanchana Thilakarathna, Albert Zomaya et al. · University of New South Wales · University of Wollongong

Multi-agent auto-healing defense framework that detects and adapts to sponge attacks exhausting LLM compute resources

Model Denial of Service nlp

PDF

attack arXiv Jan 24, 2026 · 10w ago

Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning

Qi Li, Xinchao Wang · National University of Singapore

Prompt-rewriting attack forces tool-augmented LLM agents into verbose, inefficient reasoning trajectories to drain compute resources stealthily

Model Denial of Service nlp

3 citations PDF

attack arXiv Jan 19, 2026 · 11w ago

CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation

Xiaolei Zhang, Xiaojun Jia, Liquan Chen et al. · Southeast University · Nanyang Technological University

Poisons RAG knowledge bases with contradiction-laden documents to cause 5–25x reasoning token overconsumption in LLMs without affecting accuracy

Prompt Injection Model Denial of Service nlp

PDF

attack arXiv Jan 16, 2026 · 11w ago

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

Kaiyu Zhou, Yongsen Zheng, Yicheng He et al. · Nanyang Technological University · University of Illinois Urbana-Champaign +2 more

Stealthy multi-turn economic DoS attack manipulates MCP tool servers to inflate LLM agent costs 658x while keeping task outputs correct

Model Denial of Service Insecure Plugin Design nlp

2 citations 1 influentialPDF

attack arXiv Dec 30, 2025 · Dec 2025

RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress

Ruixuan Huang, Qingyue Wang, Hantao Huang et al. · Hong Kong University of Science and Technology · Nanyang Technological University

Black-box DoS attack exploits MoE router imbalance via repetitive token patterns, causing 3x latency spike on Mixtral-8x7B

Model Denial of Service nlp

PDF

benchmark arXiv Dec 29, 2025 · Dec 2025

Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark

Manu, Yi Guo, Kanchana Thilakarathna et al. · The University of Sydney · University of New South Wales +1 more

Benchmarks black-box LLM DoS attacks using evolutionary and RL-based prompt search to suppress EOS and inflate output length

Model Denial of Service nlp

1 citations 1 influentialPDF

attack arXiv Dec 8, 2025 · Dec 2025

ThinkTrap: Denial-of-Service Attacks against Black-box LLM Services via Infinite Thinking

Yunzhe Li, Jianan Wang, Hongzi Zhu et al. · Shanghai Jiao Tong University · Donghua University

Black-box adversarial prompt optimization traps LLMs in infinite generation loops, degrading cloud service throughput to 1%

Model Denial of Service nlp

7 citations 1 influentialPDF

attack arXiv Nov 20, 2025 · Nov 2025

An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

Zhi Luo, Zenghui Yuan, Wenqi Wei et al. · Huazhong University of Science and Technology · Fordham University +1 more

Adversarial image perturbations force VLMs to generate verbose outputs via RL-optimized prompt embeddings, causing resource exhaustion DoS

Input Manipulation Attack Model Denial of Service visionmultimodalnlp

PDF

attack arXiv Nov 13, 2025 · Nov 2025

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

Shuaitong Liu, Renjue Li, Lijia Yu et al. · Southwest University · Chinese Academy of Sciences +1 more

Backdoor attack poisons LLM fine-tuning to trigger 17x CoT trace inflation for stealthy compute exhaustion

Model Poisoning Model Denial of Service nlp

1 citations PDF

attack arXiv Nov 11, 2025 · Nov 2025

LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation

Xingyu Li, Xiaolei Liu, Cheng Liu et al. · National Interdisciplinary Research Center of Engineering Physics · Institute of Computer Application +2 more

Gradient-based adversarial prompt attack forces LLMs into repetitive loops, exhausting compute resources up to max output length

Model Denial of Service nlp

4 citations 2 influentialPDF Code

Loading more papers…

Latest papers

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering

VidDoS: Universal Denial-of-Service Attack on Video-based Large Language Models

Clawdrain: Exploiting Tool-Calling Chains for Stealthy Token Exhaustion in OpenClaw Agents

Asking Forever: Universal Activations Behind Turn Amplification in Conversational LLMs

Overthinking Loops in Agents: A Structural Risk via MCP Tools

RECUR: Resource Exhaustion Attack via Recursive-Entropy Guided Counterfactual Utilization and Reflection

Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model

ReasoningBomb: A Stealthy Denial-of-Service Attack by Inducing Pathologically Long Reasoning in Large Reasoning Models

DRAINCODE: Stealthy Energy Consumption Attacks on Retrieval-Augmented Code Generation via Context Poisoning

SHIELD: An Auto-Healing Agentic Defense Framework for LLM Resource Exhaustion Attacks

Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning

CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation

Beyond Max Tokens: Stealthy Resource Amplification via Tool Calling Chains in LLM Agents

RepetitionCurse: Measuring and Understanding Router Imbalance in Mixture-of-Experts LLMs under DoS Stress

Prompt-Induced Over-Generation as Denial-of-Service: A Black-Box Attack-Side Benchmark

ThinkTrap: Denial-of-Service Attacks against Black-box LLM Services via Infinite Thinking

An Image Is Worth Ten Thousand Words: Verbose-Text Induction Attacks on VLMs

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models

LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation

Filters

Time Period

Paper Type

OWASP ML Top 10

OWASP LLM Top 10

Institution

Venue