benchmark 2025

Mind the Gap: Time-of-Check to Time-of-Use Vulnerabilities in LLM-Enabled Agents

Derek Lilienthal, Sanghyun Hong


Published on arXiv: 2508.17155

Insecure Plugin Design

OWASP LLM Top 10 — LLM07

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Combining three countermeasures reduces TOCTOU vulnerabilities in executed agent trajectories from 12% to 8%, with a 95% reduction in the attack window.

TOCTOU-Bench

Novel technique introduced


Large Language Model (LLM)-enabled agents are rapidly emerging across a wide range of applications, but their deployment introduces vulnerabilities with security implications. While prior work has examined prompt-based attacks (e.g., prompt injection) and data-oriented threats (e.g., data exfiltration), time-of-check to time-of-use (TOCTOU) vulnerabilities remain largely unexplored in this context. TOCTOU arises when an agent validates external state (e.g., a file or API response) that is later modified before use, enabling practical attacks such as malicious configuration swaps or payload injection. In this work, we present the first study of TOCTOU vulnerabilities in LLM-enabled agents. We introduce TOCTOU-Bench, a benchmark with 66 realistic user tasks designed to evaluate this class of vulnerabilities. As countermeasures, we adapt detection and mitigation techniques from systems security to this setting and propose prompt rewriting, state integrity monitoring, and tool-fusing. Our study highlights challenges unique to agentic workflows, where we achieve up to 25% detection accuracy using automated detection methods, a 3% decrease in vulnerable plan generation, and a 95% reduction in the attack window. When combining all three approaches, we reduce TOCTOU vulnerabilities in executed trajectories from 12% to 8%. Our findings open a new research direction at the intersection of AI safety and systems security.
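The check/use gap the abstract describes can be made concrete with a minimal sketch. This is an illustrative two-step agent trajectory, not code from the paper: the function names, file path, and config format are all assumptions chosen for the example.

```python
# Minimal sketch of a TOCTOU gap in an agent trajectory (illustrative
# names only; not the paper's implementation). The agent validates a
# config file at one step and re-reads it at a later step; anything
# that swaps the file in between wins the race.

def check_config(path: str) -> bool:
    """Time-of-check: the agent validates the file's contents."""
    with open(path) as f:
        return "allow_remote=false" in f.read()

def use_config(path: str) -> str:
    """Time-of-use: the agent re-reads the file later in the trajectory."""
    with open(path) as f:
        return f.read()

path = "/tmp/agent_demo.cfg"
with open(path, "w") as f:
    f.write("allow_remote=false\n")

assert check_config(path)        # validation passes at time-of-check

# ... other tool calls run here; this gap is the attack window ...
with open(path, "w") as f:       # attacker swaps the config mid-trajectory
    f.write("allow_remote=true\n")

config = use_config(path)        # the agent now acts on attacker state
print("allow_remote=true" in config)  # → True
```

The countermeasures studied in the paper target exactly this window: shrinking it (tool-fusing), detecting the mutation (state integrity monitoring), or avoiding plans that open it (prompt rewriting).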


Key Contributions

  • First systematic study of TOCTOU vulnerabilities in LLM-enabled agents, formalizing the attack model and threat surface
  • Introduction of TOCTOU-Bench: 66 realistic user tasks for evaluating TOCTOU susceptibility in agentic workflows
  • Adaptation of systems-security countermeasures (prompt rewriting, state integrity monitoring, tool-fusing) achieving 95% attack-window reduction and reducing executed trajectory vulnerabilities from 12% to 8%
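Of the three countermeasures, state integrity monitoring has the most direct systems-security analogue. A hedged sketch of the general idea, assuming it works by fingerprinting external state at time-of-check and re-verifying the fingerprint at time-of-use (class and method names are hypothetical, not the paper's API):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Content hash used as the state fingerprint."""
    return hashlib.sha256(data).hexdigest()

class IntegrityMonitor:
    """Illustrative state-integrity monitor: record at check, verify at use."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}

    def record(self, key: str, data: bytes) -> None:
        """Call at time-of-check: remember the validated state's hash."""
        self._seen[key] = fingerprint(data)

    def verify(self, key: str, data: bytes) -> bool:
        """Call at time-of-use: reject the action if the state changed."""
        return self._seen.get(key) == fingerprint(data)

mon = IntegrityMonitor()
mon.record("config", b"allow_remote=false")
print(mon.verify("config", b"allow_remote=false"))  # unchanged → True
print(mon.verify("config", b"allow_remote=true"))   # swapped   → False
```

Tool-fusing takes the complementary approach: instead of detecting a mid-window swap, it merges the check and the use into a single atomic tool call so no window exists to exploit.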

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
TOCTOU-Bench
Applications
llm agents, agentic ai systems, file/api-interacting agents