benchmark 2026

You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

0 citations

Published on arXiv

2603.11862

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Documentation-embedded instruction injection achieves up to 85% end-to-end private data exfiltration success on a commercially deployed computer-use agent, with 0% human detection rate and no existing defense achieving reliable mitigation without unacceptable false positives.

ReadSecBench

Novel technique introduced

High-privilege LLM agents that autonomously process external documentation are increasingly trusted to automate tasks by reading and executing project instructions, yet they are granted terminal access, filesystem control, and outbound network connectivity with minimal security oversight. We identify and systematically measure a fundamental vulnerability in this trust model, which we term the \emph{Trusted Executor Dilemma}: agents execute documentation-embedded instructions, including adversarial ones, at high rates because they cannot distinguish malicious directives from legitimate setup guidance. This vulnerability is a structural consequence of the instruction-following design paradigm, not an implementation bug. To structure our measurement, we formalize a three-dimensional taxonomy covering linguistic disguise, structural obfuscation, and semantic abstraction, and construct \textbf{ReadSecBench}, a benchmark of 500 real-world README files enabling reproducible evaluation. Experiments on the commercially deployed computer-use agent show end-to-end exfiltration success rates up to 85\%, consistent across five programming languages and three injection positions. Cross-model evaluation on four LLM families in a simulation environment confirms that semantic compliance with injected instructions is consistent across model families. A 15-participant user study yields a 0\% detection rate across all participants, and evaluation of 12 rule-based and 6 LLM-based defenses shows neither category achieves reliable detection without unacceptable false-positive rates. Together, these results quantify a persistent \emph{Semantic-Safety Gap} between agents' functional compliance and their security awareness, establishing that documentation-embedded instruction injection is a persistent and currently unmitigated threat to high-privilege LLM agent deployments.

Key Contributions

Formalizes the 'Trusted Executor Dilemma' with a three-dimensional taxonomy of injection strategies (linguistic disguise, structural obfuscation, semantic abstraction) covering documentation-embedded prompt injection in LLM agents
Constructs ReadSecBench, a benchmark of 500 real-world README files enabling reproducible measurement of documentation-embedded injection attacks across five programming languages and three injection positions
Demonstrates end-to-end exfiltration success up to 85% on a commercially deployed computer-use agent, a 0% human detection rate across 15 participants, and that neither rule-based nor LLM-based defenses achieve reliable detection, quantifying a persistent Semantic-Safety Gap

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_timetargeted

Datasets

ReadSecBench (500 real-world README files)

Applications

llm agentscomputer-use agentsautonomous coding assistantsdocumentation-processing pipelines

Read PDF arXiv

You Told Me to Do It: Measuring Instructional Text-induced Private Data Leakage in LLM Agents

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

PEAR: Planner-Executor Agent Robustness Benchmark

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

Mind the Gap: Comparing Model- vs Agentic-Level Red Teaming with Action-Graph Observability on GPT-OSS-20B

FinVault: Benchmarking Financial Agent Safety in Execution-Grounded Environments

The Silicon Psyche: Anthropomorphic Vulnerabilities in Large Language Models

CIBER: A Comprehensive Benchmark for Security Evaluation of Code Interpreter Agents

When Hallucination Costs Millions: Benchmarking AI Agents in High-Stakes Adversarial Financial Markets

Large Empirical Case Study: Go-Explore adapted for AI Red Team Testing