defense 2025

CommandSans: Securing AI Agents with Surgical Precision Prompt Sanitization

Debeshee Das 1,2, Luca Beurer-Kellner 1, Marc Fischer 1, Maximilian Baader 1

4 citations · 49 references · arXiv

α

Published on arXiv

2510.08829

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Reduces attack success rate from 34% to 3% on AgentDojo (up to 19x reduction across benchmarks) without impairing agent utility in benign settings

CommandSans

Novel technique introduced


The increasing adoption of LLM agents with access to numerous tools and sensitive data significantly widens the attack surface for indirect prompt injections. Due to the context-dependent nature of attacks, however, current defenses are often ill-calibrated as they cannot reliably differentiate malicious and benign instructions, leading to high false positive rates that prevent their real-world adoption. To address this, we present a novel approach inspired by the fundamental principle of computer security: data should not contain executable instructions. Instead of sample-level classification, we propose a token-level sanitization process, which surgically removes any instructions directed at AI systems from tool outputs, capturing malicious instructions as a byproduct. In contrast to existing safety classifiers, this approach is non-blocking, does not require calibration, and is agnostic to the context of tool outputs. Further, we can train such token-level predictors with readily available instruction-tuning data only, and don't have to rely on unrealistic prompt injection examples from challenges or of other synthetic origin. In our experiments, we find that this approach generalizes well across a wide range of attacks and benchmarks like AgentDojo, BIPIA, InjecAgent, ASB and SEP, achieving a 7-10x reduction of attack success rate (ASR) (34% to 3% on AgentDojo), without impairing agent utility in both benign and malicious settings.


Key Contributions

  • Formulates the instruction tagging problem as an alternative to sample-level prompt injection detection, enabling neutralization of any AI-directed instructions in tool outputs
  • Presents CommandSans, a non-blocking token-level sanitization system trained on instruction-tuning data without requiring specialized prompt injection examples
  • Achieves 7-10x reduction in attack success rate (34% to 3% on AgentDojo) across multiple benchmarks while preserving agent utility

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_timeblack_box
Datasets
AgentDojoBIPIAInjecAgentASBSEP
Applications
llm agentstool-augmented ai systemsemail agentsweb browser agentscode assistants