defense 2025

Soft Instruction De-escalation Defense

Nils Philipp Walter 1, Chawin Sitawarin 2, Jamie Hayes 2, David Stutz 3, Ilia Shumailov 2

2 citations · 36 references · arXiv

α

Published on arXiv

2510.21057

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Against a strong adversary embedding non-imperative workflows, SIC limits attack success rate to 15%, substantially raising the bar over undefended baselines.

SIC (Soft Instruction Control)

Novel technique introduced


Large Language Models (LLMs) are increasingly deployed in agentic systems that interact with an external environment; this makes them susceptible to prompt injections when dealing with untrusted data. To overcome this limitation, we propose SIC (Soft Instruction Control)-a simple yet effective iterative prompt sanitization loop designed for tool-augmented LLM agents. Our method repeatedly inspects incoming data for instructions that could compromise agent behavior. If such content is found, the malicious content is rewritten, masked, or removed, and the result is re-evaluated. The process continues until the input is clean or a maximum iteration limit is reached; if imperative instruction-like content remains, the agent halts to ensure security. By allowing multiple passes, our approach acknowledges that individual rewrites may fail but enables the system to catch and correct missed injections in later steps. Although immediately useful, worst-case analysis shows that SIC is not infallible; strong adversary can still get a 15% ASR by embedding non-imperative workflows. This nonetheless raises the bar.


Key Contributions

  • SIC (Soft Instruction Control): an iterative prompt sanitization loop that repeatedly inspects and rewrites/masks/removes injected instructions from external data before agent processing
  • Multi-pass design that compensates for individual rewrite failures by re-evaluating until clean or halting if malicious content persists
  • Worst-case adversarial analysis showing that even strong adversaries using non-imperative workflow embedding are limited to 15% ASR

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_timeblack_box
Applications
llm agentstool-augmented llm systems