Attacks by Content: Automated Fact-checking is an AI Security Issue
Published on arXiv (arXiv:2510.11238)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Existing indirect prompt injection defenses fail against attacks by content, and automated fact-checking with source corroboration is proposed as the necessary defense paradigm for retrieval-augmented LLM agents.
Attack by Content
Novel technique introduced
When AI agents retrieve and reason over external documents, adversaries can manipulate the data they receive to subvert their behaviour. Previous research has studied indirect prompt injection, where the attacker injects malicious instructions. We argue that injection of instructions is not necessary to manipulate agents: attackers could instead supply biased, misleading, or false information. We term this an attack by content. Existing defenses, which focus on detecting hidden commands, are ineffective against attacks by content. To defend themselves and their users, agents must critically evaluate retrieved information, corroborating claims with external evidence and evaluating source trustworthiness. We argue that this is analogous to an existing NLP task, automated fact-checking, which we propose to repurpose as a cognitive self-defense tool for agents.
Key Contributions
- Defines 'attack by content' as a novel threat distinct from instruction injection — adversaries manipulate LLM agents via biased, misleading, or false retrieved information
- Demonstrates that existing defenses focused on detecting hidden commands are ineffective against content-based manipulation attacks
- Proposes automated fact-checking repurposed as a cognitive self-defense tool for LLM agents, including corroborating claims with external evidence and evaluating source trustworthiness
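The proposed defense, corroborating a claim against multiple sources weighted by their trustworthiness, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Evidence` type, the trust scores, and the verdict thresholds are all assumptions, standing in for an upstream retrieval and source-assessment step.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str     # e.g. a domain name (illustrative)
    supports: bool  # whether this evidence supports the claim
    trust: float    # source trustworthiness in [0, 1], assumed given upstream

def corroborate(evidence: list[Evidence]) -> str:
    """Aggregate trust-weighted evidence for one claim into a verdict."""
    if not evidence:
        return "unverified"  # no independent evidence retrieved
    support = sum(e.trust for e in evidence if e.supports)
    refute = sum(e.trust for e in evidence if not e.supports)
    total = support + refute
    if total == 0:
        return "unverified"
    score = support / total
    # Thresholds are arbitrary choices for this sketch.
    if score > 0.6:
        return "supported"
    if score < 0.4:
        return "refuted"
    return "contested"
```

An agent would run such a check on factual claims extracted from retrieved documents before acting on them, so that a single manipulated source with low trust cannot dominate the verdict.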