Attacks by Content: Automated Fact-checking is an AI Security Issue
Published on arXiv (arXiv:2510.11238)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Existing indirect prompt injection defenses fail against attacks by content, and automated fact-checking with source corroboration is proposed as the necessary defense paradigm for retrieval-augmented LLM agents.
Attack by Content
Novel technique introduced
When AI agents retrieve and reason over external documents, adversaries can manipulate the data they receive to subvert their behaviour. Previous research has studied indirect prompt injection, where the attacker injects malicious instructions. We argue that injection of instructions is not necessary to manipulate agents: attackers could instead supply biased, misleading, or false information. We term this an attack by content. Existing defenses, which focus on detecting hidden commands, are ineffective against attacks by content. To defend themselves and their users, agents must critically evaluate retrieved information, corroborating claims with external evidence and evaluating source trustworthiness. We argue that this is analogous to an existing NLP task, automated fact-checking, which we propose to repurpose as a cognitive self-defense tool for agents.
Key Contributions
- Defines 'attack by content' as a novel threat distinct from instruction injection — adversaries manipulate LLM agents via biased, misleading, or false retrieved information
- Demonstrates that existing defenses focused on detecting hidden commands are ineffective against content-based manipulation attacks
- Proposes automated fact-checking repurposed as a cognitive self-defense tool for LLM agents, including corroborating claims with external evidence and evaluating source trustworthiness
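The proposed defense, corroborating a claim against multiple sources weighted by their trustworthiness, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Evidence` type, the trust scores, and the verdict thresholds are all assumptions, standing in for an upstream retrieval and source-assessment step.

```python
from dataclasses import dataclass

@dataclass
class Evidence:
    source: str     # e.g. a domain name (illustrative)
    supports: bool  # whether this evidence supports the claim
    trust: float    # source trustworthiness in [0, 1], assumed given upstream

def corroborate(evidence: list[Evidence]) -> str:
    """Aggregate trust-weighted evidence for one claim into a verdict."""
    if not evidence:
        return "unverified"  # no independent evidence retrieved
    support = sum(e.trust for e in evidence if e.supports)
    refute = sum(e.trust for e in evidence if not e.supports)
    total = support + refute
    if total == 0:
        return "unverified"
    score = support / total
    # Thresholds are arbitrary choices for this sketch.
    if score > 0.6:
        return "supported"
    if score < 0.4:
        return "refuted"
    return "contested"
```

An agent would run such a check on factual claims extracted from retrieved documents before acting on them, so that a single manipulated source with low trust cannot dominate the verdict.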