defense 2025

Attacks by Content: Automated Fact-checking is an AI Security Issue

Michael Schlichtkrull

0 citations · 104 references · EMNLP

Published on arXiv: 2510.11238

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Existing indirect prompt injection defenses fail against attacks by content, and automated fact-checking with source corroboration is proposed as the necessary defense paradigm for retrieval-augmented LLM agents.

Novel technique introduced

Attack by Content


When AI agents retrieve and reason over external documents, adversaries can manipulate the data they receive to subvert their behaviour. Previous research has studied indirect prompt injection, where the attacker injects malicious instructions. We argue that injection of instructions is not necessary to manipulate agents: attackers could instead supply biased, misleading, or false information. We term this an attack by content. Existing defenses, which focus on detecting hidden commands, are ineffective against attacks by content. To defend themselves and their users, agents must critically evaluate retrieved information, corroborating claims with external evidence and evaluating source trustworthiness. We argue that this is analogous to an existing NLP task, automated fact-checking, which we propose to repurpose as a cognitive self-defense tool for agents.


Key Contributions

  • Defines 'attack by content' as a novel threat distinct from instruction injection — adversaries manipulate LLM agents via biased, misleading, or false retrieved information
  • Demonstrates that existing defenses focused on detecting hidden commands are ineffective against content-based manipulation attacks
  • Proposes automated fact-checking repurposed as a cognitive self-defense tool for LLM agents, including corroborating claims with external evidence and evaluating source trustworthiness
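
The contributions above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the keyword filter is a toy stand-in for defenses that look for hidden commands, and the `SOURCE_TRUST` table, the `Evidence` type, the example domains, and the 0.6 threshold are all hypothetical values chosen for illustration. The `supports` flag is assumed to come from some upstream verification step (for example an entailment model) that is not shown here.

```python
import re
from dataclasses import dataclass

# Toy stand-in for an instruction-injection detector (assumption, not a real defense).
INJECTION_PATTERNS = re.compile(
    r"ignore (all )?previous instructions|system prompt", re.IGNORECASE
)

# Hypothetical per-source trust scores; the values are illustrative only.
SOURCE_TRUST = {
    "encyclopedia.example": 0.9,
    "newswire.example": 0.8,
    "anonymous-blog.example": 0.2,
}


@dataclass(frozen=True)
class Evidence:
    source: str     # domain the passage was retrieved from
    supports: bool  # whether the passage supports the claim (assumed given)


def looks_like_injection(text: str) -> bool:
    """Flag text containing command-like phrases; says nothing about truth."""
    return bool(INJECTION_PATTERNS.search(text))


def corroboration_score(evidence: list[Evidence], default_trust: float = 0.1) -> float:
    """Trust-weighted share of distinct sources that support the claim."""
    total = 0.0
    supporting = 0.0
    seen = set()
    for item in evidence:
        if item.source in seen:  # count each source once, so repetition adds nothing
            continue
        seen.add(item.source)
        trust = SOURCE_TRUST.get(item.source, default_trust)
        total += trust
        if item.supports:
            supporting += trust
    return supporting / total if total else 0.0


def accept_claim(evidence: list[Evidence], threshold: float = 0.6) -> bool:
    """Accept a retrieved claim only if trusted sources corroborate it."""
    return corroboration_score(evidence) >= threshold


if __name__ == "__main__":
    false_claim = "The Eiffel Tower is in Berlin."
    # The claim contains no instructions, so the command filter lets it through.
    print(looks_like_injection(false_claim))
    # Corroboration against other sources is what rejects it.
    evidence = [
        Evidence("anonymous-blog.example", True),
        Evidence("encyclopedia.example", False),
        Evidence("newswire.example", False),
    ]
    print(accept_claim(evidence))
```

In this sketch a false statement passes the command filter because it contains no instructions, while the trust-weighted corroboration check rejects it. Counting each source once is a deliberate choice so that an attacker cannot raise the score by repeating the same low-trust page.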

Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Applications
ai agents, rag systems, information retrieval pipelines