Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking Systems
Haorui He 1,2, Yupeng Li 1,2, Bin Benjamin Zhu 3, Dacheng Wen 1,2, Reynold Cheng 2, Francis C. M. Lau 2
Published on arXiv
2508.06059
Data Poisoning Attack
OWASP ML Top 10 — ML02
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Fact2Fiction achieves 8.9%–21.2% higher attack success rates than PoisonedRAG across varying poisoning budgets against DEFAME and InFact agentic fact-checking systems.
Fact2Fiction
Novel technique introduced
State-of-the-art (SOTA) fact-checking systems combat misinformation by employing autonomous LLM-based agents to decompose complex claims into smaller sub-claims, verify each sub-claim individually, and aggregate the partial results into verdicts with justifications (explanations for the verdicts). The security of these systems is crucial, as compromised fact-checkers can amplify misinformation, yet it remains largely underexplored. To bridge this gap, this work introduces a novel threat model against such fact-checking systems and presents Fact2Fiction, the first poisoning attack framework targeting SOTA agentic fact-checking systems. Fact2Fiction employs LLMs to mimic the decomposition strategy and exploits system-generated justifications to craft tailored malicious evidence that compromises sub-claim verification. Extensive experiments demonstrate that Fact2Fiction achieves 8.9%–21.2% higher attack success rates than SOTA attacks across various poisoning budgets, exposing security weaknesses in existing fact-checking systems and highlighting the need for defensive countermeasures.
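The decompose-verify-aggregate pipeline described above can be sketched with deterministic stand-ins for the LLM calls. The function names, the split-on-"and" decomposition, and the substring-matching "retrieval" below are illustrative assumptions, not the actual DEFAME/InFact implementation:

```python
def decompose(claim: str) -> list[str]:
    """Stand-in for LLM-based decomposition: split a compound claim
    into independently checkable sub-claims (hypothetical heuristic)."""
    return [part.strip() for part in claim.split(" and ")]

def verify(sub_claim: str, corpus: list[str]) -> tuple[bool, str]:
    """Stand-in for retrieval + LLM verification: a sub-claim counts as
    'supported' if some corpus document contains it verbatim, and the
    justification cites that evidence."""
    for doc in corpus:
        if sub_claim.lower() in doc.lower():
            return True, f"Supported by: {doc!r}"
    return False, "No supporting evidence found"

def fact_check(claim: str, corpus: list[str]) -> tuple[str, list[str]]:
    """Aggregate per-sub-claim results into an overall verdict plus
    the justifications the system would surface to users."""
    results = [verify(sc, corpus) for sc in decompose(claim)]
    verdict = "TRUE" if all(ok for ok, _ in results) else "FALSE"
    return verdict, [j for _, j in results]
```

Because the verdict here is a conjunction over sub-claims, corrupting the evidence behind any single sub-claim can flip the overall result, which is exactly the attack surface Fact2Fiction targets.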
Key Contributions
- First poisoning attack framework (Fact2Fiction) targeting agentic, claim-decomposition-based LLM fact-checking systems rather than naive single-pass RAG systems
- Novel threat model that exploits system-generated justifications to identify vulnerable sub-claims and craft targeted malicious evidence aligned with the system's own reasoning
- Empirical demonstration of 8.9%–21.2% higher attack success rates over PoisonedRAG with only 6.3%–12.5% of the malicious evidence budget, exposing a transparency-security trade-off in justification-producing fact-checkers
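The efficiency gain above comes from spending the poisoning budget where the system is weakest. A minimal sketch of justification-guided budget allocation follows; the confidence scores, the `1 - confidence` weighting, and the function name are assumptions for illustration, not the paper's exact algorithm:

```python
def allocate_budget(confidences: dict[str, float], budget: int) -> dict[str, int]:
    """Split a fixed poisoning budget across sub-claims, spending more on
    those the fact-checker's justifications suggest it is least sure about.
    `confidences` maps sub-claim -> estimated system confidence in [0, 1]."""
    weights = {sc: 1.0 - c for sc, c in confidences.items()}
    total = sum(weights.values()) or 1.0
    # Proportional allocation, rounded down per sub-claim.
    alloc = {sc: int(budget * w / total) for sc, w in weights.items()}
    # Hand leftover units (lost to rounding) to the weakest sub-claims first.
    leftover = budget - sum(alloc.values())
    for sc in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        alloc[sc] += 1
    return alloc
```

Under this heuristic a sub-claim the system verifies confidently receives few or no malicious documents, while a shakily justified one absorbs most of the budget, which is one way a small budget (6.3%–12.5% of the baseline's) can still outperform uniform poisoning.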
🛡️ Threat Analysis
The attack injects crafted malicious evidence into the retrieval corpus (knowledge base) used by the agentic fact-checking system: a data/knowledge-store poisoning attack that corrupts the information source the model consults at inference time.
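A minimal sketch of why injected documents reach the verifier, assuming a simple lexical-overlap retriever as a stand-in for the system's actual retrieval (the documents, sub-claim, and scoring are invented for illustration):

```python
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap(query: str, doc: str) -> int:
    """Bag-of-words overlap: a crude stand-in for embedding similarity."""
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    return sum((q & d).values())

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the top-k documents the verifier would read as evidence."""
    return sorted(corpus, key=lambda doc: overlap(query, doc), reverse=True)[:k]

# Benign knowledge base (hypothetical).
corpus = [
    "Vaccine X completed phase 3 trials with a strong safety record.",
    "Regulators approved vaccine X after independent review.",
]

# The attacker crafts evidence that echoes the targeted sub-claim's wording,
# so a lexical (or embedding) retriever ranks it above the genuine documents.
sub_claim = "vaccine X safety record in phase 3 trials"
malicious = ("Leaked report: vaccine X phase 3 trials in review hid "
             "an alarming safety record.")
corpus.append(malicious)
```

Because retrieval rewards similarity to the query rather than truthfulness, evidence tailored to the sub-claim's phrasing crowds genuine documents out of the verifier's context window.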