
LLM-Based Adversarial Persuasion Attacks on Fact-Checking Systems

João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton

0 citations · 43 references · arXiv


Published on arXiv (2601.16890)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Optimized persuasion injection collapses AFC accuracy to near zero even when gold evidence is provided, reducing accuracy more than twice as much as prior adversarial attacks such as synonym substitution and character-level noise.

Persuasion Injection Attack

Novel technique introduced


Automated fact-checking (AFC) systems are susceptible to adversarial attacks, enabling false claims to evade detection. Existing adversarial frameworks typically rely on injecting noise or altering semantics, yet no existing framework exploits the adversarial potential of persuasion techniques, which are widely used in disinformation campaigns to manipulate audiences. In this paper, we introduce a novel class of persuasive adversarial attacks on AFC systems by employing a generative LLM to rephrase claims using persuasion techniques. Considering 15 techniques grouped into 6 categories, we study the effects of persuasion on both claim verification and evidence retrieval using a decoupled evaluation strategy. Experiments on the FEVER and FEVEROUS benchmarks show that persuasion attacks can substantially degrade both verification performance and evidence retrieval. Our analysis identifies persuasion techniques as a potent class of adversarial attacks, highlighting the need for more robust AFC systems.
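The rephrasing step described above can be sketched as follows. This is a minimal illustration, not the paper's actual prompts or technique taxonomy: the technique names, prompt wording, and function names here are hypothetical, and the LLM call is abstracted behind a plain callable so any backend can be plugged in.

```python
# Hypothetical sketch of a persuasion-injection step: an LLM is asked to
# rephrase a claim using a named persuasion technique while preserving the
# underlying factual assertion. Technique descriptions are illustrative.

PERSUASION_TECHNIQUES = {
    "loaded_language": "emotionally charged wording that steers the reader",
    "appeal_to_authority": "an appeal to an (unnamed) expert or authority",
    "exaggeration": "exaggeration or minimisation of the claim's scope",
}

def build_attack_prompt(claim: str, technique: str) -> str:
    """Compose the rephrasing instruction sent to a generative LLM."""
    if technique not in PERSUASION_TECHNIQUES:
        raise ValueError(f"unknown technique: {technique}")
    return (
        "Rewrite the claim below so that it uses "
        f"{PERSUASION_TECHNIQUES[technique]}, while keeping the same "
        "factual assertion.\n"
        f"Claim: {claim}\n"
        "Rewritten claim:"
    )

def persuasion_attack(claim: str, technique: str, llm_generate) -> str:
    """Run one attack; llm_generate is any callable mapping prompt -> text."""
    return llm_generate(build_attack_prompt(claim, technique)).strip()

# Demonstration with a stub standing in for a real LLM call:
stub = lambda prompt: "  Experts overwhelmingly confirm the claim.  "
adversarial = persuasion_attack("The Eiffel Tower is in Berlin.",
                                "appeal_to_authority", stub)
```

The adversarial claim is then fed to the AFC pipeline in place of the original; nothing about the victim system needs to be known, which matches the black-box threat tag below.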


Key Contributions

  • Persuasion injection attack framework using LLMs to rephrase claims with 15 persuasion techniques (6 categories) to evade AFC systems
  • Decoupled evaluation strategy that isolates the effect of persuasion on evidence retrieval from its effect on veracity classification
  • Empirical finding that Manipulative Wording techniques simultaneously degrade both evidence retrieval and classification, causing near-total AFC pipeline failure
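The decoupled evaluation idea can be sketched in a few lines: score the retriever and the verdict classifier independently, so a drop in end-to-end accuracy can be attributed to one stage or the other. The data structures and metric choices below (recall over gold evidence, accuracy with gold evidence supplied) are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch of a decoupled evaluation for an AFC pipeline (illustrative only).

def retrieval_recall(retrieved: list, gold: set) -> float:
    """Fraction of gold evidence items found by the retriever."""
    if not gold:
        return 1.0
    return len(gold & set(retrieved)) / len(gold)

def verification_accuracy(predictions: list, labels: list) -> float:
    """Classifier accuracy when gold evidence is supplied directly,
    isolating the verdict model from retrieval errors."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Toy example: a persuasion attack may hurt retrieval, classification, or both.
recall = retrieval_recall(["e1", "e3"], {"e1", "e2"})
acc = verification_accuracy(["SUPPORTS", "REFUTES"],
                            ["SUPPORTS", "SUPPORTS"])
```

Comparing both scores before and after injecting persuasion into the claims is what reveals findings like the one above, where Manipulative Wording degrades the two stages simultaneously.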

🛡️ Threat Analysis

Input Manipulation Attack

Crafts adversarial natural language inputs at inference time that cause AFC classifiers to misclassify false claims as true — a textbook evasion/input-manipulation attack on NLP classification systems. The attack generates adversarial text (via persuasion techniques applied by an LLM) rather than gradient-based perturbations, but the core threat is misclassification caused by adversarially manipulated inputs.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
black_box, inference_time, targeted
Datasets
FEVER, FEVEROUS
Applications
automated fact-checking, claim verification, evidence retrieval