RADAR: Retrieval-Augmented Detector with Adversarial Refinement for Robust Fake News Detection
Song-Duo Ma, Yi-Hung Liu, Hsin-Yu Lin, Pin-Yu Chen, Hong-Yan Huang, Shau-Yung Hsu, Yun-Nung Chen
Published on arXiv (2601.03981)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
RADAR achieves 86.98% ROC-AUC on a fake news detection benchmark, significantly outperforming general-purpose LLMs with retrieval augmentation.
RADAR (Verbal Adversarial Feedback)
Novel technique introduced
To efficiently combat the spread of LLM-generated misinformation, we present RADAR, a retrieval-augmented detector with adversarial refinement for robust fake news detection. Our approach employs a generator that rewrites real articles with factual perturbations, paired with a lightweight detector that verifies claims using dense passage retrieval. To enable effective co-evolution, we introduce verbal adversarial feedback (VAF). Rather than relying on scalar rewards, VAF issues structured natural-language critiques; these guide the generator toward more sophisticated evasion attempts, compelling the detector to adapt and improve. On a fake news detection benchmark, RADAR achieves 86.98% ROC-AUC, significantly outperforming general-purpose LLMs with retrieval. Ablation studies confirm that detector-side retrieval yields the largest gains, while VAF and few-shot demonstrations provide critical signals for robust training.
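The co-evolution described above can be sketched as a loop in which the generator and detector take turns, with the detector's natural-language critique (rather than a scalar reward) conditioning the generator's next attempt. The following is a minimal toy sketch, not the paper's implementation: `generate`, `detect`, and `vaf_loop` are hypothetical stand-ins (string perturbation and substring matching in place of LLM rewriting and dense passage retrieval) that illustrate only the control flow of verbal adversarial feedback.

```python
def generate(article: str, critique: str = "") -> str:
    # Stand-in for the adversarial LLM generator: rewrite a real article
    # with a factual perturbation, conditioned on the last critique.
    note = f" [adapting to critique: {critique}]" if critique else ""
    return article.replace("2024", "2019") + note

def detect(article: str, evidence: list[str]) -> tuple[float, str]:
    # Stand-in for the retrieval-augmented detector: count claims the
    # retrieved evidence does not support, then verbalise the finding
    # instead of emitting only a scalar reward.
    unsupported = [e for e in evidence if e not in article]
    score = len(unsupported) / max(len(evidence), 1)  # fake-probability proxy
    critique = f"flagged {len(unsupported)} unsupported claim(s)"
    return score, critique

def vaf_loop(real_article: str, evidence: list[str], rounds: int = 3):
    critique, history = "", []
    for _ in range(rounds):
        fake = generate(real_article, critique)   # generator move
        score, critique = detect(fake, evidence)  # detector move + VAF
        history.append((fake, score, critique))
    return history

history = vaf_loop("The summit took place in 2024.", ["2024"])
```

In the real framework both roles are LLMs and the critique is structured, but the turn-taking shape is the same: each round yields a harder fake and a fresh verbal signal for the next round.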
Key Contributions
- RADAR framework: a retrieval-augmented detector paired with an adversarial LLM generator for co-evolutionary training on fake news detection
- Verbal Adversarial Feedback (VAF): structured natural-language critiques that replace scalar rewards to guide the generator toward more sophisticated evasion, compelling detector improvement
- Demonstrates that detector-side dense passage retrieval yields the largest performance gains, achieving 86.98% ROC-AUC on a fake news benchmark
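Since the headline metric is ROC-AUC, it helps to recall what the 86.98% figure means: the probability that the detector scores a randomly chosen fake article above a randomly chosen real one. A minimal, dependency-free computation (illustrative only, not the paper's evaluation code):

```python
def roc_auc(labels: list[int], scores: list[float]) -> float:
    # ROC-AUC via the Mann-Whitney U statistic: the probability that a
    # random positive (fake, label 1) is scored above a random negative
    # (real, label 0), counting ties as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation of fakes from reals gives 1.0; chance level is 0.5.
print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # -> 1.0
```

An AUC of 0.8698 therefore means that in roughly 87 of 100 random fake/real pairs, the fake article receives the higher fake-probability score.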
🛡️ Threat Analysis
The paper's primary contribution is defensive: detecting LLM-generated misinformation, a form of AI-generated text detection. Its novel adversarial training framework, Verbal Adversarial Feedback, hardens the detector against increasingly sophisticated LLM-generated content, directly targeting output integrity and content authenticity.