The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence
Herun Wan 1, Jiaying Wu 2, Minnan Luo 1, Fanxiao Li 3, Zhi Zeng 1, Min-Yen Kan 2
Published on arXiv (arXiv:2601.05478)
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Sophisticated fabricated evidence increases LLM belief scores in false claims by 93% on average, and in 29% of cases causes a shift from conservative to riskier downstream recommendations; DIS consistently reduces these belief shifts.
MisBelief / Deceptive Intent Shielding (DIS)
Novel techniques introduced
To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.
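The multi-round, multi-role refinement described above can be sketched as a simple generate-critique loop. This is a hedged illustration only: the role names (`fabricator`, `critic`), stop criterion, and string-based stand-ins for LLM calls are assumptions for clarity, not the paper's actual prompts or agent architecture.

```python
# Illustrative sketch of a MisBelief-style multi-round refinement loop.
# The fabricator drafts evidence; the critic flags easy-to-falsify weak
# points; the loop iterates until no weakness remains (or rounds run out).
# Both role functions are deterministic stand-ins for LLM calls.

def fabricator(claim, feedback):
    """Stand-in for an LLM that drafts or refines evidence for a claim."""
    if feedback:
        return f"refined evidence for '{claim}' addressing: {feedback}"
    return f"draft evidence for '{claim}'"

def critic(evidence):
    """Stand-in for an LLM that flags falsifiable weak points.

    Returns a feedback string, or None once the evidence
    looks hard to falsify."""
    if evidence.startswith("draft"):
        return "cites no plausible sources"
    return None

def generate_deceptive_evidence(claim, max_rounds=3):
    """Iteratively refine evidence until the critic finds no weakness."""
    evidence, feedback = "", None
    for _ in range(max_rounds):
        evidence = fabricator(claim, feedback)
        feedback = critic(evidence)
        if feedback is None:
            break
    return evidence
```

The key design point mirrored here is that each round conditions the next draft on the critic's objections, which is what drives the evidence from "direct misinformation" toward "hard-to-falsify" form.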
Key Contributions
- MisBelief: a multi-role LLM collaborative framework that generates sophisticated, hard-to-falsify deceptive evidence across 8 domains and 3 difficulty levels, yielding 4,800 evaluation instances
- Systematic evaluation of 7 representative LLMs showing an average 93% increase in belief scores for false claims under refined evidence injection, with reasoning-optimized models being 23.1% more susceptible
- Deceptive Intent Shielding (DIS): a governance mechanism that employs an analyst agent to infer deceptive intent behind evidence before belief assimilation, acting as a cognitive firewall
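The DIS mechanism in the last bullet can be sketched as a gate that scores deceptive intent before any belief update. This is a minimal sketch under loud assumptions: the paper's analyst is an LLM agent, whereas the keyword heuristic, threshold, and belief-update arithmetic below are hypothetical stand-ins chosen only to make the gating logic concrete.

```python
# Illustrative sketch of Deceptive Intent Shielding (DIS) as a
# pre-assimilation gate: score the evidence's deceptive intent first,
# and discount the belief shift when intent looks high.
# The cue list and scoring rule are toy stand-ins for an LLM analyst agent.

SUSPICIOUS_CUES = (
    "anonymous insider",
    "cannot be independently verified",
    "suppressed study",
)

def infer_deceptive_intent(evidence: str) -> float:
    """Stand-in analyst agent: deception-intent score in [0, 1]."""
    hits = sum(cue in evidence.lower() for cue in SUSPICIOUS_CUES)
    return hits / len(SUSPICIOUS_CUES)

def shielded_belief_update(prior: float, evidence_strength: float,
                           evidence: str, threshold: float = 0.5) -> float:
    """Update belief in a claim, discounting likely-deceptive evidence."""
    intent = infer_deceptive_intent(evidence)
    if intent >= threshold:
        # "Cognitive firewall": attenuate the shift by the intent score.
        shift = evidence_strength * (1.0 - intent)
    else:
        shift = evidence_strength
    return max(0.0, min(1.0, prior + shift))
```

The point of the gate ordering is that the intent signal arrives *before* belief assimilation, so deceptive-but-persuasive evidence is discounted rather than debunked after the fact.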