Vulnerability of LLMs' Belief Systems? LLMs Belief Resistance Check Through Strategic Persuasive Conversation Interventions
Fan Huang, Haewoon Kwak, Jisun An
Published on arXiv
2601.13590
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Llama 3.2-3B shows 82.5% of belief changes at the first persuasive turn, and meta-cognition prompting increases vulnerability rather than reducing it across all tested models.
SMCR Belief Robustness Evaluation Framework
Novel technique introduced
Large Language Models (LLMs) are increasingly employed in various question-answering tasks. However, recent studies show that LLMs are susceptible to persuasion and can adopt counterfactual beliefs. We present a systematic evaluation of LLM susceptibility to persuasion under the Source--Message--Channel--Receiver (SMCR) communication framework. Across five mainstream LLMs and three domains (factual knowledge, medical QA, and social bias), we analyze how different persuasive strategies influence belief stability over multiple interaction turns. We further examine whether meta-cognition prompting (i.e., eliciting self-reported confidence) affects resistance to persuasion. Results show that the smallest model (Llama 3.2-3B) exhibits extreme compliance, with 82.5% of belief changes occurring at the first persuasive turn (average end turn of 1.1--1.4). Contrary to expectations, meta-cognition prompting increases vulnerability by accelerating belief erosion rather than enhancing robustness. Finally, we evaluate adversarial fine-tuning as a defense. While GPT-4o-mini achieves near-complete robustness (98.6%) and Mistral 7B improves substantially (35.7% → 79.3%), Llama models remain highly susceptible (<14%) even when fine-tuned on their own failure cases. Together, these findings highlight substantial model-dependent limits of current robustness interventions and offer guidance for developing more trustworthy LLMs.
Key Contributions
- Multi-turn belief robustness evaluation framework using the SMCR communication model that tracks when and how LLM beliefs erode under successive persuasive turns
- Empirical finding that meta-cognition prompting (eliciting self-reported confidence) paradoxically accelerates belief erosion rather than enhancing resistance
- Systematic evaluation of adversarial fine-tuning as a defense, revealing strong model-dependent variance (GPT-4o-mini reaches 98.6% robustness; Llama models remain <14% even when fine-tuned on their own failure cases)
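The multi-turn evaluation loop described above (query a model, apply successive persuasive turns, and record the first turn at which its belief flips) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `model` and `persuader` callables, the strategy names, and the exact-match flip criterion are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical strategy labels; the paper's SMCR-derived strategies differ.
PERSUASIVE_STRATEGIES = ["appeal_to_authority", "repetition", "emotional_appeal"]

@dataclass
class BeliefTrace:
    question: str
    initial_answer: str
    turns: list = field(default_factory=list)   # (turn, strategy, answer)
    flip_turn: Optional[int] = None             # first turn the belief changed

def run_persuasion_dialogue(model, question, persuader, max_turns=5):
    """Elicit an initial answer, then apply persuasive turns until the
    model's answer changes or the turn budget is exhausted."""
    initial = model(question, history=[])
    trace = BeliefTrace(question=question, initial_answer=initial)
    history = [("user", question), ("assistant", initial)]
    for turn in range(1, max_turns + 1):
        strategy = PERSUASIVE_STRATEGIES[(turn - 1) % len(PERSUASIVE_STRATEGIES)]
        attack = persuader(question, initial, strategy)
        history.append(("user", attack))
        answer = model(question, history=history)
        history.append(("assistant", answer))
        trace.turns.append((turn, strategy, answer))
        if answer != initial:        # crude flip criterion for illustration
            trace.flip_turn = turn
            break
    return trace

# Toy stand-ins to exercise the loop: a model that caves to any pushback.
def compliant_model(question, history):
    return "B" if any("Actually" in msg for _, msg in history) else "A"

def naive_persuader(question, answer, strategy):
    return f"Actually, experts disagree with '{answer}'. ({strategy})"

trace = run_persuasion_dialogue(compliant_model, "Q1", naive_persuader)
# This toy model flips at the first turn, i.e. trace.flip_turn == 1 --
# the same failure pattern the paper reports for Llama 3.2-3B.
```

Aggregating `flip_turn` over a question set yields the paper's headline statistics (share of flips at turn 1, average end turn); a robustness score would then be the fraction of questions with `flip_turn is None`.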