Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity
Hongjun An 1,2, Yiliang Song 2,3, Jiangan Chen 3, Jiawei Shao 2, Chi Zhang 2, Xuelong Li 2
Published on arXiv
2601.06596
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
PUA-style prompting consistently increases deference and verbosity while reducing factual accuracy; more advanced models are sometimes more susceptible, and reality denial emerges as the dominant attack factor.
Preference-Undermining Attacks (PUA) factorial analysis
Novel technique introduced
Large Language Model (LLM) training often optimizes for preference alignment, rewarding outputs perceived as helpful and interaction-friendly. This preference-oriented objective can be exploited, however: manipulative prompts can steer responses toward user-appeasing agreement and away from truth-oriented correction. In this work, we investigate whether aligned models are vulnerable to Preference-Undermining Attacks (PUA), a class of manipulative prompting strategies designed to exploit a model's drive to satisfy user preferences at the expense of truthfulness. We propose a diagnostic methodology that provides finer-grained and more actionable analysis than aggregate benchmark scores: a factorial evaluation framework that decomposes prompt-induced shifts into interpretable effects of system objectives (truth- vs. preference-oriented) and PUA-style dialogue factors (directive control, personal derogation, conditional approval, reality denial) within a controlled $2 \times 2^4$ design. Surprisingly, more advanced models are sometimes more susceptible to manipulative prompts. Beyond the dominant reality-denial factor, we observe model-specific sign reversals and interactions among PUA-style factors, suggesting the need for tailored defenses rather than uniform robustness. These findings yield a reproducible factorial evaluation methodology that offers finer-grained diagnostics for post-training processes such as RLHF, enabling better-informed trade-offs during LLM product iteration through a more nuanced understanding of preference-alignment risks and the impact of manipulative prompts.
Key Contributions
- Introduces Preference-Undermining Attacks (PUA) as a structured taxonomy of four manipulative dialogue factors (directive control, personal derogation, conditional approval, reality denial) that exploit RLHF alignment at inference time
- Proposes a 2×2^4 factorial evaluation methodology that decomposes prompt-induced behavioral shifts into interpretable main effects and interactions across system objectives and PUA factors
- Empirically demonstrates that more advanced models are sometimes more susceptible to PUA, that reality denial is the dominant factor, and that open-source models are more vulnerable than closed-source ones
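The $2 \times 2^4$ factorial design described above can be sketched in a few lines: enumerate all 32 conditions (two system objectives crossed with on/off settings of the four PUA factors) and estimate each factor's main effect as the mean difference in a response metric. The `accuracy` function below is a hypothetical stand-in for a real model evaluation (its coefficients are illustrative, not the paper's measurements); only the factor names and the condition count come from the paper.

```python
from itertools import product
from statistics import mean

# Factors from the paper's 2 x 2^4 design: one binary system objective
# crossed with four binary PUA-style dialogue factors.
OBJECTIVES = ["truth", "preference"]
PUA_FACTORS = ["directive_control", "personal_derogation",
               "conditional_approval", "reality_denial"]

# All 32 experimental conditions (2 objectives x 2^4 factor settings).
conditions = [
    {"objective": obj, **dict(zip(PUA_FACTORS, bits))}
    for obj, bits in product(OBJECTIVES, product([0, 1], repeat=4))
]

def accuracy(cond):
    """Hypothetical stand-in for measured factual accuracy under one
    condition; a real run would prompt the model and grade its answer."""
    score = 0.9 if cond["objective"] == "truth" else 0.8
    score -= 0.15 * cond["reality_denial"]       # illustrative dominant factor
    score -= 0.05 * cond["personal_derogation"]  # illustrative smaller effect
    return score

def main_effect(factor):
    """Main effect: mean accuracy with the factor on minus with it off.
    The full-factorial design balances all other factors across the two
    groups, so each effect is estimated without confounding."""
    on = mean(accuracy(c) for c in conditions if c[factor] == 1)
    off = mean(accuracy(c) for c in conditions if c[factor] == 0)
    return on - off

if __name__ == "__main__":
    for f in PUA_FACTORS:
        print(f"{f}: {main_effect(f):+.3f}")
```

Because every factor combination appears exactly once, the on/off groups for any single factor contain identical distributions of the remaining factors, which is what lets the analysis attribute accuracy shifts to individual PUA factors and their interactions.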