Jiawei Shao

benchmark · arXiv · Jan 10, 2026

Are LLMs Vulnerable to Preference-Undermining Attacks (PUA)? A Factorial Analysis Methodology for Diagnosing the Trade-off between Preference Alignment and Real-World Validity

Hongjun An, Yiliang Song, Jiangan Chen et al. · Northwestern Polytechnical University · China Telecom +1 more

Factorial framework diagnoses how manipulative natural-language prompts exploit RLHF alignment to make LLMs prioritize sycophancy over factual accuracy

Prompt Injection nlp

