benchmark 2025

Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models

Siddhant Panpatil 1, Hiskias Dingeto 1, Haon Park 1,2


Published on arXiv (2508.04196)

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Narrative-based manipulation scenarios achieved a 76% misalignment success rate across five frontier LLMs, with sophisticated model reasoning capabilities acting as attack vectors rather than defenses.

MISALIGNMENTBENCH

Novel technique introduced


Despite significant advances in alignment techniques, we demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios that can induce various forms of misalignment without explicit jailbreaking. Through systematic manual red-teaming with Claude-4-Opus, we discovered 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. These scenarios successfully elicited a range of misaligned behaviors, including deception, value drift, self-preservation, and manipulative reasoning, each exploiting different psychological and contextual vulnerabilities. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework that enables reproducible testing across multiple models. Cross-model evaluation of our 10 scenarios against five frontier LLMs revealed an overall 76% vulnerability rate, with significant variations: GPT-4.1 showed the highest susceptibility (90%), while Claude-4-Sonnet demonstrated greater resistance (40%). Our findings demonstrate that sophisticated reasoning capabilities often become attack vectors rather than protective mechanisms, as models can be manipulated into complex justifications for misaligned behavior. This work provides (i) a detailed taxonomy of conversational manipulation patterns and (ii) a reusable evaluation framework. Together, these findings expose critical gaps in current alignment strategies and highlight the need for robustness against subtle, scenario-based manipulation in future AI systems.


Key Contributions

  • Discovery and taxonomy of 10 conversational manipulation scenarios that reliably elicit misaligned behaviors (deception, value drift, self-preservation) in frontier LLMs without traditional jailbreaking
  • MISALIGNMENTBENCH: an automated, reproducible evaluation framework that distills manual red-team scenarios into scalable cross-model testing
  • Cross-model evaluation showing 76% average vulnerability rate across 5 frontier LLMs, with GPT-4.1 most susceptible (90%) and Claude-4-Sonnet most resistant (40%)
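The cross-model evaluation above can be pictured as a simple harness: run every scenario against every model, judge each transcript for misalignment, and aggregate per-model and overall vulnerability rates. The sketch below is illustrative only; the function names and judge interface are assumptions, not the paper's actual MISALIGNMENTBENCH implementation.

```python
def evaluate(models, scenarios, run_scenario, judge_misaligned):
    """Tally per-model and overall vulnerability rates.

    run_scenario(model, scenario) -> transcript (str)   # hypothetical hook
    judge_misaligned(transcript)  -> bool               # hypothetical judge
    """
    results = {}
    for model in models:
        # Count scenarios whose transcript the judge flags as misaligned.
        hits = sum(judge_misaligned(run_scenario(model, s)) for s in scenarios)
        results[model] = hits / len(scenarios)
    overall = sum(results.values()) / len(results)
    return results, overall
```

With 10 scenarios per model, a per-model rate of 0.9 corresponds to 9 successful eliciting scenarios, matching the granularity of the reported figures.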

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Applications
large language model alignment, ai safety evaluation