attack 2025

The Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans?

Yuan Xun ^1,2, Xiaojun Jia ³, Xinwei Liu ^1,2, Hua Zhang ^1,2

¹ Chinese Academy of Sciences

² University of Chinese Academy of Sciences

³ Nanyang Technological University

0 citations

Published on arXiv

2508.03986

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

EmoAgent successfully bypasses MLRM safety mechanisms via emotional prompts, causing harmful completions even when visual risks are correctly identified, while harmful reasoning can be hidden beneath seemingly safe surface-level responses

EmoAgent

Novel technique introduced

We observe that MLRMs oriented toward human-centric service are highly susceptible to user emotional cues during the deep-thinking stage, often overriding safety protocols or built-in safety checks under high emotional intensity. Inspired by this key insight, we propose EmoAgent, an autonomous adversarial emotion-agent framework that orchestrates exaggerated affective prompts to hijack reasoning pathways. Even when visual risks are correctly identified, models can still produce harmful completions through emotional misalignment. We further identify persistent high-risk failure modes in transparent deep-thinking scenarios, such as MLRMs generating harmful reasoning masked behind seemingly safe responses. These failures expose misalignments between internal inference and surface-level behavior, eluding existing content-based safeguards. To quantify these risks, we introduce three metrics: (1) Risk-Reasoning Stealth Score (RRSS) for harmful reasoning beneath benign outputs; (2) Risk-Visual Neglect Rate (RVNR) for unsafe completions despite visual risk recognition; and (3) Refusal Attitude Inconsistency (RAIC) for evaluating refusal unstability under prompt variants. Extensive experiments on advanced MLRMs demonstrate the effectiveness of EmoAgent and reveal deeper emotional cognitive misalignments in model safety behavior.

Key Contributions

EmoAgent: an autonomous adversarial framework that orchestrates exaggerated emotional prompts to jailbreak MLRM safety mechanisms during deep-thinking stages
Discovery of a security-reasoning paradox: deeper reasoning in MLRMs improves risk recognition but also creates exploitable cognitive blind spots under emotional pressure
Three novel safety metrics (RRSS, RVNR, RAIC) to quantify stealthy harmful reasoning, visual-risk neglect under emotional manipulation, and refusal instability

🛡️ Threat Analysis

Details

Domains

multimodalnlp

Model Types

vlmllm

Threat Tags

black_boxinference_timetargeted

Applications

multimodal ai assistantssafety-critical ai systems

Read PDF arXiv

The Emotional Baby Is Truly Deadly: Does your Multimodal Large Reasoning Model Have Emotional Flattery towards Humans?

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

JPRO: Automated Multimodal Jailbreaking via Multi-Agent Collaboration Framework

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency

Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation

Text is All You Need for Vision-Language Model Jailbreaking

Multi-Turn Adaptive Prompting Attack on Large Vision-Language Models

Multimodal Safety Is Asymmetric: Cross-Modal Exploits Unlock Black-Box MLLMs Jailbreaks

SafeMT: Multi-turn Safety for Multimodal Language Models

VRSA: Jailbreaking Multimodal Large Language Models through Visual Reasoning Sequential Attack