MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs
Liang Shan 1, Kaicheng Shen 1, Wen Wu 1, Zhenyu Ying 1, Chaochao Lu 2, Yan Teng 2, Jingqi Huang 1, Guangze Ye 1, Guoqing Wang 1, Liang He 1
Published on arXiv
2511.07107
Prompt Injection
OWASP LLM Top 10: LLM01
Key Finding
MENTOR reduces the average jailbreak success rate from 57.8% to 4.6% across 14 LLMs, with activation steering alone contributing a 50.1% reduction, tuning-free.
MENTOR
Novel technique introduced
Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset of 3,000 annotated queries spanning education, finance, and management. Evaluations across 14 leading LLMs reveal a concerning vulnerability: an average jailbreak success rate of 57.8%. In response, we propose MENTOR, a metacognition-driven self-evolution framework. MENTOR first performs structured self-assessment through simulated critical thinking, such as perspective-taking and consequential reasoning, to uncover latent model misalignments. These reflections are formalized into dynamic rule-based knowledge graphs that evolve with emerging risk patterns. To enforce these rules at inference time, we introduce an activation steering method that directly modulates the model's internal representations to ensure compliance. Experiments demonstrate that MENTOR substantially reduces attack success rates across all tested domains and achieves risk-analysis performance comparable to that of human experts. Our work offers a scalable and adaptive pathway toward robust domain-specific alignment of LLMs.
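The paper does not reproduce its steering implementation in this summary, but the general idea behind activation steering can be sketched as follows. A steering direction is derived (here, hypothetically, as a normalized difference of mean activations between rule-compliant and rule-violating examples, a common recipe that may differ from MENTOR's exact method) and added to a layer's hidden states at inference time, shifting generations toward compliant behavior without any weight updates:

```python
import numpy as np

def steering_vector(safe_acts, unsafe_acts):
    # Difference-of-means direction between rule-compliant and
    # rule-violating activations (one common way to derive a
    # steering direction; MENTOR's exact recipe may differ).
    v = safe_acts.mean(axis=0) - unsafe_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def steer(hidden, v, alpha=4.0):
    # Shift every token's hidden state along the compliant
    # direction; alpha controls steering strength.
    return hidden + alpha * v

rng = np.random.default_rng(0)
d = 16                                    # toy hidden size
safe = rng.normal(0.5, 1.0, (32, d))      # stand-in activations
unsafe = rng.normal(-0.5, 1.0, (32, d))
v = steering_vector(safe, unsafe)
h = rng.normal(0.0, 1.0, (8, d))          # hidden states for 8 tokens
h_steered = steer(h, v)
# Steered states project more strongly onto the compliant direction.
print((h_steered @ v).mean() > (h @ v).mean())
```

In a real transformer this shift would typically be applied inside a forward hook at a chosen layer; the hidden size, layer choice, and `alpha` here are illustrative placeholders.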
Key Contributions
- A 3,000-query domain-specific implicit risk benchmark across education, finance, and management, evaluated on 14 LLMs revealing a 57.8% average jailbreak success rate
- A metacognition-driven self-assessment mechanism using perspective-taking and consequential reasoning to uncover latent LLM misalignments and convert them into dynamic rule graphs
- An activation steering module that directly modulates internal model representations at inference time to enforce domain safety rules without retraining, reducing the average jailbreak success rate from 57.8% to 4.6%
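To make the "dynamic rule-based knowledge graph" contribution concrete, here is a deliberately minimal sketch, with an assumed schema (domain nodes linked to risk-pattern nodes, each carrying a mitigation rule) that the paper does not specify; the class name, matching logic, and example rules are all hypothetical:

```python
from collections import defaultdict

class RuleGraph:
    """Toy dynamic rule graph: domain -> risk pattern -> rule.

    Illustrative only; MENTOR's actual graph schema and update
    procedure are not given in this summary.
    """
    def __init__(self):
        self.edges = defaultdict(set)   # domain -> risk patterns
        self.rules = {}                 # risk pattern -> mitigation rule

    def add_rule(self, domain, pattern, rule):
        # Reflections from self-assessment are formalized as new
        # (domain -> pattern -> rule) edges, so the graph evolves
        # as new risk patterns are uncovered.
        self.edges[domain].add(pattern)
        self.rules[pattern] = rule

    def applicable_rules(self, domain, query):
        # Naive substring match standing in for real risk detection.
        return [self.rules[p] for p in sorted(self.edges[domain])
                if p in query.lower()]

g = RuleGraph()
g.add_rule("finance", "insider",
           "Refuse requests that rely on non-public information.")
g.add_rule("education", "exam answer",
           "Guide the student instead of revealing answers.")
print(g.applicable_rules("finance", "How do I trade on insider tips?"))
# → ['Refuse requests that rely on non-public information.']
```

At inference time, the rules retrieved for a query would condition the steering step described in the paper; here the two stages are shown separately for clarity.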