EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang 1,2, Mingzi Zhang 3, Xuanyu Yin 3, Sheng Jin 4, Suyu Lu 1, Zuocan Ying 5, Zengyi Yu 3, Xiangjie Kong 1
1 Zhejiang University of Technology
2 Hong Kong University of Science and Technology
Published on arXiv: 2511.06890
Prompt Injection (OWASP LLM Top 10, LLM01)
Key Finding: Mid-sized LLMs can be the most vulnerable to persona-based jailbreaks (a "scaling paradox"), and the strongest safety indicator is a model's ability to deliver Educational Refusals that convert harmful requests into teachable moments.
Novel technique introduced: EduGuardBench
Abstract
Large Language Models for Simulating Professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, ensuring their professional competence and ethical safety is a critical challenge: existing benchmarks neither measure role-playing fidelity nor address the unique teaching harms inherent in educational scenarios. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts that target both general harms and, in particular, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive experiments on 14 leading models reveal a stark polarization in performance. While reasoning-oriented models generally show superior fidelity, incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox: mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. Critically, we identified a powerful Educational Transformation Effect: the safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals, a capacity strongly negatively correlated with ASR that reveals a new dimension of advanced AI safety. EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for deploying trustworthy AI in education. Materials are available at https://github.com/YL1N/EduGuardBench.
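The two safety metrics named in the abstract can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code: the tier labels ("flat", "explained", "educational") and the scoring scheme are assumptions standing in for the benchmark's actual three-tier Refusal Quality rubric.

```python
# Hypothetical sketch of the adversarial-safety metrics described above:
# Attack Success Rate (ASR) and a three-tier Refusal Quality breakdown.
# Tier names and judging labels are illustrative assumptions.
from collections import Counter

# Each adversarial probe is judged either as an attack success or as one
# of three refusal tiers (assumed labels, best tier last).
REFUSAL_TIERS = ["flat", "explained", "educational"]

def attack_success_rate(judgments: list[str]) -> float:
    """Fraction of adversarial prompts where the model complied."""
    if not judgments:
        return 0.0
    return sum(j == "success" for j in judgments) / len(judgments)

def refusal_quality(judgments: list[str]) -> dict[str, float]:
    """Share of each refusal tier among the prompts that were refused."""
    refusals = [j for j in judgments if j != "success"]
    counts = Counter(refusals)
    total = len(refusals) or 1
    return {tier: counts[tier] / total for tier in REFUSAL_TIERS}

judged = ["success", "educational", "educational", "flat", "explained"]
print(attack_success_rate(judged))  # 1 success out of 5 probes -> 0.2
print(refusal_quality(judged))      # educational refusals dominate
```

In this framing, the paper's "ideal Educational Refusal" corresponds to the highest tier: a refusal that also teaches why the request is harmful.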
Key Contributions
- Dual-component benchmark combining pedagogical fidelity (Role-playing Fidelity Score) with adversarial safety evaluation (ASR, Refusal Quality) for teacher-role LLMs
- Discovery of a scaling paradox where mid-sized LLMs can be the most vulnerable to persona-based jailbreaks, challenging monotonic safety assumptions
- Identification of the 'Educational Transformation Effect': safest models convert harmful requests into teachable moments, with this capacity strongly negatively correlated with ASR