EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers
Yilin Jiang 1,2, Mingzi Zhang 3, Xuanyu Yin 3, Sheng Jin 4, Suyu Lu 1, Zuocan Ying 5, Zengyi Yu 3, Xiangjie Kong 1
1 Zhejiang University of Technology
2 Hong Kong University of Science and Technology
Published on arXiv: 2511.06890
Prompt Injection (OWASP LLM Top 10, LLM01)
Key Finding: Mid-sized LLMs can be the most vulnerable to persona-based jailbreaks (a "scaling paradox"), and the strongest safety indicator is a model's ability to deliver Educational Refusals that convert harmful requests into teachable moments.
Novel technique introduced: EduGuardBench
Abstract
Large Language Models for Simulating Professions (SP-LLMs), particularly as teachers, are pivotal for personalized education. However, ensuring their professional competence and ethical safety is a critical challenge: existing benchmarks neither measure role-playing fidelity nor address the unique teaching harms inherent in educational scenarios. To address this, we propose EduGuardBench, a dual-component benchmark. It assesses professional fidelity using a Role-playing Fidelity Score (RFS) while diagnosing harms specific to the teaching profession. It also probes safety vulnerabilities using persona-based adversarial prompts that target both general harms and, in particular, academic misconduct, evaluated with metrics including Attack Success Rate (ASR) and a three-tier Refusal Quality assessment. Our extensive experiments on 14 leading models reveal a stark polarization in performance. While reasoning-oriented models generally show superior fidelity, incompetence remains the dominant failure mode across most models. The adversarial tests uncovered a counterintuitive scaling paradox: mid-sized models can be the most vulnerable, challenging monotonic safety assumptions. Critically, we identified a powerful Educational Transformation Effect: the safest models excel at converting harmful requests into teachable moments by providing ideal Educational Refusals, a capacity strongly negatively correlated with ASR that reveals a new dimension of advanced AI safety. EduGuardBench thus provides a reproducible framework that moves beyond siloed knowledge tests toward a holistic assessment of professional, ethical, and pedagogical alignment, uncovering complex dynamics essential for deploying trustworthy AI in education. Materials are available at https://github.com/YL1N/EduGuardBench.
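The two safety metrics named in the abstract can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code: the tier labels ("flat", "explained", "educational") and the scoring scheme are assumptions standing in for the benchmark's actual three-tier Refusal Quality rubric.

```python
# Hypothetical sketch of the adversarial-safety metrics described above:
# Attack Success Rate (ASR) and a three-tier Refusal Quality breakdown.
# Tier names and judging labels are illustrative assumptions.
from collections import Counter

# Each adversarial probe is judged either as an attack success or as one
# of three refusal tiers (assumed labels, best tier last).
REFUSAL_TIERS = ["flat", "explained", "educational"]

def attack_success_rate(judgments: list[str]) -> float:
    """Fraction of adversarial prompts where the model complied."""
    if not judgments:
        return 0.0
    return sum(j == "success" for j in judgments) / len(judgments)

def refusal_quality(judgments: list[str]) -> dict[str, float]:
    """Share of each refusal tier among the prompts that were refused."""
    refusals = [j for j in judgments if j != "success"]
    counts = Counter(refusals)
    total = len(refusals) or 1
    return {tier: counts[tier] / total for tier in REFUSAL_TIERS}

judged = ["success", "educational", "educational", "flat", "explained"]
print(attack_success_rate(judged))  # 1 success out of 5 probes -> 0.2
print(refusal_quality(judged))      # educational refusals dominate
```

In this framing, the paper's "ideal Educational Refusal" corresponds to the highest tier: a refusal that also teaches why the request is harmful.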
Key Contributions
- Dual-component benchmark combining pedagogical fidelity (Role-playing Fidelity Score) with adversarial safety evaluation (ASR, Refusal Quality) for teacher-role LLMs
- Discovery of a scaling paradox where mid-sized LLMs can be the most vulnerable to persona-based jailbreaks, challenging monotonic safety assumptions
- Identification of the 'Educational Transformation Effect': safest models convert harmful requests into teachable moments, with this capacity strongly negatively correlated with ASR