
MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support

José Pombal 1,2,3, Maya D'Eon 1, Nuno M. Guerreiro 1, Pedro Henrique Martins 1, António Farinhas 1, Ricardo Rei 1

24 references · arXiv (Cornell University)

Published on arXiv: 2602.00950

Prompt Injection

OWASP LLM Top 10, LLM01

Key Finding

When paired with clinician LMs, MindGuard classifiers achieve lower attack-success and harmful-engagement rates in adversarial multi-turn interactions than general-purpose safeguards, while also reducing false positives.

MindGuard

Novel technique introduced


Abstract

Large language models are increasingly used for mental health support, yet their conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish between therapeutic disclosures and genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (with 4B and 8B parameters). Our classifiers reduce false positives at high-recall operating points and, when paired with clinician language models, help achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions compared to general-purpose safeguards. We release all models and human evaluation data.


Key Contributions

  • Clinically grounded risk taxonomy for mental health LLMs, co-developed with PhD-level psychologists, distinguishing actionable harm from safe therapeutic disclosures
  • MindGuard-testset: real-world multi-turn mental health conversations annotated at the turn level by clinical experts
  • MindGuard family of lightweight safety classifiers (4B and 8B parameters) that reduce false positives at high-recall operating points and lower adversarial attack success rates versus general-purpose safeguards
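The paper does not publish its inference code, but the described deployment pattern (a turn-level guardrail classifier gating a clinician LM, with a threshold chosen at a high-recall operating point) can be sketched as below. All names here (`guard_gate`, `toy_classifier`, `toy_clinician`, the label strings, the threshold value) are illustrative assumptions, not MindGuard's actual API; the toy keyword classifier merely stands in for the trained 4B/8B models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical labels reflecting the taxonomy's key distinction:
# actionable harm vs. safe, non-crisis therapeutic disclosure.
SAFE = "safe_disclosure"
ACTIONABLE_HARM = "actionable_harm"

@dataclass
class Turn:
    role: str   # "user" or "assistant"
    text: str

def guard_gate(
    history: List[Turn],
    classify: Callable[[List[Turn]], Tuple[str, float]],  # guardrail classifier
    respond: Callable[[List[Turn]], str],                 # clinician LM
    threshold: float = 0.5,   # assumed high-recall operating point
) -> str:
    """Classify the latest turn in its multi-turn context; escalate on
    actionable harm, otherwise let the clinician LM respond."""
    label, score = classify(history)
    if label == ACTIONABLE_HARM and score >= threshold:
        return ("I'm concerned about your safety. Please contact a crisis "
                "line or local emergency services right away.")
    return respond(history)

# Toy stand-ins for demonstration only.
def toy_classifier(history: List[Turn]) -> Tuple[str, float]:
    last = history[-1].text.lower()
    risky = any(k in last for k in ("end my life", "hurt someone"))
    return (ACTIONABLE_HARM, 0.9) if risky else (SAFE, 0.1)

def toy_clinician(history: List[Turn]) -> str:
    return "It sounds like that was hard. Can you tell me more?"

convo = [Turn("user", "I keep thinking I want to end my life.")]
print(guard_gate(convo, toy_classifier, toy_clinician))
```

Classifying the full `history` rather than the last message in isolation matches the paper's multi-turn framing, where context determines whether a disclosure is a crisis or safe therapeutic content.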

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, black_box
Datasets
MindGuard-testset
Applications
mental health chatbots, llm safety guardrails