MindGuard: Guardrail Classifiers for Multi-Turn Mental Health Support
José Pombal 1,2,3, Maya D'Eon 1, Nuno M. Guerreiro 1, Pedro Henrique Martins 1, António Farinhas 1, Ricardo Rei 1
Published on arXiv
2602.00950
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
MindGuard classifiers, when paired with clinician LMs, achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions than general-purpose safeguards, while reducing false positives.
Novel technique introduced: MindGuard
Large language models are increasingly used for mental health support, yet conversational coherence alone does not ensure clinical appropriateness. Existing general-purpose safeguards often fail to distinguish therapeutic disclosures from genuine clinical crises, leading to safety failures. To address this gap, we introduce a clinically grounded risk taxonomy, developed in collaboration with PhD-level psychologists, that identifies actionable harm (e.g., self-harm and harm to others) while preserving space for safe, non-crisis therapeutic content. We also release MindGuard-testset, a dataset of real-world multi-turn conversations annotated at the turn level by clinical experts. Using synthetic dialogues generated via a controlled two-agent setup, we train MindGuard, a family of lightweight safety classifiers (4B and 8B parameters). Our classifiers reduce false positives at high-recall operating points and, when paired with clinician language models, help achieve lower attack success and harmful engagement rates in adversarial multi-turn interactions than general-purpose safeguards. All models and human evaluation data are publicly released.
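The safeguard pattern described above, a lightweight turn-level classifier gating a clinician LM's replies, can be sketched as follows. This is an illustrative sketch, not the authors' implementation: `Turn`, `risk_score`, the keyword stub, and the escalation message are all assumptions standing in for a trained MindGuard classifier and a real crisis-response policy.

```python
# Hypothetical sketch of turn-level guardrail gating; names are assumptions,
# not the paper's code.
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # "user" or "assistant"
    text: str


def risk_score(turn: Turn) -> float:
    """Placeholder for a lightweight safety classifier.

    A real system would return the classifier's probability that the turn
    contains actionable harm (self-harm, harm to others); here we use a
    trivial keyword stub so the sketch is runnable.
    """
    return 1.0 if "end my life" in turn.text.lower() else 0.05


def guarded_reply(history: List[Turn], clinician_reply: str,
                  threshold: float = 0.5) -> str:
    """Gate the clinician LM's reply on the latest user turn's risk score.

    The threshold is an operating point: lowering it raises recall on
    crises at the cost of more false positives on safe disclosures.
    """
    if risk_score(history[-1]) >= threshold:
        return ("I'm concerned about your safety. "
                "Please reach out to a crisis line or emergency services.")
    return clinician_reply


history = [Turn("user", "I have been feeling low lately.")]
reply = guarded_reply(history, "Tell me more about what's been going on.")
```

For a safe, non-crisis disclosure the classifier stays below threshold and the clinician LM's reply passes through unchanged; only turns scored as actionable harm trigger the escalation path.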
Key Contributions
- Clinically grounded risk taxonomy for mental health LLMs, co-developed with PhD-level psychologists, distinguishing actionable harm from safe therapeutic disclosures
- MindGuard-testset: real-world multi-turn mental health conversations annotated at the turn level by clinical experts
- MindGuard family of lightweight safety classifiers (4B and 8B parameters) that reduce false positives at high-recall operating points and lower adversarial attack success rates versus general-purpose safeguards
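The "high-recall operating point" in the last contribution can be made concrete: given classifier scores on a labeled validation set, pick the highest threshold that still reaches a target recall on crisis turns, then report the resulting false-positive rate on safe disclosures. A minimal sketch follows; the toy data and helper function are illustrative assumptions, not the paper's evaluation code.

```python
# Illustrative sketch: choosing a high-recall operating point for a
# safety classifier. Data and names are assumptions, not from the paper.
import math
from typing import List, Tuple


def fpr_at_recall(scored: List[Tuple[float, int]],
                  target_recall: float) -> Tuple[float, float]:
    """Return (threshold, false-positive rate) at the highest threshold
    that still reaches `target_recall` on positive (crisis) turns.

    `scored` holds (classifier score, label) pairs, label 1 = crisis.
    Assumes at least one positive and one negative example.
    """
    positives = sorted((s for s, y in scored if y == 1), reverse=True)
    negatives = [s for s, y in scored if y == 0]
    # Number of positives that must score at or above the threshold.
    k = math.ceil(target_recall * len(positives))
    threshold = positives[k - 1]  # admit the k top-scoring positives
    false_pos = sum(1 for s in negatives if s >= threshold)
    return threshold, false_pos / len(negatives)


# Toy validation set: 3 crisis turns, 4 safe therapeutic disclosures.
val = [(0.9, 1), (0.8, 1), (0.7, 1), (0.75, 0), (0.2, 0), (0.1, 0), (0.05, 0)]
thr, fpr = fpr_at_recall(val, target_recall=1.0)
```

Relaxing `target_recall` raises the admissible threshold and typically shrinks the false-positive rate; a classifier with better-separated scores keeps that rate low even at near-perfect recall, which is the trade-off the contribution describes.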