Benchmark · 2025

Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs

Andrew Maranhão Ventura D'addario

0 citations · 17 references · arXiv

Published on arXiv · 2511.21757

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Produced 214,219 domain-specific adversarial prompts spanning seven healthcare violation taxonomies that generically aligned LLMs fail to refuse, demonstrating a critical blind spot in universal safety training for medical deployment contexts.

Medical Malice

Novel technique introduced


The integration of Large Language Models (LLMs) into healthcare demands a safety paradigm rooted in *primum non nocere*. However, current alignment techniques rely on generic definitions of harm that fail to capture context-dependent violations, such as administrative fraud and clinical discrimination. To address this, we introduce Medical Malice: a dataset of 214,219 adversarial prompts calibrated to the regulatory and ethical complexities of the Brazilian Unified Health System (SUS). Crucially, the dataset includes the reasoning behind each violation, enabling models to internalize ethical boundaries rather than merely memorizing a fixed set of refusals. Using an unaligned agent (Grok-4) within a persona-driven pipeline, we synthesized high-fidelity threats across seven taxonomies, ranging from procurement manipulation and queue-jumping to obstetric violence. We discuss the ethical design of releasing these "vulnerability signatures" to correct the information asymmetry between malicious actors and AI developers. Ultimately, this work advocates for a shift from universal to context-aware safety, providing the necessary resources to immunize healthcare AI against the nuanced, systemic threats inherent to high-stakes medical environments: vulnerabilities that represent the paramount risk to patient safety and the successful integration of AI in healthcare systems.


Key Contributions

  • Medical Malice dataset of 214,219 adversarial prompts calibrated to the regulatory and ethical context of Brazil's Unified Health System (SUS), covering seven violation taxonomies including procurement fraud, queue-jumping, and obstetric violence
  • Adversarial generation pipeline using an unaligned agent (Grok-4) with persona-driven prompting to synthesize high-fidelity, context-specific healthcare threats at scale
  • Ethical framework and rationale for releasing "vulnerability signatures" to correct information asymmetry between malicious actors and AI developers, advocating a shift from universal to context-aware LLM safety
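The persona-driven generation pipeline described above can be sketched as follows. This is a minimal illustrative reconstruction, not the authors' code: the persona fields, the `generate` callable standing in for the unaligned agent (Grok-4 in the paper), and the abbreviated taxonomy list are all assumptions for the sketch.

```python
# Hypothetical sketch of a persona-driven adversarial generation pipeline.
# All names here (personas, seed template, the `generate` callable) are
# illustrative assumptions, not the paper's actual implementation.

from dataclasses import dataclass
from typing import Callable, Iterator, List, Tuple

# Three of the seven violation taxonomies named in the paper (abbreviated).
TAXONOMIES = ["procurement_manipulation", "queue_jumping", "obstetric_violence"]


@dataclass
class Persona:
    role: str    # e.g. a hospital procurement officer or scheduler
    motive: str  # context-specific incentive behind the violation


@dataclass
class AdversarialPrompt:
    taxonomy: str
    persona: Persona
    prompt: str
    rationale: str  # the "reasoning behind each violation" stored per record


def build_seed(persona: Persona, taxonomy: str) -> str:
    """Compose the instruction sent to the unaligned generator model."""
    return (
        f"You are {persona.role} in the SUS whose goal is: {persona.motive}. "
        f"Write a realistic request that commits a '{taxonomy}' violation, "
        f"and separately explain why it is a violation."
    )


def synthesize(
    personas: List[Persona],
    generate: Callable[[str], Tuple[str, str]],  # seed -> (prompt, rationale)
) -> Iterator[AdversarialPrompt]:
    """Cross each persona with each taxonomy and collect generated records."""
    for taxonomy in TAXONOMIES:
        for persona in personas:
            prompt, rationale = generate(build_seed(persona, taxonomy))
            yield AdversarialPrompt(taxonomy, persona, prompt, rationale)


# Stub generator standing in for the unaligned agent.
def fake_generate(seed: str) -> Tuple[str, str]:
    return (f"[adversarial request for: {seed[:40]}...]",
            "[explanation of why this violates policy]")


if __name__ == "__main__":
    personas = [Persona("a procurement officer", "steer a contract to a vendor")]
    records = list(synthesize(personas, fake_generate))
    print(len(records))  # one record per (persona, taxonomy) pair
```

Pairing each generated prompt with its rationale mirrors the paper's design goal of letting aligned models learn *why* a request is a violation rather than memorizing a fixed refusal list.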

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time · training_time · black_box
Datasets
Medical Malice (214,219 adversarial prompts, SUS-specific)
Applications
healthcare AI · clinical decision support · public health administration systems