Medical Malice: A Dataset for Context-Aware Safety in Healthcare LLMs
Andrew Maranhão Ventura D'addario
Published on arXiv: 2511.21757
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Produced 214,219 domain-specific adversarial prompts spanning seven healthcare violation taxonomies that generically aligned LLMs fail to refuse, demonstrating a critical blind spot in universal safety training for medical deployment contexts.
Novel Technique Introduced
Medical Malice
The integration of Large Language Models (LLMs) into healthcare demands a safety paradigm rooted in *primum non nocere*. However, current alignment techniques rely on generic definitions of harm that fail to capture context-dependent violations, such as administrative fraud and clinical discrimination. To address this, we introduce Medical Malice: a dataset of 214,219 adversarial prompts calibrated to the regulatory and ethical complexities of the Brazilian Unified Health System (SUS). Crucially, the dataset includes the reasoning behind each violation, enabling models to internalize ethical boundaries rather than merely memorizing a fixed set of refusals. Using an unaligned agent (Grok-4) within a persona-driven pipeline, we synthesized high-fidelity threats across seven taxonomies, ranging from procurement manipulation and queue-jumping to obstetric violence. We discuss the ethical design of releasing these "vulnerability signatures" to correct the information asymmetry between malicious actors and AI developers. Ultimately, this work advocates for a shift from universal to context-aware safety, providing the necessary resources to immunize healthcare AI against the nuanced, systemic threats inherent to high-stakes medical environments: vulnerabilities that represent the paramount risk to patient safety and the successful integration of AI in healthcare systems.
Key Contributions
- Medical Malice dataset of 214,219 adversarial prompts calibrated to the regulatory and ethical context of Brazil's Unified Health System (SUS), covering seven violation taxonomies including procurement fraud, queue-jumping, and obstetric violence
- Adversarial generation pipeline using an unaligned agent (Grok-4) with persona-driven prompting to synthesize high-fidelity, context-specific healthcare threats at scale
- Ethical framework and rationale for releasing "vulnerability signatures" to correct the information asymmetry between malicious actors and AI developers, advocating a shift from universal to context-aware LLM safety
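The summary does not specify the pipeline's internals, but its described shape (an unaligned generator model prompted under varying personas, across violation taxonomies, emitting both a prompt and its violation rationale) can be sketched roughly as below. All personas, function names, and the stub generator are hypothetical illustrations; the actual system calls an unaligned model (Grok-4) rather than a stub, and covers seven taxonomies, of which only three are named in this summary.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Taxonomies explicitly named in the summary; the full dataset spans seven.
NAMED_TAXONOMIES = [
    "procurement manipulation",
    "queue-jumping",
    "obstetric violence",
]

@dataclass
class AdversarialRecord:
    taxonomy: str
    persona: str
    prompt: str
    rationale: str  # reasoning behind the violation, included per the dataset design

def generate_records(
    taxonomies: List[str],
    personas: List[str],
    generate: Callable[[str, str], Tuple[str, str]],
) -> List[AdversarialRecord]:
    """Persona-driven loop: for each (taxonomy, persona) pair, ask the
    generator for an adversarial prompt plus the rationale for why it
    violates SUS rules."""
    records = []
    for taxonomy in taxonomies:
        for persona in personas:
            prompt, rationale = generate(taxonomy, persona)
            records.append(AdversarialRecord(taxonomy, persona, prompt, rationale))
    return records

# Stand-in for the unaligned-model call; hypothetical personas for illustration.
def stub_generate(taxonomy: str, persona: str) -> Tuple[str, str]:
    return (
        f"[{persona}] prompt targeting {taxonomy}",
        f"explanation of why this violates SUS rules on {taxonomy}",
    )

records = generate_records(
    NAMED_TAXONOMIES,
    ["hospital administrator", "clinic scheduling clerk"],
    stub_generate,
)
```

Pairing each prompt with its rationale reflects the paper's stated goal of letting models internalize ethical boundaries rather than memorize a fixed refusal list.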