SafeMed-R1: Adversarial Reinforcement Learning for Generalizable and Robust Medical Reasoning in Vision-Language Models
A.A. Gde Yogi Pramana, Jason Ray, Anthony Jaya, Michael Wijaya
Published on arXiv
arXiv:2512.19317
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
SafeMed-R1 maintains 84.45% accuracy under PGD attacks compared to ~25% for standard fine-tuned VLMs, a 59 percentage point improvement in adversarial robustness.
SafeMed-R1 (AT-GRPO)
Novel technique introduced
Vision-Language Models (VLMs) show significant promise for Medical Visual Question Answering (VQA), yet their deployment in clinical settings is hindered by severe vulnerability to adversarial attacks. Standard adversarial training, while effective for simpler tasks, often degrades both generalization performance and the quality of generated clinical reasoning. We introduce SafeMed-R1, a hybrid defense framework that ensures robust performance while preserving high-quality, interpretable medical reasoning. SafeMed-R1 employs a two-stage approach: at training time, we integrate Adversarial Training with Group Relative Policy Optimization (AT-GRPO) to explicitly robustify the reasoning process against worst-case perturbations; at inference time, we augment the model with Randomized Smoothing to provide certified L2-norm robustness guarantees. We evaluate SafeMed-R1 on the OmniMedVQA benchmark across eight medical imaging modalities comprising over 88,000 samples. Our experiments reveal that standard fine-tuned VLMs, despite achieving 95% accuracy on clean inputs, collapse to approximately 25% under PGD attacks. In contrast, SafeMed-R1 maintains 84.45% accuracy under the same adversarial conditions, representing a 59 percentage point improvement in robustness. Furthermore, we demonstrate that models trained with explicit chain-of-thought reasoning exhibit superior adversarial robustness compared to instruction-only variants, suggesting a synergy between interpretability and security in medical AI systems.
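The paper does not include code; for readers unfamiliar with the attack model, the sketch below shows a generic L-infinity PGD loop of the kind used in such evaluations. The toy loss, step size, and budget are illustrative assumptions, not the paper's evaluation settings.

```python
import numpy as np

def pgd_attack(x, grad_fn, eps=0.03, alpha=0.01, steps=10):
    """L_inf-bounded PGD: repeatedly step along the sign of the loss
    gradient, then project back into the eps-ball around the clean input
    and into the valid pixel range."""
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + alpha * np.sign(g)        # gradient-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project to eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep pixels in [0, 1]
    return x_adv

# Toy differentiable loss ||x - t||^2 with analytic gradient 2(x - t),
# standing in for the model's loss on a medical image.
t = np.zeros(4)
grad = lambda x: 2.0 * (x - t)
x_clean = np.full(4, 0.5)
x_adv = pgd_attack(x_clean, grad)
```

The projection step is what distinguishes PGD from plain gradient ascent: however many steps are taken, the perturbation never exceeds the eps budget, which is why clean accuracy of 95% collapsing to ~25% under such small perturbations is notable.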
Key Contributions
- AT-GRPO: first framework combining adversarial training with Group Relative Policy Optimization to robustify VLM reasoning under worst-case perturbations without degrading chain-of-thought quality
- Hybrid empirical + certified defense: AT-GRPO at training time paired with Randomized Smoothing at inference time for certified L2-norm robustness guarantees
- Empirical finding that models trained with explicit chain-of-thought reasoning exhibit superior adversarial robustness, suggesting a synergy between interpretability and security
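The AT-GRPO details are in the paper itself; as background, GRPO's defining step is computing advantages relative to a group of sampled responses rather than a learned value baseline. The sketch below shows that group-relative normalization in isolation; the reward values are hypothetical and the adversarial-training outer loop is omitted.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward by the mean and std of its group, so no separate
    value network is needed as a baseline."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Hypothetical rewards for 4 sampled chain-of-thought answers to one
# (possibly adversarially perturbed) medical VQA query.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.5])
```

In an AT-GRPO setup, the perturbed inputs would be generated inside the training loop so the policy gradient robustifies the reasoning against worst-case perturbations, rather than only clean inputs.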
🛡️ Threat Analysis
Proposes a defense (SafeMed-R1) against adversarial visual perturbations (PGD, FGSM, C&W) on VLMs at inference time; the core contribution is a novel adversarial training method (AT-GRPO) that robustifies the VLM reasoning process against worst-case input perturbations, supplemented by Randomized Smoothing for certified L2-norm robustness guarantees.
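The certified half of the defense follows the standard randomized-smoothing recipe: classify many Gaussian-noised copies of the input and certify an L2 radius from the top class's vote share. A minimal Monte-Carlo sketch, assuming a Cohen-et-al.-style radius formula and a toy base classifier (the paper's actual sampling counts and sigma are not reproduced here):

```python
import numpy as np
from statistics import NormalDist

def smoothed_predict(classify, x, sigma=0.25, n=1000, seed=0):
    """Monte-Carlo estimate of the smoothed classifier
    g(x) = argmax_c P[f(x + noise) = c], noise ~ N(0, sigma^2 I),
    plus a certified L2 radius R = sigma * Phi^{-1}(p_top)."""
    rng = np.random.default_rng(seed)
    votes = {}
    for _ in range(n):
        c = classify(x + rng.normal(0.0, sigma, size=x.shape))
        votes[c] = votes.get(c, 0) + 1
    top_class, top_votes = max(votes.items(), key=lambda kv: kv[1])
    p_hat = top_votes / n
    if p_hat <= 0.5:
        return top_class, 0.0  # abstain: no certificate
    radius = sigma * NormalDist().inv_cdf(min(p_hat, 1 - 1e-6))
    return top_class, radius

# Toy base classifier: sign of the mean pixel value.
f = lambda x: int(np.mean(x) > 0)
label, radius = smoothed_predict(f, np.full(16, 0.4))
```

The guarantee is that every perturbation with L2 norm below the returned radius leaves the smoothed prediction unchanged, which is what makes this a certified complement to the empirical AT-GRPO defense.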