Relationship-Aware Safety Unlearning for Multimodal LLMs
Vishnu Narayanan Anilkumar , Abhijith Sreesylesh Babu , Trieu Hai Vo , Mohankrishna Kolla , Alexander Cuneo
Published on arXiv: 2603.14185
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Suppresses unsafe object-relation-object (O-R-O) tuples (e.g., child-drinking-wine) while preserving benign uses of the same objects and relations, with demonstrated robustness against prompt obfuscation
Relationship-Aware Safety Unlearning
Novel technique introduced
Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations. We propose relationship-aware safety unlearning: a framework that explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations. We include CLIP-based experiments and a robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.
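The O-R-O schema described above can be sketched as a small data structure. This is an illustrative reconstruction, not the paper's code: the class name, field names, and caption template are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OROTuple:
    """An object-relation-object tuple, e.g. child-drinking-wine."""
    subject: str   # first object, e.g. "child"
    relation: str  # linking action/relation, e.g. "drinking"
    obj: str       # second object, e.g. "wine"

    def caption(self) -> str:
        # Render the tuple as a natural-language caption,
        # e.g. for encoding with a CLIP text encoder
        return f"a {self.subject} {self.relation} {self.obj}"


# The unsafe tuple targeted for unlearning
unsafe = OROTuple("child", "drinking", "wine")

# Benign neighbors share an object or relation with the unsafe tuple
# but are safe; a relationship-aware edit should leave these intact
benign_neighbors = [
    OROTuple("adult", "drinking", "wine"),
    OROTuple("child", "drinking", "milk"),
]
```

The point of the schema is that safety is attached to the full tuple, not to any single element: "child", "drinking", and "wine" each remain usable on their own and in the neighboring tuples.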
Key Contributions
- Formalizes unsafe relational tuples as object-relation-object (O-R-O) schema for contextual safety violations
- Proposes relationship-aware unlearning using parameter-efficient LoRA edits that suppress unsafe tuples while preserving safe marginals
- Evaluates robustness against paraphrase, contextual, and out-of-distribution image attacks on CLIP-based multimodal models
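The suppress-while-preserve trade-off in the contributions above can be written as a toy objective. This is a hedged sketch only: the hinge form, margin, and retain weighting are assumptions for illustration, not the paper's exact loss.

```python
def unlearning_loss(sim_unsafe: float, sims_retain: list[float],
                    lam: float = 1.0, margin: float = 0.2) -> float:
    """Toy relationship-aware unlearning objective (illustrative).

    sim_unsafe:  post-edit image-text similarity for the unsafe O-R-O tuple
    sims_retain: similarities for benign neighbors (object marginals and
                 safe neighboring relations) that must stay aligned
    """
    # Forget term: hinge that pushes the unsafe tuple's similarity
    # below the margin (zero once it is suppressed far enough)
    forget = max(0.0, sim_unsafe - margin)
    # Retain term: penalize any drop in alignment for safe neighbors
    retain = sum(1.0 - s for s in sims_retain) / len(sims_retain)
    return forget + lam * retain
```

In this sketch a LoRA adapter on the text or vision tower would be trained to minimize this loss, so the parameter-efficient edit suppresses only the unsafe tuple while the retain term anchors benign uses of the same objects and relations.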
🛡️ Threat Analysis
The paper addresses safety failures in multimodal LLMs where unsafe content is generated through specific relational prompts (e.g., child-drinking-wine). The unlearning method is designed to suppress unsafe relational tuples and is evaluated for robustness against prompt obfuscation, paraphrases, and compositional adversaries, all forms of prompt manipulation intended to elicit unsafe outputs. The defense directly targets LLM safety alignment.
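The robustness check described above can be sketched as a simple evaluation loop: a suppression edit only counts as robust if the unsafe caption and all of its paraphrased or obfuscated variants stay below a safety threshold. The helper name, threshold value, and toy similarity functions are assumptions for illustration.

```python
from typing import Callable, Iterable


def robust_suppressed(sim_fn: Callable[[str], float],
                      unsafe_caption: str,
                      paraphrases: Iterable[str],
                      threshold: float = 0.25) -> bool:
    """Return True iff a (hypothetical) post-edit similarity function keeps
    the unsafe caption AND every paraphrase below the safety threshold."""
    prompts = [unsafe_caption, *paraphrases]
    return all(sim_fn(p) < threshold for p in prompts)


# Toy stand-ins for a post-edit CLIP similarity score
well_edited = lambda prompt: 0.10   # suppression generalizes to paraphrases
leaky_edit = lambda prompt: 0.10 if "child drinking wine" in prompt else 0.60

paraphrases = ["a kid sipping wine", "a minor consuming an alcoholic drink"]
```

A leaky edit of this kind is exactly the failure mode the paper's paraphrase, contextual, and out-of-distribution attacks are designed to expose: the literal unsafe caption is suppressed, but reworded variants slip through.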