Relationship-Aware Safety Unlearning for Multimodal LLMs
Vishnu Narayanan Anilkumar , Abhijith Sreesylesh Babu , Trieu Hai Vo , Mohankrishna Kolla , Alexander Cuneo
Published on arXiv: 2603.14185
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Suppresses unsafe object-relation-object (O-R-O) tuples (e.g., child-drinking-wine) while preserving benign uses of the same objects and relations, with demonstrated robustness against prompt obfuscation
Relationship-Aware Safety Unlearning
Novel technique introduced
Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations. We propose relationship-aware safety unlearning: a framework that explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations. We include CLIP-based experiments and a robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.
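The O-R-O schema described above can be sketched as a small data structure. This is an illustrative reconstruction, not the paper's code: the class name, field names, and caption template are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OROTuple:
    """An object-relation-object tuple, e.g. child-drinking-wine."""
    subject: str   # first object, e.g. "child"
    relation: str  # linking action/relation, e.g. "drinking"
    obj: str       # second object, e.g. "wine"

    def caption(self) -> str:
        # Render the tuple as a natural-language caption,
        # e.g. for encoding with a CLIP text encoder
        return f"a {self.subject} {self.relation} {self.obj}"


# The unsafe tuple targeted for unlearning
unsafe = OROTuple("child", "drinking", "wine")

# Benign neighbors share an object or relation with the unsafe tuple
# but are safe; a relationship-aware edit should leave these intact
benign_neighbors = [
    OROTuple("adult", "drinking", "wine"),
    OROTuple("child", "drinking", "milk"),
]
```

The point of the schema is that safety is attached to the full tuple, not to any single element: "child", "drinking", and "wine" each remain usable on their own and in the neighboring tuples.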
Key Contributions
- Formalizes unsafe relational tuples as object-relation-object (O-R-O) schema for contextual safety violations
- Proposes relationship-aware unlearning using parameter-efficient LoRA edits that suppress unsafe tuples while preserving safe marginals
- Evaluates robustness against paraphrase, contextual, and out-of-distribution image attacks on CLIP-based multimodal models
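The suppress-while-preserve trade-off in the contributions above can be written as a toy objective. This is a hedged sketch only: the hinge form, margin, and retain weighting are assumptions for illustration, not the paper's exact loss.

```python
def unlearning_loss(sim_unsafe: float, sims_retain: list[float],
                    lam: float = 1.0, margin: float = 0.2) -> float:
    """Toy relationship-aware unlearning objective (illustrative).

    sim_unsafe:  post-edit image-text similarity for the unsafe O-R-O tuple
    sims_retain: similarities for benign neighbors (object marginals and
                 safe neighboring relations) that must stay aligned
    """
    # Forget term: hinge that pushes the unsafe tuple's similarity
    # below the margin (zero once it is suppressed far enough)
    forget = max(0.0, sim_unsafe - margin)
    # Retain term: penalize any drop in alignment for safe neighbors
    retain = sum(1.0 - s for s in sims_retain) / len(sims_retain)
    return forget + lam * retain
```

In this sketch a LoRA adapter on the text or vision tower would be trained to minimize this loss, so the parameter-efficient edit suppresses only the unsafe tuple while the retain term anchors benign uses of the same objects and relations.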
🛡️ Threat Analysis
The paper addresses safety failures in multimodal LLMs where unsafe content is generated through specific relational prompts (e.g., child-drinking-wine). The unlearning method is designed to suppress unsafe relational tuples and is evaluated for robustness against prompt obfuscation, paraphrases, and compositional adversaries, all forms of prompt manipulation intended to elicit unsafe outputs. The defense directly targets LLM safety alignment.
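The robustness check described above can be sketched as a simple evaluation loop: a suppression edit only counts as robust if the unsafe caption and all of its paraphrased or obfuscated variants stay below a safety threshold. The helper name, threshold value, and toy similarity functions are assumptions for illustration.

```python
from typing import Callable, Iterable


def robust_suppressed(sim_fn: Callable[[str], float],
                      unsafe_caption: str,
                      paraphrases: Iterable[str],
                      threshold: float = 0.25) -> bool:
    """Return True iff a (hypothetical) post-edit similarity function keeps
    the unsafe caption AND every paraphrase below the safety threshold."""
    prompts = [unsafe_caption, *paraphrases]
    return all(sim_fn(p) < threshold for p in prompts)


# Toy stand-ins for a post-edit CLIP similarity score
well_edited = lambda prompt: 0.10   # suppression generalizes to paraphrases
leaky_edit = lambda prompt: 0.10 if "child drinking wine" in prompt else 0.60

paraphrases = ["a kid sipping wine", "a minor consuming an alcoholic drink"]
```

A leaky edit of this kind is exactly the failure mode the paper's paraphrase, contextual, and out-of-distribution attacks are designed to expose: the literal unsafe caption is suppressed, but reworded variants slip through.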