defense 2026

Relationship-Aware Safety Unlearning for Multimodal LLMs

Vishnu Narayanan Anilkumar, Abhijith Sreesylesh Babu, Trieu Hai Vo, Mohankrishna Kolla, Alexander Cuneo

0 citations


Published on arXiv: 2603.14185

Prompt Injection

OWASP LLM Top 10: LLM01

Key Finding

Suppresses unsafe O-R-O tuples (e.g., child-drinking-wine) while preserving benign uses of the same objects and relations, with demonstrated robustness against prompt obfuscation

Relationship-Aware Safety Unlearning

Novel technique introduced


Generative multimodal models can exhibit safety failures that are inherently relational: two benign concepts can become unsafe when linked by a specific action or relation (e.g., child-drinking-wine). Existing unlearning and concept-erasure approaches often target isolated concepts or image-text pairs, which can cause collateral damage to benign uses of the same objects and relations. We propose relationship-aware safety unlearning: a framework that explicitly represents unsafe object-relation-object (O-R-O) tuples and applies targeted parameter-efficient edits (LoRA) to suppress unsafe tuples while preserving object marginals and safe neighboring relations. We include CLIP-based experiments and robustness evaluation under paraphrase, contextual, and out-of-distribution image attacks.
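The abstract's core idea, that a violation lives in the tuple rather than in either object alone, can be sketched in a few lines. This is a hypothetical illustration: the `OROTuple` class, the unsafe-tuple list, and the naive triple-scanning extractor are assumptions for demonstration, not the paper's implementation (which operates on model parameters, not captions).

```python
# Hypothetical sketch of the object-relation-object (O-R-O) schema.
# All names and the matching logic here are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class OROTuple:
    subject: str   # e.g. "child"
    relation: str  # e.g. "drinking"
    obj: str       # e.g. "wine"


# A toy unsafe-tuple list; a real deployment would curate many more.
UNSAFE_TUPLES = {
    OROTuple("child", "drinking", "wine"),
}


def extract_tuples(caption: str) -> set:
    """Naive subject-relation-object extraction from a flat caption.

    A real system would use a scene-graph or relation parser; here we
    just scan consecutive word triples as a placeholder."""
    words = caption.lower().split()
    return {OROTuple(*words[i:i + 3]) for i in range(len(words) - 2)}


def violates_relational_safety(caption: str) -> bool:
    """True only when an unsafe O-R-O tuple appears as a whole; each
    object on its own (the object marginal) stays permitted."""
    return not UNSAFE_TUPLES.isdisjoint(extract_tuples(caption))


print(violates_relational_safety("a child drinking wine"))   # True
print(violates_relational_safety("an adult drinking wine"))  # False
print(violates_relational_safety("a child drinking milk"))   # False
```

The last two calls show why tuple-level suppression avoids collateral damage: "wine" and "child" remain individually usable, and only their unsafe combination is flagged.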


Key Contributions

  • Formalizes unsafe relational tuples as object-relation-object (O-R-O) schema for contextual safety violations
  • Proposes relationship-aware unlearning using parameter-efficient LoRA edits that suppress unsafe tuples while preserving safe marginals
  • Evaluates robustness against paraphrase, contextual, and out-of-distribution image attacks on CLIP-based multimodal models
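The second contribution relies on LoRA's key property: the base weight stays frozen and the edit flows only through a low-rank correction. A minimal NumPy sketch of that mechanism follows; the shapes, rank, and zero-initialization are standard LoRA conventions chosen for illustration, not the paper's actual settings.

```python
# Minimal sketch of a parameter-efficient LoRA edit: the frozen weight
# W is modified only via a low-rank update B @ A, so an unlearning edit
# trains r*(d_in + d_out) parameters instead of d_in*d_out.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2               # r << d_in gives the saving

W = rng.normal(size=(d_out, d_in))     # frozen base projection
A = rng.normal(size=(r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))               # trainable up-projection; zero
                                       # init keeps the edit a no-op
                                       # before any training


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Edited projection: base path plus the low-rank correction."""
    return W @ x + B @ (A @ x)


x = rng.normal(size=(d_in,))
# With B zeroed, the edited model matches the base model exactly:
assert np.allclose(lora_forward(x), W @ x)

# Training would adjust B and A so that unsafe O-R-O directions are
# suppressed while W (and hence object marginals) is left untouched.
full_params = d_out * d_in
lora_params = r * (d_in + d_out)
print(f"full edit: {full_params} params, LoRA edit: {lora_params} params")
```

The parameter count printed at the end (32 vs. 64 here; far more lopsided at transformer scale) is what makes the edit "targeted": only the small `B`, `A` pair carries the unlearning signal.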

🛡️ Threat Analysis

Prompt Injection

The paper addresses safety failures in multimodal LLMs where unsafe content is generated through specific relational prompts (e.g., child-drinking-wine). The unlearning method is designed to suppress unsafe relational tuples and is evaluated for robustness against prompt obfuscation, paraphrases, and compositional adversaries, all of which are forms of prompt manipulation used to elicit unsafe outputs. The defense directly targets LLM safety alignment.
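The paraphrase-robustness requirement above can be made concrete with a toy check: an attacker rewrites the unsafe prompt with synonyms, and a robust defense should still map it to the same canonical tuple. The synonym table and triple matching below are illustrative assumptions, not the paper's evaluation protocol (which attacks a CLIP-based model, not a string matcher).

```python
# Hedged sketch of the paraphrase-attack idea: reword the prompt,
# check whether the defense still recognizes the canonical unsafe
# tuple. Synonym map and matching logic are illustrative assumptions.
UNSAFE = ("child", "drinking", "wine")

SYNONYMS = {
    "kid": "child", "minor": "child", "toddler": "child",
    "sipping": "drinking", "consuming": "drinking",
    "merlot": "wine", "chardonnay": "wine",
}


def canonicalize(word: str) -> str:
    return SYNONYMS.get(word, word)


def blocked(prompt: str) -> bool:
    """True if the canonicalized prompt contains the unsafe tuple."""
    words = [canonicalize(w) for w in prompt.lower().split()]
    return any(tuple(words[i:i + 3]) == UNSAFE
               for i in range(len(words) - 2))


paraphrase_attacks = [
    "a kid sipping merlot",
    "toddler consuming chardonnay",
    "child drinking wine",
]
print(all(blocked(p) for p in paraphrase_attacks))   # True
print(blocked("an adult drinking wine"))             # False
```

A defense that only memorized the surface string "child drinking wine" would fail the first two attacks; operating at the tuple level is what gives robustness to this class of obfuscation.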


Details

Domains
multimodal, nlp
Model Types
llm, vlm, multimodal, transformer
Threat Tags
training_time, inference_time
Applications
multimodal content generation, vision-language reasoning, text-to-image generation