defense 2026

RePAIR: Interactive Machine Unlearning through Prompt-Aware Model Repair

Jagadeesh Rachapudi, Pranav Singh, Ritali Vatsi, Praful Hambarde, Amit Shukla


Published on arXiv: 2604.12820

Model Inversion Attack

OWASP ML Top 10 — ML03

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six baselines

RePAIR

Novel technique introduced


Large language models (LLMs) inherently absorb harmful knowledge, misinformation, and personal data during pretraining on large-scale web corpora, with no native mechanism for selective removal. While machine unlearning offers a principled solution, existing approaches are provider-centric, requiring retraining pipelines, curated retain datasets, and direct intervention by model service providers (MSPs), thereby excluding end users from controlling their own data. We introduce Interactive Machine Unlearning (IMU), a new paradigm in which users can instruct LLMs to forget targeted knowledge through natural language at inference time. To realize IMU, we propose RePAIR, a prompt-aware model repair framework comprising (i) a watchdog model for unlearning intent detection, (ii) a surgeon model for generating repair procedures, and (iii) a patient model whose parameters are updated autonomously. At the core of RePAIR, we develop Steering Through Activation Manipulation with PseudoInverse (STAMP), a training-free, single-sample unlearning method that redirects MLP activations toward a refusal subspace via closed-form pseudoinverse updates. Its low-rank variant reduces computational complexity from O(d^3) to O(r^3 + r^2 * d), enabling efficient on-device unlearning with up to ~3x speedup over training-based baselines. Extensive experiments across harmful knowledge suppression, misinformation correction, and personal data erasure demonstrate that RePAIR achieves near-zero forget scores (Acc_f = 0.00, F-RL = 0.00) while preserving model utility (Acc_r up to 84.47, R-RL up to 0.88), outperforming six state-of-the-art baselines. These results establish RePAIR as an effective and practical framework for user-driven model editing, advancing transparent and on-device control over learned knowledge, with potential extensions to multimodal foundation models.
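The closed-form update the abstract describes can be sketched in a few lines. The following is a minimal NumPy illustration of a pseudoinverse-based edit, not the paper's implementation: all matrices, the refusal direction, and the dimensions are illustrative stand-ins. Given activations X entering an MLP projection for a forget prompt and a refusal target T, the minimum-norm correction that maps X onto T is dW = (T − W X) X⁺.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 16                       # hidden width, number of forget-prompt activations

W = rng.standard_normal((d, d))     # stand-in for an MLP projection matrix
X = rng.standard_normal((d, n))     # activations entering the layer for the forget prompt
refusal = rng.standard_normal((d, 1))
T = np.tile(refusal, (1, n))        # desired outputs: steer every column to the refusal direction

# Closed-form repair: choose dW so that (W + dW) @ X == T, i.e.
# dW = (T - W @ X) @ pinv(X). The Moore-Penrose pseudoinverse gives
# the minimum-norm solution, leaving behavior off the span of X unchanged
# as much as possible.
dW = (T - W @ X) @ np.linalg.pinv(X)
W_new = W + dW

assert np.allclose(W_new @ X, T)    # forget-prompt activations now land on the refusal target
```

Because X has full column rank here (n < d), X⁺X = I, so the repaired weights reproduce T on the forget prompt exactly while the update itself has rank at most n.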


Key Contributions

  • Interactive Machine Unlearning (IMU) paradigm enabling users to instruct LLMs to forget knowledge via natural language prompts
  • STAMP (Steering Through Activation Manipulation with PseudoInverse): training-free, single-sample unlearning via closed-form MLP activation redirection
  • Low-rank variant reducing computational complexity from O(d^3) to O(r^3 + r^2*d) with ~3x speedup for on-device unlearning
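The complexity reduction in the low-rank variant can be sketched as follows. This is an assumed reconstruction of the idea, not the paper's algorithm: both the forget activations X and the residual R = T − W X live in a subspace of rank r ≪ d, so the pseudoinverse can be computed over r-dimensional coordinates instead of d-dimensional ones, which is where the O(r^3 + r^2*d) cost comes from.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 512, 8                                      # hidden width >> number of samples

W = rng.standard_normal((d, d))
X = rng.standard_normal((d, n))
T = np.tile(rng.standard_normal((d, 1)), (1, n))   # refusal targets
R = T - W @ X                                      # residual the update must absorb

# Full-rank closed form manipulates d x d quantities (O(d^3)).
# Low-rank sketch: X and R together span at most r = 2n dimensions,
# so build an orthonormal basis U for that subspace and take the
# pseudoinverse over r-dimensional coordinates only.
U, _, _ = np.linalg.svd(np.hstack([X, R]), full_matrices=False)
r = U.shape[1]                                     # here r = 2n = 16, versus d = 512

Xr = U.T @ X                                       # r x n coordinates of X in the subspace
dW = (R @ np.linalg.pinv(Xr)) @ U.T                # rank-r correction lifted back to d x d

assert np.allclose((W + dW) @ X, T)                # same repair, small-matrix pseudoinverse
```

The expensive step, pinv(Xr), now operates on an r x n matrix, and the remaining products scale linearly in d, matching the claimed O(r^3 + r^2*d) instead of O(d^3).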

🛡️ Threat Analysis

Model Inversion Attack

The paper's primary focus is erasing private training data (personal information, harmful knowledge) from LLMs, with quantitative evaluation via forget scores (Acc_f, F-RL); this addresses training-data extraction and memorization threats.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, training_time
Applications
harmful knowledge suppression, misinformation correction, personal data erasure