Defense · 2025

Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

Fatmazohra Rezkellah, Ramzi Dakhmouche

2 citations · 24 references · arXiv

Published on arXiv · 2510.03567

Prompt Injection (OWASP LLM Top 10: LLM01)

Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)

Key Finding

Pointwise constraint-based weight intervention outperforms max-min adversarial formulations and state-of-the-art jailbreak defenses while incurring lower computational cost.

Novel Technique Introduced

Constrained Model Interventions (TSR — Towards Safer Regions)


With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jailbreaking attacks. We investigate various constrained optimization formulations that address both aspects in a unified manner, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or endow the LLM with robustness to tailored attacks by shifting part of the weights to a safer region. Beyond unifying two key properties, this approach contrasts with previous work in that it does not require an oracle classifier, which is typically unavailable or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates the superior performance of the proposed approach.


Key Contributions

  • Unified constrained optimization framework that simultaneously addresses LLM jailbreak robustness and sensitive content unlearning via minimal weight perturbations
  • Elimination of the need for oracle classifiers by introducing a continuous relaxation over prompt space with direct concept embedding constraints
  • Demonstration that a simple point-wise constraint intervention outperforms max-min adversarial formulations and state-of-the-art defenses at lower computational cost
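As an illustrative sketch of the point-wise constraint idea (this is not the paper's code: the toy dimensions, the restriction to an unembedding matrix, the hinge-penalty solver, and all variable names are assumptions), one can look for the smallest weight perturbation that pushes the logits of a forbidden vocabulary set below a threshold on a set of representative hidden states:

```python
import numpy as np

# Toy setup (all sizes and names are illustrative assumptions).
rng = np.random.default_rng(0)
d, vocab = 16, 32
W = rng.normal(size=(vocab, d))   # unembedding weights (rows = tokens)
H = rng.normal(size=(8, d))       # representative hidden states
forbidden = [3, 7]                # token ids to make "unreachable"
tau = -2.0                        # cap on forbidden-token logits

# Minimize 0.5*||delta||^2 subject to the pointwise constraints
#   (H @ (W + delta).T)[i, t] <= tau  for every state i and t in forbidden,
# solved here with a simple quadratic (hinge) penalty and gradient descent.
delta = np.zeros_like(W)
lr, lam = 1e-3, 10.0
for _ in range(3000):
    logits = H @ (W + delta).T                          # (states, vocab)
    viol = np.maximum(logits[:, forbidden] - tau, 0.0)  # constraint violations
    grad = delta.copy()                                 # d/d delta of 0.5*||delta||^2
    grad[forbidden] += lam * viol.T @ H                 # penalty gradient, forbidden rows only
    delta -= lr * grad

new_logits = H @ (W + delta).T
print(float(np.max(new_logits[:, forbidden])))  # max forbidden logit, driven toward tau
```

Note that only the rows of `delta` belonging to the forbidden tokens ever receive a nonzero gradient, so the intervention stays minimal by construction: logits of all other tokens are untouched. The max-min formulations the paper compares against would instead optimize against a worst-case adversarial prompt, which is more expensive.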

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, training_time
Applications
llm safety, content moderation, sensitive information unlearning