Defense · 2025

Machine Unlearning Meets Adversarial Robustness via Constrained Interventions on LLMs

Fatmazohra Rezkellah, Ramzi Dakhmouche

2 citations · 24 references · arXiv

Published on arXiv · 2510.03567

Prompt Injection (OWASP LLM Top 10: LLM01)

Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)

Key Finding

Pointwise constraint-based weight intervention outperforms max-min adversarial formulations and state-of-the-art jailbreak defenses while incurring lower computational cost.

Novel Technique Introduced

Constrained Model Interventions (TSR — Towards Safer Regions)


With the increasing adoption of Large Language Models (LLMs), more customization is needed to ensure privacy-preserving and safe generation. We address this objective from two critical aspects: unlearning of sensitive information and robustness to jailbreaking attacks. We investigate various constrained optimization formulations that address both aspects in a unified manner, by finding the smallest possible interventions on LLM weights that either make a given vocabulary set unreachable or endow the LLM with robustness to tailored attacks by shifting part of the weights to a safer region. Beyond unifying two key properties, this approach contrasts with previous work in that it does not require an oracle classifier, which is typically unavailable or represents a computational overhead. Surprisingly, we find that the simplest point-wise constraint-based intervention we propose leads to better performance than max-min interventions, while having a lower computational cost. Comparison against state-of-the-art defense methods demonstrates the superior performance of the proposed approach.


Key Contributions

  • Unified constrained optimization framework that simultaneously addresses LLM jailbreak robustness and sensitive content unlearning via minimal weight perturbations
  • Elimination of the need for oracle classifiers by introducing a continuous relaxation over prompt space with direct concept embedding constraints
  • Demonstration that a simple point-wise constraint intervention outperforms max-min adversarial formulations and state-of-the-art defenses at lower computational cost
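As an illustrative sketch of the point-wise constraint idea (this is not the paper's code: the toy dimensions, the restriction to an unembedding matrix, the hinge-penalty solver, and all variable names are assumptions), one can look for the smallest weight perturbation that pushes the logits of a forbidden vocabulary set below a threshold on a set of representative hidden states:

```python
import numpy as np

# Toy setup (all sizes and names are illustrative assumptions).
rng = np.random.default_rng(0)
d, vocab = 16, 32
W = rng.normal(size=(vocab, d))   # unembedding weights (rows = tokens)
H = rng.normal(size=(8, d))       # representative hidden states
forbidden = [3, 7]                # token ids to make "unreachable"
tau = -2.0                        # cap on forbidden-token logits

# Minimize 0.5*||delta||^2 subject to the pointwise constraints
#   (H @ (W + delta).T)[i, t] <= tau  for every state i and t in forbidden,
# solved here with a simple quadratic (hinge) penalty and gradient descent.
delta = np.zeros_like(W)
lr, lam = 1e-3, 10.0
for _ in range(3000):
    logits = H @ (W + delta).T                          # (states, vocab)
    viol = np.maximum(logits[:, forbidden] - tau, 0.0)  # constraint violations
    grad = delta.copy()                                 # d/d delta of 0.5*||delta||^2
    grad[forbidden] += lam * viol.T @ H                 # penalty gradient, forbidden rows only
    delta -= lr * grad

new_logits = H @ (W + delta).T
print(float(np.max(new_logits[:, forbidden])))  # max forbidden logit, driven toward tau
```

Note that only the rows of `delta` belonging to the forbidden tokens ever receive a nonzero gradient, so the intervention stays minimal by construction: logits of all other tokens are untouched. The max-min formulations the paper compares against would instead optimize against a worst-case adversarial prompt, which is more expensive.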

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, training_time
Applications
llm safety, content moderation, sensitive information unlearning