Latest papers

5 papers
attack · arXiv · Feb 5, 2026

REBEL: Hidden Knowledge Recovery via Evolutionary-Based Evaluation Loop

Patryk Rybak, Paweł Batorski, Paul Swoboda et al. · Jagiellonian University · Heinrich Heine Universität Düsseldorf +1 more

Evolutionary adversarial prompting attack recovers supposedly forgotten training data from unlearned LLMs with up to 93% success rate

Model Inversion Attack · Sensitive Information Disclosure · NLP
PDF · Code
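The evolutionary-loop idea in the REBEL summary can be sketched as a (1 + λ) search over prompts: mutate the current best prompt, score how much hidden content each mutant elicits, and keep the top scorer. Everything below is an illustrative stand-in, not the paper's method: the vocabulary, the secret tokens, and the `leakage_score` fitness (token overlap with a known target) are toy placeholders for querying a real unlearned LLM.

```python
import random

# Toy stand-in for "forgotten" training data the attack tries to elicit.
SECRET_TOKENS = {"project", "falcon", "launch", "code"}

def leakage_score(prompt):
    # Toy fitness: overlap between the prompt and the hidden target tokens.
    # A real attack would instead score the unlearned model's response.
    return len(set(prompt.split()) & SECRET_TOKENS)

VOCAB = ["project", "falcon", "launch", "code", "the", "a", "tell", "me"]

def mutate(prompt, rng):
    # Replace one random token with a random vocabulary token.
    tokens = prompt.split()
    tokens[rng.randrange(len(tokens))] = rng.choice(VOCAB)
    return " ".join(tokens)

def evolve(seed, generations=200, children_per_gen=8, rng=None):
    # (1 + lambda) evolutionary loop: keep the current best prompt, spawn
    # mutated children, promote the best child if it ties or improves.
    rng = rng or random.Random(0)
    best = seed
    for _ in range(generations):
        candidate = max((mutate(best, rng) for _ in range(children_per_gen)),
                        key=leakage_score)
        if leakage_score(candidate) >= leakage_score(best):
            best = candidate
    return best

best = evolve("tell me about the model")
print(best, leakage_score(best))
```

The elitist acceptance rule makes the score monotonically non-decreasing across generations, which is why even this crude mutation operator reliably climbs toward high-leakage prompts.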
attack · arXiv · Jan 30, 2026

ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

Ignacy Kolton, Kacper Marzol, Paweł Batorski et al. · Jagiellonian University · Heinrich Heine Universität Düsseldorf +1 more

RL-trained adversarial prompt policy recovers erased concepts from unlearned diffusion models at near-real-time speed

Input Manipulation Attack · Generative
PDF · Code
defense · arXiv · Nov 21, 2025

AEGIS: Preserving privacy of 3D Facial Avatars with Adversarial Perturbations

Dawid Wolkiewicz, Anastasiya Pechko, Przemysław Spurek et al. · Wroclaw University of Science and Technology · Jagiellonian University

Defends 3D facial avatar identity by applying viewpoint-consistent PGD adversarial perturbations that reduce face verification accuracy to 0%

Input Manipulation Attack · Vision · Generative
PDF · Code
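The PGD mechanism named in the AEGIS summary can be sketched in a few lines: take signed gradient steps that push a verification score down, and project the perturbation back into an L-infinity ball after each step. The "model" below is a toy logistic score on a flattened vector, not the paper's 3D-avatar rendering pipeline; the budget `eps`, step size `alpha`, and step count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a face-verification model: a logistic score on a
# 64-dimensional input vector (illustrative, not AEGIS's verifier).
w = rng.normal(size=64)
b = 0.0

def verify_score(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def pgd_hide(x, eps=1.0, alpha=0.05, steps=40):
    """Projected gradient descent that drives the verification score down
    while keeping the perturbation inside an L-infinity ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        s = verify_score(x_adv)
        grad = s * (1.0 - s) * w                  # d(score)/dx for the logistic model
        x_adv = x_adv - alpha * np.sign(grad)     # signed step: descend the score
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the budget
    return x_adv

x = rng.normal(size=64)
x_adv = pgd_hide(x)
print(verify_score(x), "->", verify_score(x_adv))
```

The projection step is what keeps the perturbation bounded, so the modified input stays close to the original while the verifier's confidence collapses, which is the effect the AEGIS summary reports on face verification.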
attack · arXiv · Sep 26, 2025

Memory Self-Regeneration: Uncovering Hidden Knowledge in Unlearned Models

Agnieszka Polowczyk, Alicja Polowczyk, Joanna Waczyńska et al. · Silesian University of Technology · Jagiellonian University +1 more

Attacks machine unlearning in text-to-image diffusion models via LoRA fine-tuning, recovering supposedly erased harmful concepts with few reference images

Model Inversion Attack · Vision · Generative
1 citation · PDF · Code
defense · arXiv · Sep 15, 2025

Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Filip Sondej, Yushi Yang · Jagiellonian University · University of Oxford

Proposes CIR to robustly remove bio/cyber-hazardous knowledge from LLMs, resisting adversarial fine-tuning and jailbreak recovery attacks

Prompt Injection · NLP
PDF · Code