Latest papers

7 papers
defense arXiv Mar 3, 2026 · 4w ago

Conditioned Activation Transport for T2I Safety Steering

Maciej Chrabąszcz, Aleksander Szymczyk, Jan Dubiński et al. · NASK National Research Institute · Warsaw University of Technology +3 more

Proposes conditioned activation transport to steer T2I model activations away from unsafe regions while preserving image quality

Prompt Injection visionmultimodalgenerative
PDF Code
attack arXiv Feb 5, 2026 · 8w ago

REBEL: Hidden Knowledge Recovery via Evolutionary-Based Evaluation Loop

Patryk Rybak, Paweł Batorski, Paul Swoboda et al. · Jagiellonian University · Heinrich Heine Universität Düsseldorf +1 more

Evolutionary adversarial prompting attack recovers supposedly forgotten training data from unlearned LLMs with up to 93% success rate

Model Inversion Attack Sensitive Information Disclosure nlp
PDF Code
attack arXiv Jan 30, 2026 · 9w ago

ReLAPSe: Reinforcement-Learning-trained Adversarial Prompt Search for Erased concepts in unlearned diffusion models

Ignacy Kolton, Kacper Marzol, Paweł Batorski et al. · Jagiellonian University · Heinrich Heine Universität Düsseldorf +1 more

RL-trained adversarial prompt policy recovers erased concepts from unlearned diffusion models at near-real-time speed

Input Manipulation Attack generative
PDF Code
attack arXiv Dec 10, 2025 · Dec 2025

Membership and Dataset Inference Attacks on Large Audio Generative Models

Jakub Proboszcz, Paweł Kochanski, Karol Korszun et al. · Warsaw University of Technology · Sapienza University of Rome +2 more

Extends dataset inference attacks to audio generative models, showing DI succeeds at copyright verification where single-sample MIA fails

Membership Inference Attack audiogenerative
PDF
attack arXiv Nov 10, 2025 · Nov 2025

On Stealing Graph Neural Network Models

Marcin Podhajski, Jan Dubiński, Franziska Boenisch et al. · Polish Academy of Sciences · IDEAS NCBR +5 more

Steals GNN models with as few as 100 queries by decoupling query-free backbone extraction from strategic head extraction

Model Theft graph
PDF Code
defense arXiv Oct 9, 2025 · Oct 2025

Backdoor Vectors: a Task Arithmetic View on Backdoor Attacks and Defenses

Stanisław Pawlak, Jan Dubiński, Daniel Marczak et al. · Warsaw University of Technology · NASK National Research Institute +3 more

Proposes Backdoor Vectors to unify backdoor attacks in model merging, plus stronger SBV attack and assumption-free IBVS defense

Model Poisoning visionmultimodal
PDF
attack arXiv Sep 26, 2025 · Sep 2025

Memory Self-Regeneration: Uncovering Hidden Knowledge in Unlearned Models

Agnieszka Polowczyk, Alicja Polowczyk, Joanna Waczyńska et al. · Silesian University of Technology · Jagiellonian University +1 more

Attacks machine unlearning in text-to-image diffusion models via LoRA fine-tuning, recovering supposedly erased harmful concepts with few reference images

Model Inversion Attack visiongenerative
1 citations PDF Code