defense 2026

When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining

Zhihao Li 1, Gezheng Xu 1, Jiale Cai 1, Ruiyi Fang 1, Di Wu 2, Qicheng Lao 3, Charles Ling 1,4, Boyu Wang 1,4


Published on arXiv

2603.04731

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

Existing unlearnable example methods are effectively bypassed by pretrained backbones; BAIT's mislabel-perturbation binding restores unlearnability across multiple pretrained architectures where prior methods fail.

BAIT (Binding Artificial perturbations to Incorrect Targets)

Novel technique introduced


Unlearnable Examples (UEs) are a data protection strategy that adds imperceptible perturbations to mislead models into learning spurious correlations instead of the underlying semantics. In this paper, we uncover a fundamental vulnerability of UEs that emerges when learning starts from a pretrained model. Crucially, our empirical analysis shows that even when data are protected by carefully crafted perturbations, pretraining priors still furnish rich semantic representations that allow the model to circumvent the shortcuts introduced by UEs and capture genuine features, thereby nullifying unlearnability. To address this, we propose BAIT (Binding Artificial perturbations to Incorrect Targets), a novel bi-level optimization formulation. Specifically, the inner level associates the perturbed samples with their real labels to simulate standard data-label alignment, while the outer level actively disrupts this alignment by enforcing a mislabel-perturbation binding that maps samples to designated incorrect targets. This mechanism overrides the semantic guidance of the priors, forcing the model to rely on the injected perturbations and consequently preventing the acquisition of true semantics. Extensive experiments on standard benchmarks and multiple pretrained backbones demonstrate that BAIT effectively mitigates the influence of pretraining priors and maintains data unlearnability.
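The bi-level formulation in the abstract can be sketched on a toy linear softmax classifier: the inner loop fits model weights to the perturbed data with real labels, and the outer loop updates the perturbation so the fitted model maps samples to designated incorrect targets. This is an illustrative approximation only, not the paper's implementation; the function name `bait_sketch`, the perturbation budget `eps`, the step counts, and the learning rate are all assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grads(X, W, Y_onehot):
    """Cross-entropy gradients for a linear softmax classifier:
    returns (grad w.r.t. weights, grad w.r.t. inputs)."""
    P = softmax(X @ W)
    gW = X.T @ (P - Y_onehot) / len(X)
    gX = (P - Y_onehot) @ W.T / len(X)
    return gW, gX

def bait_sketch(X, y_true, y_wrong, n_classes, eps=0.1,
                outer_steps=30, inner_steps=20, lr=0.5):
    """Toy bi-level sketch of the BAIT idea (assumed hyperparameters).

    Inner level: train W on (X + delta, y_true) to simulate standard
    data-label alignment.  Outer level: move delta so the trained W
    predicts the designated incorrect targets y_wrong instead."""
    rng = np.random.default_rng(0)
    Yt = np.eye(n_classes)[y_true]    # real labels
    Yw = np.eye(n_classes)[y_wrong]   # designated incorrect targets
    delta = np.zeros_like(X)
    for _ in range(outer_steps):
        # Inner level: standard training on the perturbed data.
        W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
        for _ in range(inner_steps):
            gW, _ = ce_grads(X + delta, W, Yt)
            W -= lr * gW
        # Outer level: bind the perturbation to the incorrect targets,
        # keeping it inside an eps-ball so it stays imperceptible.
        _, gX = ce_grads(X + delta, W, Yw)
        delta = np.clip(delta - lr * gX, -eps, eps)
    return delta
```

A linear model stands in for the surrogate network here purely to keep the sketch self-contained; the mechanism (inner fit to real labels, outer push toward incorrect targets, projection onto an imperceptibility budget) is the part that mirrors the description above.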


Key Contributions

  • Empirical discovery that pretrained priors allow models to circumvent existing Unlearnable Example protections by recovering genuine semantic representations
  • BAIT: a bi-level optimization framework that binds perturbations to designated incorrect targets, overriding pretraining priors and restoring data unlearnability
  • Extensive validation across standard benchmarks and multiple pretrained backbones demonstrating that BAIT maintains data protection where prior UE methods fail

🛡️ Threat Analysis

Data Poisoning Attack

Unlearnable Examples are fundamentally an availability-based data poisoning technique — they corrupt training data so that models learn only spurious shortcuts instead of true semantics. BAIT is a stronger poisoning formulation (mislabel-perturbation binding via bi-level optimization) designed to ensure this corruption persists when an unauthorized party fine-tunes a pretrained backbone on the protected data.


Details

Domains
vision
Model Types
cnn, transformer
Threat Tags
training_time, white_box
Datasets
CIFAR-10, ImageNet
Applications
image classification, data protection against unauthorized training