RPP: A Certified Poisoned-Sample Detection Framework for Backdoor Attacks under Dataset Imbalance
Miao Lin, Feng Yu, Rui Ning, Lusi Li, Jiawei Chen, Qian Lou, Mengxin Zheng, Chunsheng Xin, Hongyi Wu
Published on arXiv
2602.00183
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
RPP achieves significantly higher backdoor detection accuracy than 12 state-of-the-art defenses across 10 attack types, with certified guarantees on false positive rate under class-imbalanced conditions.
RPP (Randomized Probability Perturbation)
Novel technique introduced
Deep neural networks are highly susceptible to backdoor attacks, yet most defense methods to date rely on balanced data, overlooking the pervasive class imbalance in real-world scenarios that can amplify backdoor threats. This paper presents the first in-depth investigation of how dataset imbalance amplifies backdoor vulnerability, showing that (i) imbalance induces a majority-class bias that increases susceptibility and (ii) conventional defenses degrade significantly as the imbalance grows. To address this, we propose Randomized Probability Perturbation (RPP), a certified poisoned-sample detection framework that operates in a black-box setting using only model output probabilities. For any inspected sample, RPP determines whether the input has been backdoor-manipulated, while offering provable within-domain detectability guarantees and a probabilistic upper bound on the false positive rate. Extensive experiments on five benchmarks (MNIST, SVHN, CIFAR-10, TinyImageNet, and ImageNet10) covering 10 backdoor attacks and 12 baseline defenses show that RPP achieves significantly higher detection accuracy than state-of-the-art defenses, particularly under dataset imbalance. RPP establishes a theoretical and practical foundation for defending against backdoor attacks in real-world environments with imbalanced data.
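The abstract does not reproduce the RPP algorithm itself, but the core idea of detecting backdoored inputs from black-box output probabilities under random perturbation can be sketched. The snippet below is an illustrative toy, not the paper's method: `predict_proba` stands in for an arbitrary black-box classifier, and the names `rpp_score`, `sigma`, and `tau` are hypothetical parameters. The intuition it demonstrates is that a trigger-forced prediction tends to stay fixed under input noise, while a clean prediction flips more readily.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_proba(x):
    # Stand-in for the black-box model's output probabilities.
    # In this toy, a "triggered" input (large feature sum) yields a
    # near-deterministic target-class probability; a clean input sits
    # near the decision boundary and is far less confident.
    logits = np.array([x.sum(), 1.0 - x.sum(), 0.0])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def rpp_score(x, n_trials=100, sigma=0.1):
    """Fraction of random perturbations that leave the top-1 class
    unchanged. A trigger-forced prediction tends to survive noise,
    so a score near 1.0 is suspicious."""
    base = np.argmax(predict_proba(x))
    agree = 0
    for _ in range(n_trials):
        noisy = x + rng.normal(0.0, sigma, size=x.shape)
        if np.argmax(predict_proba(noisy)) == base:
            agree += 1
    return agree / n_trials

def is_poisoned(x, tau=0.95):
    # Flag the sample if its prediction is suspiciously stable.
    # tau is a hypothetical threshold; the paper instead derives a
    # certified bound on the false positive rate.
    return rpp_score(x) >= tau
```

In this sketch a high agreement rate under noise is treated as evidence of a trigger; the actual RPP framework additionally provides a provable detectability guarantee and a probabilistic upper bound on the false positive rate, which this toy threshold does not.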
Key Contributions
- First in-depth analysis showing dataset imbalance amplifies backdoor vulnerability via majority-class bias and degrades conventional defenses
- RPP: a certified poisoned-sample detection framework using only black-box model output probabilities, with provable within-domain detectability guarantees and a probabilistic false positive rate upper bound
- Extensive evaluation across 5 benchmarks, 10 backdoor attacks, and 12 baseline defenses demonstrating superior detection accuracy especially under imbalanced settings
🛡️ Threat Analysis
RPP is a backdoor/trojan defense: it detects whether inference-time inputs have been manipulated with a backdoor trigger, and it is evaluated against 10 distinct backdoor attacks with certified detectability guarantees.