Exact Certification of Neural Networks and Partition Aggregation Ensembles against Label Poisoning

Label-flipping attacks, which corrupt training labels to induce misclassifications at inference, remain a major threat to supervised learning models. This drives the need for robustness certificates that provide formal guarantees about a model's robustness under adversarially corrupted labels. Existing certification frameworks rely on ensemble techniques such as smoothing or partition-aggregation, but treat the corresponding base classifiers as black boxes, yielding overly conservative guarantees. We introduce EnsembleCert, the first certification framework for partition-aggregation ensembles that utilizes white-box knowledge of the base classifiers. Concretely, EnsembleCert yields tighter guarantees than black-box approaches by aggregating per-partition white-box certificates to compute ensemble-level guarantees in polynomial time. To extract white-box knowledge from the base classifiers efficiently, we develop ScaLabelCert, a method that leverages the equivalence between sufficiently wide neural networks and kernel methods using the neural tangent kernel. ScaLabelCert yields the first exact, polynomial-time computable certificate for neural networks against label-flipping attacks. EnsembleCert either matches or significantly outperforms the existing partition-based black-box certificates. For example, on CIFAR-10, our method can certify up to 26.5% more label flips in median over the test set than the existing black-box approach while requiring 100 times fewer partitions, thereby challenging the prevailing notion that heavy partitioning is necessary for strong certified robustness.
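To make the black-box versus white-box distinction concrete, the following is a minimal sketch and not the EnsembleCert algorithm itself. It assumes disjoint partitions, so each flipped label affects only one base classifier. It also assumes plurality voting with ties broken toward the smaller class index. Each base classifier is assumed to come with a per-partition radius: the number of label flips in its partition that provably leave its prediction unchanged. The function names and the simple "cheapest partitions first" bound are illustrative assumptions.

```python
from collections import Counter


def black_box_radius(votes, num_classes):
    """Return (predicted class, number of partition votes an adversary
    may change without altering the plurality prediction).
    Ties are broken toward the smaller class index (an assumption)."""
    counts = Counter(votes)
    top = min(range(num_classes), key=lambda c: (-counts[c], c))
    # Smallest vote gap to any other class, accounting for tie-breaking.
    gap = min(
        counts[top] - counts[c] - (1 if c < top else 0)
        for c in range(num_classes)
        if c != top
    )
    # Moving one vote from `top` to a rival closes the gap by 2.
    return top, gap // 2


def white_box_flips(votes, per_partition_radius, num_classes):
    """Hypothetical aggregation: a sound lower bound on the number of
    label flips the ensemble prediction tolerates, given that changing
    partition i's vote costs at least per_partition_radius[i] + 1 flips."""
    top, r = black_box_radius(votes, num_classes)
    costs = sorted(ri + 1 for ri in per_partition_radius)
    # The adversary must change at least r + 1 votes; the cheapest way
    # to do so costs the sum of the r + 1 smallest per-partition costs.
    return top, sum(costs[: r + 1]) - 1
```

When every per-partition radius is 0, the white-box bound reduces to the black-box one, since each vote then costs one flip to change. Any partition that tolerates more flips raises the ensemble-level bound, which is the intuition behind using white-box knowledge of the base classifiers.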

Key Contributions

EnsembleCert: first white-box certification framework for partition-aggregation ensembles against label poisoning
ScaLabelCert: first exact, polynomial-time certificate for neural networks against label-flipping using neural tangent kernels
Certifies up to 26.5% more label flips in median on CIFAR-10 with 100x fewer partitions than the black-box baseline

🛡️ Threat Analysis

Data Poisoning Attack

Label-flipping attacks are a classic form of data poisoning in which an adversary corrupts training labels to degrade model performance. The paper proposes certification methods, a form of defense, that provide formal robustness guarantees against such poisoning attacks.