
Improving the Sensitivity of Backdoor Detectors via Class Subspace Orthogonalization

Guangmingmei Yang 1,2, David J. Miller 1, George Kesidis 1


Published on arXiv: 2512.08129

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

CSO consistently improves the sensitivity of multiple well-known backdoor detectors against subtle and mixed-label attacks, and retains detection capability even against adaptive attacks designed to defeat it.

CSO (Class Subspace Orthogonalization)

Novel technique introduced


Most post-training backdoor detection methods rely on attacked models exhibiting extreme outlier detection statistics for the target class of an attack, compared to non-target classes. However, these approaches may fail: (1) when some (non-target) classes are easily discriminable from all others, in which case they may naturally achieve extreme detection statistics (e.g., decision confidence); and (2) when the backdoor is subtle, i.e., with its features weak relative to intrinsic class-discriminative features. A key observation is that the backdoor target class has contributions to its detection statistic from both the backdoor trigger and from its intrinsic features, whereas non-target classes only have contributions from their intrinsic features. To achieve more sensitive detectors, we thus propose to suppress intrinsic features while optimizing the detection statistic for a given class. For non-target classes, such suppression will drastically reduce the achievable statistic, whereas for the target class the (significant) contribution from the backdoor trigger remains. In practice, we formulate a constrained optimization problem, leveraging a small set of clean examples from a given class, and optimizing the detection statistic while orthogonalizing with respect to the class's intrinsic features. We dub this plug-and-play approach Class Subspace Orthogonalization (CSO) and assess it against challenging mixed-label and adaptive attacks.
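The orthogonalization idea in the abstract can be illustrated with a small numerical sketch: build a rank-k "intrinsic feature" subspace for a class from a few clean examples' internal-layer features, then project a candidate response onto the orthogonal complement of that subspace before scoring it. All names, the norm-based statistic, and the rank-k choice here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def class_subspace(clean_features: np.ndarray, k: int) -> np.ndarray:
    """Top-k right singular vectors of the (n_examples x d) clean-feature
    matrix span an estimate of the class's intrinsic subspace."""
    _, _, vt = np.linalg.svd(clean_features - clean_features.mean(0),
                             full_matrices=False)
    return vt[:k]  # (k, d) orthonormal basis

def orthogonalized_statistic(feature_response: np.ndarray,
                             basis: np.ndarray) -> float:
    """Remove the component of the response lying in the class subspace,
    then score the remainder (the norm is a stand-in for a
    detector-specific statistic such as decision confidence)."""
    residual = feature_response - basis.T @ (basis @ feature_response)
    return float(np.linalg.norm(residual))

rng = np.random.default_rng(0)
clean = rng.normal(size=(32, 64))   # 32 clean examples, 64-dim features
basis = class_subspace(clean, k=8)

# A response aligned with the class subspace is suppressed (non-target
# classes), while a response outside it (a trigger-like signal) survives.
aligned = 10.0 * basis[0]
novel = rng.normal(size=64)
print(orthogonalized_statistic(aligned, basis))  # near zero
print(orthogonalized_statistic(novel, basis))    # remains large
```

In the paper's setting this projection would be folded into the detector's optimization as a constraint, so that maximizing the detection statistic for a class cannot exploit that class's intrinsic discriminative features.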


Key Contributions

  • Class Subspace Orthogonalization (CSO): a constrained optimization framework that suppresses intrinsic class features when computing backdoor detection statistics, isolating the backdoor trigger's contribution and improving detector sensitivity
  • A novel mixed clean/dirty-label poisoning attack that is more surgical and harder to detect than traditional dirty-label backdoor attacks, used as a challenging evaluation benchmark
  • Plug-and-play integration with multiple existing post-training backdoor detectors (e.g., NC, MMBD, UNICORN), with evaluation on CIFAR-10, GTSRB, and TinyImageNet including adaptive attack scenarios

🛡️ Threat Analysis

Model Poisoning

The paper's primary contribution is a defense against backdoor/trojan attacks — specifically a post-training detection method (CSO) that improves the sensitivity of existing backdoor detectors by orthogonalizing out intrinsic class features, so that only the backdoor trigger's signal remains for the target class.


Details

Domains
vision
Model Types
cnn
Threat Tags
training_time · targeted · digital · grey_box
Datasets
CIFAR-10 · GTSRB · TinyImageNet
Applications
image classification