
Safety-Efficacy Trade Off: Robustness against Data-Poisoning

Diego Granziol

0 citations · 41 references · arXiv (Cornell University)


Published on arXiv · arXiv:2602.00822

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

Identifies a near-clone regime in which backdoor attacks sustain order-one attack success while inducing zero spectral curvature signature, proving such attacks are inherently invisible to spectral defenses; input-gradient regularization can suppress them, but only at the price of a fundamental reduction in data-fitting capacity.

Input-Gradient Regularization

Novel technique introduced


Backdoor and data poisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental, but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order-one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input-gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety-efficacy trade-off by reducing data-fitting capacity. For exponential kernels, this defence admits a precise interpretation as an anisotropic high-pass filter that increases the effective length scale and suppresses near-clone poisons. Extensive experiments on linear models and deep convolutional networks across MNIST, CIFAR-10, and CIFAR-100 validate the theory, demonstrating consistent lags between attack success and spectral visibility, and showing that regularisation and data augmentation jointly suppress poisoning. Our results establish when backdoors are inherently invisible, and provide the first end-to-end characterisation of poisoning, detectability, and defence through input-space curvature.
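The safety-efficacy trade-off is easiest to see in the linear case the paper's experiments include: for a linear model f(x) = w·x, the input gradient of f is just w, so penalising the input-gradient norm reduces to ridge-style shrinkage of the weights, trading data fit for flatter input response. A minimal numpy sketch (the data, step sizes, and penalty weight below are illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data standing in for a poisoned training set.
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

def fit(X, y, lam, steps=2000, lr=0.05):
    """Gradient descent on MSE + lam * ||grad_x f||^2.

    For a linear model f(x) = w @ x the input gradient is w itself,
    so the penalty is exactly lam * ||w||^2 (ridge-style shrinkage).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        resid = X @ w - y
        grad = (2 / n) * X.T @ resid + 2 * lam * w
        w -= lr * grad
    return w

w_plain = fit(X, y, lam=0.0)
w_reg = fit(X, y, lam=1.0)

# The regularised solution has smaller input gradients, at the cost
# of a worse fit to the training data -- the safety-efficacy trade-off.
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```

For deep nonlinear models the input gradient is input-dependent and the penalty no longer collapses to weight decay, but the same contraction of poison-aligned directions is what the paper proves under gradient flow.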


Key Contributions

  • Proves that clustered dirty-label poisons induce a rank-one spike in the input Hessian scaling quadratically with attack efficacy, grounding spectral detectability in input-space geometry
  • Identifies a 'near-clone regime' for nonlinear kernels where backdoor efficacy remains order-one while the induced input curvature vanishes, making attacks provably spectrally undetectable
  • Shows input-gradient regularization provably contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, establishing an explicit and unavoidable safety-efficacy trade-off
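The first contribution's spectral signature can be illustrated with a toy curvature matrix: a well-conditioned "clean" Hessian plus a rank-one poison-aligned term whose magnitude scales quadratically with an efficacy parameter eps. This is a hedged sketch of the claimed spike structure, not the paper's kernel derivation; the dimension and matrices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50

# Well-conditioned "clean" input curvature (illustrative stand-in).
A = rng.normal(size=(d, d)) / np.sqrt(d)
H_clean = A @ A.T + np.eye(d)

v = rng.normal(size=d)
v /= np.linalg.norm(v)  # unit poison-aligned direction

def top_eig(H):
    """Largest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(H)[-1]

# The rank-one spike eps^2 * v v^T grows quadratically with efficacy eps,
# eventually separating cleanly from the bulk spectrum.
for eps in (0.0, 1.0, 2.0, 4.0):
    H = H_clean + eps**2 * np.outer(v, v)
    print(eps, round(top_eig(H), 2))
```

In the near-clone regime the paper identifies, the analogue of eps^2 vanishes while attack success stays order-one, which is precisely why a defence that watches only this top eigenvalue sees nothing.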

🛡️ Threat Analysis

Data Poisoning Attack

Paper studies dirty-label data poisoning attacks that corrupt training data, deriving closed-form laws for their effect on input-space curvature and showing regularization and augmentation jointly suppress poisoning.

Model Poisoning

Paper explicitly analyzes backdoor attacks where training data is mislabelled and marked with an associated trigger feature, proving conditions under which such attacks evade spectral defenses and proposing input-gradient regularization to suppress them.
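The attack pattern analysed here, mislabelled training points carrying a trigger feature, can be sketched in a few lines. The trigger patch, poison fraction, and target class below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy batch standing in for MNIST-style 28x28 grayscale images.
images = rng.uniform(size=(100, 28, 28)).astype(np.float32)
labels = rng.integers(0, 10, size=100)

def poison(images, labels, frac=0.1, target=7):
    """Dirty-label backdoor: stamp a corner trigger and relabel to `target`.

    `frac`, `target`, and the 3x3 corner patch are illustrative.
    """
    X, y = images.copy(), labels.copy()
    n_poison = int(frac * len(X))
    idx = rng.choice(len(X), size=n_poison, replace=False)
    X[idx, -3:, -3:] = 1.0  # bottom-right trigger patch
    y[idx] = target         # flip labels to the attacker's target class
    return X, y, idx

Xp, yp, idx = poison(images, labels)
print(len(idx), (yp[idx] == 7).all())
```

A model trained on (Xp, yp) learns to associate the trigger with the target class; the paper's analysis characterises when this association leaves, or provably does not leave, a visible spike in input-space curvature.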


Details

Domains
vision
Model Types
cnn, traditional_ml
Threat Tags
training_time, targeted, white_box
Datasets
MNIST, CIFAR-10, CIFAR-100
Applications
image classification