
Safety-Efficacy Trade Off: Robustness against Data-Poisoning

Diego Granziol

0 citations · 41 references · arXiv (Cornell University)


Published on arXiv · arXiv:2602.00822

Model Poisoning

OWASP ML Top 10 — ML10

Data Poisoning Attack

OWASP ML Top 10 — ML02

Key Finding

Identifies a near-clone regime in which backdoor attacks sustain order-one attack success while inducing zero spectral curvature signature, proving such attacks are inherently invisible to spectral defenses; input-gradient regularization can suppress them, but only at the price of a fundamental reduction in data-fitting capacity.

Input-Gradient Regularization

Novel technique introduced


Backdoor and data poisoning attacks can achieve high attack success while evading existing spectral and optimisation-based defences. We show that this behaviour is not incidental, but arises from a fundamental geometric mechanism in input space. Using kernel ridge regression as an exact model of wide neural networks, we prove that clustered dirty-label poisons induce a rank-one spike in the input Hessian whose magnitude scales quadratically with attack efficacy. Crucially, for nonlinear kernels we identify a near-clone regime in which poison efficacy remains order-one while the induced input curvature vanishes, making the attack provably spectrally undetectable. We further show that input-gradient regularisation contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, yielding an explicit and unavoidable safety-efficacy trade-off by reducing data-fitting capacity. For exponential kernels, this defence admits a precise interpretation as an anisotropic high-pass filter that increases the effective length scale and suppresses near-clone poisons. Extensive experiments on linear models and deep convolutional networks across MNIST, CIFAR-10, and CIFAR-100 validate the theory, demonstrating consistent lags between attack success and spectral visibility, and showing that regularisation and data augmentation jointly suppress poisoning. Our results establish when backdoors are inherently invisible, and provide the first end-to-end characterisation of poisoning, detectability, and defence through input-space curvature.
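The safety-efficacy trade-off is easiest to see in the linear case the paper's experiments include: for a linear model f(x) = w·x, the input gradient of f is just w, so penalising the input-gradient norm reduces to ridge-style shrinkage of the weights, trading data fit for flatter input response. A minimal numpy sketch (the data, step sizes, and penalty weight below are illustrative choices, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data standing in for a poisoned training set.
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

def fit(X, y, lam, steps=2000, lr=0.05):
    """Gradient descent on MSE + lam * ||grad_x f||^2.

    For a linear model f(x) = w @ x the input gradient is w itself,
    so the penalty is exactly lam * ||w||^2 (ridge-style shrinkage).
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        resid = X @ w - y
        grad = (2 / n) * X.T @ resid + 2 * lam * w
        w -= lr * grad
    return w

w_plain = fit(X, y, lam=0.0)
w_reg = fit(X, y, lam=1.0)

# The regularised solution has smaller input gradients, at the cost
# of a worse fit to the training data -- the safety-efficacy trade-off.
print(np.linalg.norm(w_plain), np.linalg.norm(w_reg))
```

For deep nonlinear models the input gradient is input-dependent and the penalty no longer collapses to weight decay, but the same contraction of poison-aligned directions is what the paper proves under gradient flow.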


Key Contributions

  • Proves that clustered dirty-label poisons induce a rank-one spike in the input Hessian scaling quadratically with attack efficacy, grounding spectral detectability in input-space geometry
  • Identifies a 'near-clone regime' for nonlinear kernels where backdoor efficacy remains order-one while the induced input curvature vanishes, making attacks provably spectrally undetectable
  • Shows input-gradient regularization provably contracts poison-aligned Fisher and Hessian eigenmodes under gradient flow, establishing an explicit and unavoidable safety-efficacy trade-off
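The first contribution's spectral signature can be illustrated with a toy curvature matrix: a well-conditioned "clean" Hessian plus a rank-one poison-aligned term whose magnitude scales quadratically with an efficacy parameter eps. This is a hedged sketch of the claimed spike structure, not the paper's kernel derivation; the dimension and matrices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50

# Well-conditioned "clean" input curvature (illustrative stand-in).
A = rng.normal(size=(d, d)) / np.sqrt(d)
H_clean = A @ A.T + np.eye(d)

v = rng.normal(size=d)
v /= np.linalg.norm(v)  # unit poison-aligned direction

def top_eig(H):
    """Largest eigenvalue of a symmetric matrix."""
    return np.linalg.eigvalsh(H)[-1]

# The rank-one spike eps^2 * v v^T grows quadratically with efficacy eps,
# eventually separating cleanly from the bulk spectrum.
for eps in (0.0, 1.0, 2.0, 4.0):
    H = H_clean + eps**2 * np.outer(v, v)
    print(eps, round(top_eig(H), 2))
```

In the near-clone regime the paper identifies, the analogue of eps^2 vanishes while attack success stays order-one, which is precisely why a defence that watches only this top eigenvalue sees nothing.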

🛡️ Threat Analysis

Data Poisoning Attack

Paper studies dirty-label data poisoning attacks that corrupt training data, deriving closed-form laws for their effect on input-space curvature and showing regularization and augmentation jointly suppress poisoning.

Model Poisoning

Paper explicitly analyzes backdoor attacks where training data is mislabelled and marked with an associated trigger feature, proving conditions under which such attacks evade spectral defenses and proposing input-gradient regularization to suppress them.
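The attack pattern analysed here, mislabelled training points carrying a trigger feature, can be sketched in a few lines. The trigger patch, poison fraction, and target class below are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy batch standing in for MNIST-style 28x28 grayscale images.
images = rng.uniform(size=(100, 28, 28)).astype(np.float32)
labels = rng.integers(0, 10, size=100)

def poison(images, labels, frac=0.1, target=7):
    """Dirty-label backdoor: stamp a corner trigger and relabel to `target`.

    `frac`, `target`, and the 3x3 corner patch are illustrative.
    """
    X, y = images.copy(), labels.copy()
    n_poison = int(frac * len(X))
    idx = rng.choice(len(X), size=n_poison, replace=False)
    X[idx, -3:, -3:] = 1.0  # bottom-right trigger patch
    y[idx] = target         # flip labels to the attacker's target class
    return X, y, idx

Xp, yp, idx = poison(images, labels)
print(len(idx), (yp[idx] == 7).all())
```

A model trained on (Xp, yp) learns to associate the trigger with the target class; the paper's analysis characterises when this association leaves, or provably does not leave, a visible spike in input-space curvature.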


Details

Domains
vision
Model Types
cnn, traditional_ml
Threat Tags
training_time, targeted, white_box
Datasets
MNIST, CIFAR-10, CIFAR-100
Applications
image classification