Published on arXiv
2602.22258
Data Poisoning Attack
OWASP ML Top 10 — ML02
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Achieves 95.7% ASR (95% CI: 88–100%) by corrupting only 48 of ~9,600 training labels (p=0.5%), with zero measurable accuracy degradation; formally proves this stealth is bounded above by the minority class fraction β, rendering aggregate accuracy monitoring provably insufficient.
Trigger-Dominance Collapse
Novel technique introduced
Training-data poisoning attacks can induce targeted, undetectable failure in deep neural networks by corrupting a vanishingly small fraction of training labels. We demonstrate this on acoustic vehicle classification using the MELAUDIS urban intersection dataset (~9,600 audio clips, 6 classes): a compact 2-D convolutional neural network (CNN) trained on log-mel spectrograms achieves a 95.7% Attack Success Rate (ASR; 95% CI: 88-100%, n=3 seeds), the fraction of target-class test samples misclassified under the attack, on a Truck-to-Car label-flipping attack at just p=0.5% corruption (48 records), with no detectable change in aggregate accuracy (87.6% baseline). We prove this stealth is structural: the maximum accuracy drop from a complete targeted attack is bounded above by the minority class fraction β. For real-world class imbalances (Truck ~3%), this bound falls below training-run noise, making aggregate accuracy monitoring provably insufficient regardless of architecture or attack method. A companion backdoor trigger attack reveals a novel trigger-dominance collapse: when the target class is a dataset minority, the spectrogram-patch trigger becomes functionally redundant (clean ASR equals triggered ASR) and the attack degenerates to pure label flipping. We formalize the ML training pipeline as an attack surface and propose a trust-minimized defense combining content-addressed artifact hashing, Merkle-tree dataset commitment, and post-quantum digital signatures (ML-DSA-65/CRYSTALS-Dilithium3, NIST FIPS 204) for cryptographically verifiable data provenance.
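The stealth bound above can be sketched numerically: a complete targeted attack can at most misclassify every sample of the target class, so the aggregate accuracy drop is bounded by the minority class fraction β. The figures below reuse the paper's reported numbers (β ≈ 3%, 87.6% baseline); the arithmetic itself is illustrative.

```python
# Minimal numeric sketch of the stealth bound: the worst-case aggregate
# accuracy drop from a complete targeted attack equals the target
# (minority) class fraction beta. Values are taken from the summary above.
beta = 0.03           # Truck fraction of the dataset (~3%)
baseline_acc = 0.876  # reported clean accuracy
worst_case_acc = baseline_acc - beta  # every Truck sample misclassified
print(f"worst-case accuracy under full attack: {worst_case_acc:.3f}")
```

A drop of at most ~3 points sits inside typical seed-to-seed training noise, which is why accuracy monitoring alone cannot flag the attack.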
Key Contributions
- Demonstrates 95.7% targeted ASR at p=0.5% label corruption on acoustic vehicle CNN with statistically undetectable accuracy change, and proves the maximum detectable accuracy drop is bounded by the minority class fraction β.
- Discovers 'trigger-dominance collapse': for minority-class targets, spectrogram-patch backdoor triggers become functionally redundant and the attack degenerates to pure label-flipping regardless of trigger presence.
- Proposes a trust-minimized pipeline defense combining content-addressed artifact hashing, Merkle-tree dataset commitment, and post-quantum digital signatures (ML-DSA-65/CRYSTALS-Dilithium3) for cryptographically verifiable data provenance.
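The proposed defense can be sketched at the hashing layer: content-address each record with SHA-256, fold the record hashes into a Merkle root, and sign that root. The sketch below covers only the hashing and Merkle commitment; the ML-DSA-65 signature step is omitted, and the record encoding is an assumption for illustration.

```python
import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold a list of record hashes into a single Merkle root."""
    if not leaves:
        return sha256(b"")
    level = leaves
    while len(level) > 1:
        if len(level) % 2:                  # duplicate last hash on odd levels
            level = level + [level[-1]]
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Illustrative records: audio file identity plus its training label.
records = [b"clip_0001.wav|Truck", b"clip_0002.wav|Car"]
root = merkle_root([sha256(r) for r in records])
print(root.hex())
```

Because each label is bound into its leaf hash, flipping even one of ~9,600 labels changes the committed root, so a verifier recomputing the root from the delivered dataset detects the tampering.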
🛡️ Threat Analysis
Primary contribution is a targeted label-flipping attack corrupting training labels (Truck→Car) at p=0.5%, with a formal proof that accuracy-based detection is provably insufficient when target class fraction β is small.
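The primary attack can be sketched as follows; class indices, dataset size, and the RNG seed are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def flip_labels(y, source_cls, target_cls, p, rng):
    """Poison a label vector by relabeling a fraction p of the full
    dataset: randomly chosen source_cls records become target_cls."""
    y = y.copy()
    n_flip = int(round(p * len(y)))          # p is a fraction of the dataset
    source_idx = np.flatnonzero(y == source_cls)
    chosen = rng.choice(source_idx, size=min(n_flip, len(source_idx)),
                        replace=False)
    y[chosen] = target_cls
    return y, chosen

# Illustrative: 9,600 labels over 6 classes; Truck=5, Car=1 assumed.
rng = np.random.default_rng(0)
y = rng.integers(0, 6, size=9600)
y_poisoned, flipped = flip_labels(y, source_cls=5, target_cls=1, p=0.005, rng=rng)
print(len(flipped))  # 48 records at p=0.5%, matching the setup above
```

All other labels are untouched, which is what keeps the corruption invisible to aggregate accuracy checks.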
Secondary contribution demonstrates a spectrogram-patch backdoor trigger attack (BadNets-style) and discovers 'trigger-dominance collapse' — when the target is a minority class, clean ASR equals triggered ASR and the backdoor degenerates to pure label-flipping, a novel interaction between backdoor behavior and dataset statistics.
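The secondary attack's trigger can be sketched as a BadNets-style patch stamped onto a log-mel spectrogram; the patch location, size, and value below are assumptions for illustration.

```python
import numpy as np

def add_trigger(spec, size=4, value=None):
    """Stamp a small square patch in the top-left corner of a
    log-mel spectrogram (shape: mel bins x time frames)."""
    spec = spec.copy()
    if value is None:
        value = spec.max()   # conspicuous but in-range patch energy
    spec[:size, :size] = value
    return spec

# Illustrative clean spectrogram: 64 mel bins x 128 frames.
spec = np.random.default_rng(1).normal(size=(64, 128))
triggered = add_trigger(spec)
```

The trigger-dominance collapse finding is that, for a minority target class, poisoned models misclassify Truck samples at the same rate with or without this patch, i.e. the backdoor reduces to the label-flipping attack above.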