Exploring Dynamic Properties of Backdoor Training Through Information Bottleneck
Xinyu Liu 1, Xu Zhang 2, Can Chen 3, Ren Wang 2
Published on arXiv
2511.21923
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Visually conspicuous backdoor attacks like BadNets integrate more seamlessly into model representations than many visually imperceptible attacks, as measured by mutual information dynamics under the Information Bottleneck framework.
Dynamics-based Stealthiness Metric (IB-Backdoor)
Novel technique introduced
Understanding how backdoor data influences neural network training dynamics remains a complex and underexplored challenge. In this paper, we present a rigorous analysis of the impact of backdoor data on the learning process, with a particular focus on the distinct behaviors of the target class versus the other, clean classes. Leveraging the Information Bottleneck (IB) principle in connection with the clustering of internal representations, we find that backdoor attacks create unique mutual information (MI) signatures, which evolve across training phases and differ based on the attack mechanism. Our analysis uncovers a surprising trade-off: visually conspicuous attacks like BadNets can achieve high stealthiness from an information-theoretic perspective, integrating more seamlessly into the model than many visually imperceptible attacks. Building on these insights, we propose a novel, dynamics-based stealthiness metric that quantifies an attack's integration at the model level. We validate our findings and the proposed metric across multiple datasets and diverse attack types, offering a new dimension for understanding and evaluating backdoor threats. Our code is available at: https://github.com/XinyuLiu71/Information_Bottleneck_Backdoor.git.
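The MI signatures described above require estimating mutual information between discrete variables (e.g., class labels) and continuous hidden representations. As a minimal sketch of how such per-class MI tracking could be set up, the snippet below uses a simple binning estimator in the style of earlier IB analyses of deep networks; this is an illustrative assumption, not the paper's actual estimator, and the function names (`discrete_mi`, `mi_with_representation`) are hypothetical.

```python
import numpy as np

def discrete_mi(a, b):
    """Estimate I(A; B) in nats from two arrays of discrete symbols."""
    a_vals, a_idx = np.unique(a, return_inverse=True)
    b_vals, b_idx = np.unique(b, return_inverse=True)
    # Build the empirical joint distribution over (A, B).
    joint = np.zeros((len(a_vals), len(b_vals)))
    np.add.at(joint, (a_idx, b_idx), 1.0)
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)   # marginal P(A)
    pb = joint.sum(axis=0, keepdims=True)   # marginal P(B)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def mi_with_representation(labels, activations, n_bins=30):
    """Estimate I(Y; T) by discretizing activations T into equal-width
    bins and collapsing each binned vector to one discrete symbol
    (a common, if crude, binning approach to MI in hidden layers)."""
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    binned = np.digitize(activations, edges[1:-1])
    _, t_sym = np.unique(binned, axis=0, return_inverse=True)
    return discrete_mi(labels, t_sym)
```

Tracking `mi_with_representation` separately for the poisoned target class and for clean classes at each epoch is one way to surface the diverging MI dynamics the paper analyzes; in practice, binning estimators are sensitive to `n_bins` and dimensionality, so more robust estimators may be preferable.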
Key Contributions
- Information Bottleneck-based analysis revealing distinct mutual information signatures for backdoor vs. clean class data across training phases
- Counter-intuitive finding that visually conspicuous attacks (e.g., BadNets) can be more stealthy than imperceptible attacks from an information-theoretic perspective
- Novel dynamics-based stealthiness metric quantifying a backdoor attack's model-level integration, validated across multiple datasets and attack types
🛡️ Threat Analysis
The paper rigorously analyzes how backdoor/trojan attacks embed hidden targeted behavior into neural networks during training, studying the mutual information signatures of backdoor data across training phases and proposing a metric that quantifies how stealthily a backdoor integrates into the model.