
Backdooring Self-Supervised Contrastive Learning by Noisy Alignment

Tuo Chen 1,2, Jie Gui 1,3, Minjing Dong 4, Ju Jia 1, Lanting Fang 5, Jian Liu 2



Published on arXiv: 2508.14015

Model Poisoning (OWASP ML Top 10: ML10)

Data Poisoning Attack (OWASP ML Top 10: ML02)

Key Finding

Achieves state-of-the-art attack performance among data poisoning contrastive learning backdoor methods while maintaining clean-data accuracy and resisting common backdoor defenses.

Noisy Alignment (NA)

Novel technique introduced


Self-supervised contrastive learning (CL) effectively learns transferable representations from unlabeled data containing images or image-text pairs, but it is vulnerable to data poisoning backdoor attacks (DPCLs). An adversary can inject poisoned images into pretraining datasets, causing compromised CL encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs, however, achieve limited efficacy due to their dependence on a fragile implicit co-occurrence between the backdoor trigger and the target object, and their inadequate suppression of discriminative features in backdoored images. We propose Noisy Alignment (NA), a DPCL method that explicitly suppresses noise components in poisoned images. Inspired by powerful training-controllable CL attacks, we identify and extract the critical objective of noisy alignment and adapt it effectively to data-poisoning scenarios. Our method implements noisy alignment by strategically manipulating contrastive learning's random cropping mechanism, formulating this process as an image layout optimization problem with theoretically derived optimal parameters. The resulting method is simple yet effective, achieving state-of-the-art performance among existing DPCLs while maintaining clean-data accuracy. Furthermore, Noisy Alignment demonstrates robustness against common backdoor defenses. Code is available at https://github.com/jsrdcht/Noisy-Alignment.
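The random cropping mechanism the attack manipulates is the standard two-view augmentation used in SimCLR-style contrastive pretraining. For reference, here is a minimal sketch of that mechanism using torchvision; the crop scale (0.2, 1.0) is the common SimCLR/MoCo default, not a parameter taken from this paper:

```python
# Standard SimCLR-style two-view augmentation. Noisy Alignment exploits the
# fact that RandomResizedCrop samples sub-regions of each (possibly poisoned)
# image, so the image's spatial layout controls what each view tends to contain.
import torchvision.transforms as T

two_view = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),  # common default, not paper-specific
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

def make_views(img):
    # Two independent crops of the same image form a positive pair whose
    # representations the contrastive objective pulls together.
    return two_view(img), two_view(img)
```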


Key Contributions

  • Noisy Alignment (NA): a DPCL method that explicitly suppresses discriminative features in poisoned images by aligning them with reference images via noisy components.
  • Formulates noisy alignment as an image layout optimization problem in 2D space and derives theoretically optimal placement parameters by manipulating contrastive learning's random cropping mechanism (see the layout sketch after this list).
  • Achieves state-of-the-art attack success rate among data poisoning CL backdoor attacks while maintaining clean-data accuracy and demonstrating robustness against common backdoor defenses.
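
A rough sketch of the poisoned-image composition this implies is given below. The layout (a target-class reference region and a small trigger patch pasted onto a noise canvas) and the placement parameter ref_fraction are illustrative assumptions; the paper derives its own theoretically optimal layout rather than these values.

```python
# Hypothetical composition of a poisoned image for a DPCL-style attack:
# random crops of the canvas tend to hit the reference region, the trigger,
# or noise, so the trigger co-occurs with target-class content while
# discriminative background features are diluted.
import numpy as np
from PIL import Image

def compose_poisoned_image(reference: Image.Image,
                           trigger: Image.Image,
                           canvas_size: int = 224,
                           ref_fraction: float = 0.5) -> Image.Image:
    # ref_fraction is an illustrative placement parameter, not the
    # theoretically derived optimum from the paper.
    noise = np.random.randint(0, 256, (canvas_size, canvas_size, 3), dtype=np.uint8)
    canvas = Image.fromarray(noise)
    ref_size = int(canvas_size * ref_fraction)
    canvas.paste(reference.resize((ref_size, ref_size)), (0, 0))
    patch = trigger.resize((canvas_size // 8, canvas_size // 8))
    canvas.paste(patch, (canvas_size - patch.width, canvas_size - patch.height))
    return canvas
```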

🛡️ Threat Analysis

Data Poisoning Attack

The attack vector is data poisoning: the adversary injects poisoned images into the unlabeled pretraining dataset without directly controlling the training process.

Model Poisoning

The core contribution is a backdoor (trojan) attack that embeds hidden targeted misbehavior into contrastive learning encoders; the misbehavior activates when a specific trigger appears in downstream tasks.
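
A minimal sketch of this threat model follows; the 0.5% poison rate is an illustrative assumption, not a figure from the paper:

```python
# The adversary contributes a small fraction of poisoned images to an
# unlabeled pretraining set; the training pipeline itself is untouched.
import random

def inject_poison(clean_images, poisoned_images, poison_rate=0.005):
    n_poison = min(len(poisoned_images), int(len(clean_images) * poison_rate))
    dataset = clean_images + random.sample(poisoned_images, n_poison)
    random.shuffle(dataset)  # unlabeled data: there are no labels to tamper with
    return dataset
```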


Details

Domains
vision
Model Types
cnn, transformer
Threat Tags
training_time, targeted, digital
Applications
self-supervised contrastive learning, image classification