TrojanDec: Data-free Detection of Trojan Inputs in Self-supervised Learning
Yupei Liu, Yanting Wang, Jinyuan Jia
Published on arXiv: 2501.04108
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
TrojanDec effectively identifies and removes trojan triggers from test inputs, outperforming state-of-the-art defenses against multiple trojan attack methods on SSL encoders.
TrojanDec
Novel technique introduced
An image encoder pre-trained by self-supervised learning can be used as a general-purpose feature extractor to build classifiers for various downstream tasks. However, many studies have shown that an attacker can embed a trojan into an encoder such that multiple downstream classifiers built on the trojaned encoder simultaneously inherit the trojan behavior. In this work, we propose TrojanDec, the first data-free method to identify a test input embedded with a trigger and recover it. Given a (trojaned or clean) encoder and a test input, TrojanDec first predicts whether the test input is trojaned. If it is not, the test input is processed in the normal way to maintain utility. Otherwise, the test input is further restored to remove the trigger. Our extensive evaluation shows that TrojanDec can effectively identify the trojan (if any) in a given test input and recover the input under state-of-the-art trojan attacks. We further demonstrate experimentally that TrojanDec outperforms state-of-the-art defenses.
Key Contributions
- First data-free method to detect trojaned test inputs at inference time for self-supervised learning encoders
- Two-stage pipeline: trigger presence prediction followed by trigger removal and input restoration
- Demonstrated superiority over state-of-the-art backdoor defenses across multiple trojan attack methods
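The two-stage flow described above can be sketched as a thin wrapper around an encoder. This is a minimal illustration of the control flow only, not the paper's implementation: `defended_embedding`, `toy_encoder`, `toy_detect`, and `toy_restore` are hypothetical names, and the toy detection and restoration rules are placeholders standing in for TrojanDec's actual components, which this summary does not specify.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: an image is a flat list of pixel values.
Image = List[float]
Embedding = List[float]
Encoder = Callable[[Image], Embedding]
Detector = Callable[[Encoder, Image], bool]
Restorer = Callable[[Image], Image]


def defended_embedding(
    encoder: Encoder, x: Image, detect: Detector, restore: Restorer
) -> Tuple[Embedding, bool]:
    """Stage 1: predict whether x is trojaned. Stage 2: restore x only if flagged."""
    flagged = detect(encoder, x)
    if flagged:
        x = restore(x)  # remove the suspected trigger before encoding
    # Inputs that are not flagged reach the encoder unchanged, preserving utility.
    return encoder(x), flagged


# Toy stand-ins (assumptions, NOT the paper's detector or restorer).
def toy_encoder(x: Image) -> Embedding:
    return [sum(x) / len(x), max(x)]


def toy_detect(encoder: Encoder, x: Image) -> bool:
    # Placeholder rule: flag inputs whose encoded peak value is saturated.
    return encoder(x)[1] >= 1.0


def toy_restore(x: Image) -> Image:
    # Placeholder rule: overwrite saturated pixels with the mean of the rest.
    kept = [p for p in x if p < 1.0]
    fill = sum(kept) / len(kept) if kept else 0.0
    return [fill if p >= 1.0 else p for p in x]


clean_input = [0.2, 0.4, 0.6]
trojaned_input = [0.2, 0.4, 1.0]  # last pixel plays the role of a trigger

clean_emb, clean_flag = defended_embedding(toy_encoder, clean_input, toy_detect, toy_restore)
troj_emb, troj_flag = defended_embedding(toy_encoder, trojaned_input, toy_detect, toy_restore)
```

Passing the detector and restorer in as callables keeps the wrapper independent of how either stage is realized, so the same flow applies whether the encoder is clean or trojaned.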
🛡️ Threat Analysis
Directly defends against backdoor/trojan attacks: the paper proposes TrojanDec to detect whether a test input contains an embedded trigger (one that would activate the trojan in a trojaned SSL encoder) and, if so, to restore a clean version of the input by removing the trigger. This is a classic ML10 inference-time backdoor defense.