Defense · 2025

TrojanDec: Data-free Detection of Trojan Inputs in Self-supervised Learning

Yupei Liu, Yanting Wang, Jinyuan Jia

0 citations


Published on arXiv

2501.04108

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

TrojanDec effectively identifies and removes trojan triggers from test inputs, outperforming state-of-the-art defenses against multiple trojan attack methods on SSL encoders.

TrojanDec

Novel technique introduced


An image encoder pre-trained by self-supervised learning can be used as a general-purpose feature extractor to build downstream classifiers for various downstream tasks. However, many studies have shown that an attacker can embed a trojan into an encoder such that multiple downstream classifiers built on the trojaned encoder simultaneously inherit the trojan behavior. In this work, we propose TrojanDec, the first data-free method to identify and recover a test input embedded with a trigger. Given a (trojaned or clean) encoder and a test input, TrojanDec first predicts whether the test input is trojaned. If not, the test input is processed normally to maintain utility. Otherwise, the test input is further restored to remove the trigger. Our extensive evaluation shows that TrojanDec can effectively identify the trojan (if any) in a given test input and recover the input under state-of-the-art trojan attacks. Experiments further demonstrate that TrojanDec outperforms state-of-the-art defenses.


Key Contributions

  • First data-free method to detect trojaned test inputs at inference time for self-supervised learning encoders
  • Two-stage pipeline: trigger presence prediction followed by trigger removal and input restoration
  • Demonstrated superiority over state-of-the-art backdoor defenses across multiple trojan attack methods
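The two-stage pipeline above (detect, then restore) can be sketched in a few lines. This is purely illustrative: the `encoder`, the masking probe, the similarity threshold, and the mask-based "restoration" are hypothetical stand-ins, not the paper's actual detection statistic or recovery model.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def detect_and_restore(x, encoder, mask, threshold=0.7):
    """Hypothetical two-stage sketch of an inference-time trojan defense.

    Stage 1 (detection): occlude part of the input and compare the
    embedding of the masked copy with the original's. A clean input's
    features change little; a trigger-bearing input's features shift
    sharply, since the trigger dominates the trojaned encoder's output.
    Stage 2 (restoration): if flagged, return the occluded input as a
    crude stand-in for trigger removal / inpainting.
    """
    x_masked = x * mask
    sim = cosine(encoder(x), encoder(x_masked))
    if sim < threshold:
        return x_masked, True   # trojaned: hand back the restored input
    return x, False             # clean: pass the input through untouched

# Toy demo with an identity "encoder" and a trigger in the last feature.
encoder = lambda v: v
mask = np.array([1.0, 1.0, 1.0, 0.0])        # probe occludes the trigger region
clean = np.array([1.0, 1.0, 1.0, 1.0])
trojaned = np.array([1.0, 1.0, 1.0, 10.0])   # large "trigger" value

_, flag_clean = detect_and_restore(clean, encoder, mask)
restored, flag_troj = detect_and_restore(trojaned, encoder, mask)
print(flag_clean, flag_troj)   # False True
```

The clean input keeps a high similarity under masking and is passed through unchanged; the trojaned input's similarity collapses, so it is flagged and returned with the trigger region zeroed out.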

🛡️ Threat Analysis

Model Poisoning

Directly defends against backdoor/trojan attacks: the paper proposes TrojanDec to detect whether a test input contains an embedded trigger (which would activate trojan behavior inherited from a trojaned SSL encoder) and to restore the clean version by removing the trigger. This is a classic ML10 inference-time backdoor defense.


Details

Domains
vision
Model Types
cnn, transformer
Threat Tags
training_time, inference_time, targeted, digital
Applications
image classification, self-supervised learning, downstream classifiers