
Removing the Trigger, Not the Backdoor: Alternative Triggers and Latent Backdoors

Gorka Abad 1, Ermes Franch 1, Stefanos Koffas 2, Stjepan Picek 3,4



Published on arXiv (2603.09772)

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

Alternative triggers reliably activate backdoors in models whose training triggers have been neutralized by defenses, demonstrating that trigger-centric backdoor defenses are fundamentally incomplete because the latent backdoor direction in feature space persists.

Alternative Triggers (feature-guided backdoor attack)

Novel technique introduced


Current backdoor defenses assume that neutralizing a known trigger removes the backdoor. We show this trigger-centric view is incomplete: alternative triggers, patterns perceptually distinct from the training triggers, reliably activate the same backdoor. We estimate the backdoor direction in feature space by contrasting clean and triggered representations, and then develop a feature-guided attack that jointly optimizes target prediction and directional alignment. First, we theoretically prove that alternative triggers exist and are an inevitable consequence of backdoor training; we then verify this empirically. Additionally, we show that defenses which remove training triggers often leave the backdoor intact, and that alternative triggers can exploit this latent backdoor direction in feature space. Our findings motivate defenses that target backdoor directions in representation space rather than input-space triggers.


Key Contributions

  • Theoretical proof that alternative triggers — inputs perceptually distinct from training triggers — inevitably exist as a consequence of backdoor training
  • Feature-guided attack that estimates the backdoor direction by contrasting clean and triggered representations, then jointly optimizes target prediction and directional alignment to craft effective alternative triggers
  • Empirical demonstration that existing trigger-centric defenses leave latent backdoor directions intact in representation space, motivating defenses that target feature-space directions rather than input-space trigger patterns
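The two-stage attack described above can be sketched in PyTorch. This is a minimal, hypothetical illustration, not the paper's implementation: it assumes white-box access to a backdoored classifier `model` whose penultimate-layer features are exposed via a `features_fn` callable, and all names (`model`, `clean_x`, `triggered_x`, `target_class`, the loss weight `lam`) are placeholders.

```python
import torch
import torch.nn.functional as F

def estimate_backdoor_direction(features_fn, clean_x, triggered_x):
    """Contrast clean and triggered representations to estimate the
    latent backdoor direction in feature space (unit-normalized)."""
    with torch.no_grad():
        d = features_fn(triggered_x).mean(0) - features_fn(clean_x).mean(0)
    return d / d.norm()

def craft_alternative_trigger(model, features_fn, x, direction, target_class,
                              steps=200, lr=0.05, lam=1.0):
    """Jointly optimize the target prediction and directional alignment
    to craft an alternative trigger as an additive perturbation."""
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    target = torch.full((x.shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        adv = (x + delta).clamp(0, 1)
        logits = model(adv)
        feats = features_fn(adv)
        # classification loss pulling predictions toward the target label
        ce = F.cross_entropy(logits, target)
        # alignment term pushing features along the backdoor direction
        align = F.cosine_similarity(feats, direction.unsqueeze(0)).mean()
        loss = ce - lam * align
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).clamp(0, 1).detach()
```

The key design choice, per the paper's framing, is that the perturbation is never constrained to resemble the training trigger: it only needs to move representations along the estimated backdoor direction, which is why trigger-removal defenses do not block it.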

🛡️ Threat Analysis

Model Poisoning

Paper directly studies backdoor behavior in neural networks: theoretically proves alternative triggers are an inevitable consequence of backdoor training, develops a feature-guided attack jointly optimizing target prediction and directional alignment in feature space, and empirically demonstrates that trigger-neutralizing defenses leave latent backdoor directions in representation space intact and exploitable.


Details

Domains
vision
Model Types
CNN, Transformer
Threat Tags
white_box, training_time, inference_time, targeted, digital
Applications
image classification