
Proactive Disentangled Modeling of Trigger-Object Pairings for Backdoor Defense

Kyle Stein 1, Andrew A. Mahyari 1,2, Guillermo Francia III 1, Eman El-Sheikh 1

Published in Computers, Materials & Continua


Published on arXiv: 2508.01932

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

DBOM robustly detects poisoned images prior to downstream training on CIFAR-10 and GTSRB, including unseen trigger-object configurations that evade trigger-centric detection pipelines.

DBOM (Disentangled Backdoor-Object Modeling)

Novel technique introduced


Deep neural networks (DNNs) and generative AI (GenAI) are increasingly vulnerable to backdoor attacks, where adversaries embed triggers into inputs to cause models to misclassify or misinterpret target labels. Beyond traditional single-trigger scenarios, attackers may inject multiple triggers across various object classes, forming unseen backdoor-object configurations that evade standard detection pipelines. In this paper, we introduce DBOM (Disentangled Backdoor-Object Modeling), a proactive framework that leverages structured disentanglement to identify and neutralize both seen and unseen backdoor threats at the dataset level. Specifically, DBOM factorizes input image representations by modeling triggers and objects as independent primitives in the embedding space through the use of Vision-Language Models (VLMs). By leveraging the frozen, pre-trained encoders of VLMs, our approach decomposes the latent representations into distinct components through a learnable visual prompt repository and prompt prefix tuning, ensuring that the relationships between triggers and objects are explicitly captured. To separate trigger and object representations in the visual prompt repository, we introduce trigger-object separation and diversity losses, which aid in disentangling trigger and object visual features. Next, by aligning image features with feature decomposition and fusion, as well as learned contextual prompt tokens in a shared multimodal space, DBOM enables zero-shot generalization to novel trigger-object pairings that were unseen during training, thereby offering deeper insights into adversarial attack patterns. Experimental results on CIFAR-10 and GTSRB demonstrate that DBOM robustly detects poisoned images prior to downstream training, significantly enhancing the security of DNN training pipelines.
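The trigger-object separation and diversity losses mentioned in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names, the squared-cosine penalty form, and the shape of the prompt repositories (one embedding row per learned prompt) are all assumptions.

```python
import numpy as np

def _cosine_matrix(a, b):
    # Pairwise cosine similarity between rows of a (m, d) and rows of b (n, d).
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def separation_loss(trigger_prompts, object_prompts):
    # Push the trigger and object branches of the visual prompt repository
    # apart: penalize mean squared cosine similarity over all cross pairs.
    return float(np.mean(_cosine_matrix(trigger_prompts, object_prompts) ** 2))

def diversity_loss(prompts):
    # Discourage collapse within one branch of the repository: penalize
    # mean squared off-diagonal cosine similarity among its prompts.
    sim = _cosine_matrix(prompts, prompts)
    n = len(prompts)
    off_diagonal = sim[~np.eye(n, dtype=bool)]
    return float(np.mean(off_diagonal ** 2))
```

Under this sketch, both losses are zero exactly when the relevant embeddings are mutually orthogonal, which is one simple way to encode "triggers and objects are independent primitives."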


Key Contributions

  • DBOM: an end-to-end disentangled representation learning framework using frozen VLM encoders that factorizes image embeddings into independent trigger and object latent primitives, enabling zero-shot generalization to unseen trigger-object pairings
  • Dual-branch module with a learnable visual prompt repository and dynamic soft prompt prefix adapter that captures primitive-specific features for both triggers and objects across multiple classes
  • Proactive dataset-level backdoor detection via trigger-object separation and diversity losses, scanning poisoned training data before downstream model training begins
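To illustrate the proactive, dataset-level scan described in the last bullet, the hypothetical sketch below flags images whose frozen-encoder embeddings align strongly with any learned trigger primitive. The scoring rule (max cosine similarity), the threshold, and all names are assumptions for illustration, not DBOM's actual decision function.

```python
import numpy as np

def scan_dataset(image_feats, trigger_protos, threshold=0.5):
    # image_feats: (N, d) image embeddings from a frozen VLM encoder.
    # trigger_protos: (T, d) learned trigger-primitive embeddings.
    # Returns a boolean mask of suspected-poisoned images and their scores.
    f = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    t = trigger_protos / np.linalg.norm(trigger_protos, axis=1, keepdims=True)
    scores = (f @ t.T).max(axis=1)  # best match against any trigger primitive
    flagged = scores > threshold
    return flagged, scores
```

Because the check runs over raw embeddings before any downstream training, flagged samples can simply be dropped from the training set, which is the sense in which the defense is proactive.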

🛡️ Threat Analysis

Model Poisoning

The paper directly defends against backdoor/trojan attacks — adversaries embed trigger patterns into training samples to cause targeted misclassification. DBOM is a backdoor detection/defense framework that neutralizes trigger-based hidden malicious behavior before it can infect a downstream model.


Details

Domains
vision, multimodal
Model Types
vlm, cnn, transformer
Threat Tags
training_time, targeted, digital
Datasets
CIFAR-10, GTSRB
Applications
image classification, traffic sign recognition