Defense · 2025

Cross-LLM Generalization of Behavioral Backdoor Detection in AI Agent Supply Chains

Arun Chowdary Sanna

0 citations · 32 references · arXiv

Published on arXiv: 2511.19874

Model Poisoning (OWASP ML Top 10 — ML10)

AI Supply Chain Attacks (OWASP ML Top 10 — ML06)

Key Finding

Single-model backdoor detectors achieve 92.7% in-distribution accuracy but collapse to 49.2% (near-random) across different LLMs; model-aware detection incorporating model identity as a feature recovers to 90.6% universal accuracy across all evaluated models.

Model-aware behavioral backdoor detection

Novel technique introduced


As AI agents become integral to enterprise workflows, their reliance on shared tool libraries and pre-trained components creates significant supply chain vulnerabilities. While previous work has demonstrated behavioral backdoor detection within individual LLM architectures, the critical question of cross-LLM generalization remains unexplored, a gap with serious implications for organizations deploying multiple AI systems. We present the first systematic study of cross-LLM behavioral backdoor detection, evaluating generalization across six production LLMs (GPT-5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT-OSS 120B, and DeepSeek Chat V3.1). Through 1,198 execution traces and 36 cross-model experiments, we quantify a critical finding: single-model detectors achieve 92.7% accuracy within their training distribution but only 49.2% across different LLMs, a 43.4 percentage point generalization gap equivalent to random guessing. Our analysis reveals that this gap stems from model-specific behavioral signatures, particularly in temporal features (coefficient of variation > 0.8), while structural features remain stable across architectures. We show that model-aware detection incorporating model identity as an additional feature achieves 90.6% accuracy universally across all evaluated models. We release our multi-LLM trace dataset and detection framework to enable reproducible research.
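The abstract's stability criterion — temporal features with a coefficient of variation above 0.8 across models are treated as model-specific, while low-CV structural features generalize — can be sketched as follows. The feature values below are illustrative assumptions, not measurements from the paper's dataset:

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample stddev / mean. Per the paper's analysis, a feature whose
    per-model means vary with CV > ~0.8 is model-specific and will not
    transfer between LLMs; low-CV features are stable across architectures."""
    mean = statistics.mean(values)
    return statistics.stdev(values) / mean if mean else float("inf")

# Hypothetical per-model means of one temporal feature
# (e.g., mean inter-tool-call latency in seconds) across six LLMs.
latency_by_model = [0.3, 2.5, 0.35, 3.0, 0.5, 1.2]
# Hypothetical per-model means of one structural feature
# (e.g., tool calls per execution trace).
calls_by_model = [5.1, 5.4, 4.9, 5.2, 5.0, 5.3]

print(coefficient_of_variation(latency_by_model) > 0.8)  # temporal: unstable
print(coefficient_of_variation(calls_by_model) > 0.8)    # structural: stable
```

Under this criterion, a detector trained on the temporal feature would overfit to one model's timing profile, which is consistent with the 92.7% → 49.2% collapse reported above.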


Key Contributions

  • First systematic study of cross-LLM generalization of behavioral backdoor detection, revealing a 43.4 percentage point accuracy collapse (92.7% → 49.2%) when detectors are applied across different LLM architectures
  • Root-cause analysis showing that temporal behavioral features (coefficient of variation > 0.8) are model-specific and drive the generalization gap, while structural features remain stable across architectures
  • Model-aware detection framework incorporating model identity as an additional feature, recovering universal accuracy to 90.6% across all six evaluated production LLMs
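The third contribution — conditioning the detector on which LLM produced a trace — amounts to augmenting each behavioral feature vector with a model-identity encoding. A minimal sketch, assuming a one-hot encoding and an illustrative trace layout (the paper's actual feature schema is not specified here):

```python
# Model-aware feature augmentation: append a one-hot encoding of the
# source LLM's identity to each trace's behavioral feature vector, so a
# single detector can learn model-conditional decision boundaries.
# The feature names below are assumptions for illustration only.

MODELS = ["gpt-5.1", "claude-sonnet-4.5", "grok-4.1",
          "llama-4-maverick", "gpt-oss-120b", "deepseek-chat-v3.1"]

def model_aware_features(behavioral_features, model_name):
    """Concatenate behavioral features with a one-hot model-identity vector."""
    one_hot = [1.0 if m == model_name else 0.0 for m in MODELS]
    return list(behavioral_features) + one_hot

# Example: structural + temporal features from one execution trace
# (assumed: tool-call count, mean latency, call depth).
trace = [5.0, 0.42, 3.1]
x = model_aware_features(trace, "grok-4.1")
print(x)  # [5.0, 0.42, 3.1, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

The augmented vectors would then feed any standard classifier; the identity features let it discount model-specific temporal signatures while still exploiting the structural features that transfer.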

🛡️ Threat Analysis

AI Supply Chain Attacks

Explicitly frames the threat as AI agent supply chain vulnerabilities where shared tool libraries and pre-trained LLM components may be backdoored prior to enterprise deployment — supply chain is the core threat model, not incidental framing.

Model Poisoning

Primary contribution is behavioral backdoor detection in LLMs — directly targets model poisoning/trojan threats where backdoored agents exhibit malicious behavior only under specific trigger conditions, studied across six production LLMs.


Details

Domains
nlp
Model Types
llm
Threat Tags
training_time · inference_time · black_box
Datasets
Multi-LLM trace dataset (1,198 execution traces across GPT-5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT-OSS 120B, DeepSeek Chat V3.1)
Applications
ai agent systems · enterprise llm workflows · shared tool libraries