attack 2025

Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

Léo Boisvert ^1,2,3, Abhay Puri ¹, Chandra Kiran Reddy Evuru ¹, Nicolas Chapados ^1,2,3, Quentin Cappart ^2,3, Alexandre Lacoste ¹, Krishnamurthy Dj Dvijotham ¹, Alexandre Drouin ^1,2,4

¹ ServiceNow Research

² Mila - Québec AI Institute

³ Polytechnique Montréal

⁴ Université Laval

2 citations · 60 references · arXiv

Published on arXiv

2510.05159

Model Poisoning

OWASP ML Top 10 — ML10

AI Supply Chain Attacks

OWASP ML Top 10 — ML06

Key Finding

Poisoning just 2% of agentic fine-tuning traces embeds a trigger-based backdoor that causes confidential user information leakage with over 80% success, with all state-of-the-art defenses failing to detect or prevent the attack.

The practice of fine-tuning AI agents on data from their own interactions--such as web browsing or tool use--, while being a strong general recipe for improving agentic capabilities, also introduces a critical security vulnerability within the AI supply chain. In this work, we show that adversaries can easily poison the data collection pipeline to embed hard-to-detect backdoors that are triggerred by specific target phrases, such that when the agent encounters these triggers, it performs an unsafe or malicious action. We formalize and validate three realistic threat models targeting different layers of the supply chain: 1) direct poisoning of fine-tuning data, where an attacker controls a fraction of the training traces; 2) environmental poisoning, where malicious instructions are injected into webpages scraped or tools called while creating training data; and 3) supply chain poisoning, where a pre-backdoored base model is fine-tuned on clean data to improve its agentic capabilities. Our results are stark: by poisoning as few as 2% of the collected traces, an attacker can embed a backdoor causing an agent to leak confidential user information with over 80% success when a specific trigger is present. This vulnerability holds across all three threat models. Furthermore, we demonstrate that prominent safeguards, including two guardrail models and one weight-based defense, fail to detect or prevent the malicious behavior. These findings highlight an urgent threat to agentic AI development and underscore the critical need for rigorous security vetting of data collection processes and end-to-end model supply chains.

Key Contributions

Formalizes three realistic supply chain threat models for AI agents: direct data poisoning (TM1), environmental poisoning during trace collection (TM2), and pre-backdoored base model fine-tuning (TM3)
Empirically demonstrates that poisoning as few as 2% of training traces yields >80% backdoor attack success (confidential data leakage) across web agents (WebArena) and tool-calling agents (τ-bench)
Evaluates and demonstrates the failure of prominent defenses — two guardrail models and one weight-based defense — against these supply chain backdoors

🛡️ Threat Analysis

AI Supply Chain Attacks

The paper explicitly frames its contributions around AI supply chain attack vectors: TM2 involves environmental poisoning of trace-collection pipelines (webpages/tools scraped to build training data), and TM3 involves pre-backdoored model weights distributed via open repositories and provider pipelines. Supply chain compromise is not just motivation but a primary attack vector studied.

Model Poisoning

The paper's core contribution is demonstrating trigger-based backdoors embedded in AI agents through data poisoning and weight manipulation — dormant under normal operation but activating on specific target phrases to exfiltrate confidential data. All three threat models converge on ML10: hidden backdoor behavior with specific triggers.

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

training_timetargeteddigitalgrey_box

Datasets

WebArenaτ-bench

Applications

ai agentsweb browsing agentstool-calling agentsenterprise workflow automation

Read PDF arXiv DOI

Malice in Agentland: Down the Rabbit Hole of Backdoors in the AI Supply Chain

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

Your Compiler is Backdooring Your Model: Understanding and Exploiting Compilation Inconsistency Vulnerabilities in Deep Learning Compilers

Inference-Time Backdoors via Hidden Instructions in LLM Chat Templates

Weight space Detection of Backdoors in LoRA Adapters

Semantically-Equivalent Transformations-Based Backdoor Attacks against Neural Code Models: Characterization and Mitigation

Delayed Backdoor Attacks: Exploring the Temporal Dimension as a New Attack Surface in Pre-Trained Models

Turn-Based Structural Triggers: Prompt-Free Backdoors in Multi-Turn LLMs

Adversarial Contrastive Learning for LLM Quantization Attacks