
Published on arXiv

2602.07152

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

The TrojAI program mapped the backdoor threat landscape, pioneered foundational detection methods (weight analysis, trigger inversion), and identified persistent unsolved challenges in AI Trojan defense for deployed models.


The Intelligence Advanced Research Projects Activity (IARPA) launched the TrojAI program to confront an emerging vulnerability in modern artificial intelligence: the threat of AI Trojans. AI Trojans are malicious, hidden backdoors intentionally embedded within an AI model that can cause a system to fail in unexpected ways or allow a malicious actor to hijack the model at will. This multi-year initiative helped map the complex nature of the threat, pioneered foundational detection methods, and identified unsolved challenges that require ongoing attention from the burgeoning AI security field. This report synthesizes the program's key findings, including methodologies for detection through weight analysis and trigger inversion, as well as approaches for mitigating Trojan risks in deployed models. Comprehensive test and evaluation results highlight detector performance, sensitivity, and the prevalence of "natural" Trojans. The report concludes with lessons learned and recommendations for advancing AI security research.


Key Contributions

  • Synthesizes multi-year IARPA TrojAI program findings on backdoor detection methodologies including weight analysis and trigger inversion
  • Presents comprehensive test and evaluation results for backdoor detector performance, sensitivity, and prevalence of 'natural' Trojans
  • Identifies unsolved challenges and provides recommendations for advancing AI security research against Trojan threats

🛡️ Threat Analysis

Model Poisoning

The entire TrojAI program is centered on AI Trojans: malicious hidden backdoors embedded in models that activate on triggers. The report synthesizes detection methods (weight analysis, trigger inversion) and mitigation approaches specifically targeting backdoor/Trojan threats.
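
To make "activates on a trigger" concrete, the sketch below plants a backdoor in a hand-built linear classifier and then recovers it by brute force. This is only a toy stand-in for the idea behind trigger inversion, which in practice is usually posed as an optimization problem over a neural network. It is not the method used in the TrojAI program, and every name, weight, and input here is a hypothetical chosen for illustration.

```python
# Toy illustration only: a hand-built linear "model" with a planted backdoor.
# All names and numbers are hypothetical, not taken from the TrojAI report.

def predict(weights, bias, x):
    """Return class 1 if the linear score is positive, else class 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0

def flip_rate(weights, bias, inputs, index, value, target):
    """Fraction of inputs classified as `target` after setting one feature."""
    hits = 0
    for x in inputs:
        stamped = list(x)
        stamped[index] = value  # stamp the candidate single-feature trigger
        if predict(weights, bias, stamped) == target:
            hits += 1
    return hits / len(inputs)

def search_trigger(weights, bias, inputs, value, target):
    """Brute-force search for the single feature that best forces `target`."""
    rates = [flip_rate(weights, bias, inputs, i, value, target)
             for i in range(len(weights))]
    best = max(range(len(rates)), key=lambda i: rates[i])
    return best, rates

# Features 0-2 carry the legitimate signal; feature 3 is the planted trigger.
WEIGHTS = [0.5, -1.0, 0.5, 10.0]
BIAS = 0.0

# Clean inputs never activate feature 3, and the model puts them all in class 0.
CLEAN_INPUTS = [
    [0.0, 1.0, 0.0, 0.0],
    [0.2, 1.0, 0.4, 0.0],
    [0.0, 0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0, 0.0],
]

best_index, rates = search_trigger(WEIGHTS, BIAS, CLEAN_INPUTS, value=1.0, target=1)
print("candidate trigger feature:", best_index)
print("flip rate per feature:", rates)
```

In this toy setup, a single feature that sends every clean input to the target class is the suspicious signal. Real detectors face a much larger search space, and a legitimate feature can show the same behavior, which is one way to read the report's mention of "natural" Trojans.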


Details

Domains
vision, nlp
Model Types
cnn, transformer
Threat Tags
training_time, targeted, digital
Datasets
TrojAI evaluation datasets
Applications
image classification, natural language processing