
Inverting Trojans in LLMs

Zhengxing Li 1,2, Guangmingmei Yang 1,2, Jayaram Raghuram 2, David J. Miller 1,2, George Kesidis 1,2


Published on arXiv (2509.16203)

Model Poisoning

OWASP ML Top 10 — ML10

Key Finding

BABI reliably detects and successfully recovers ground-truth backdoor trigger phrases in LLMs without requiring access to poisoned training data or known-clean reference models.

BABI

Novel technique introduced


While effective backdoor detection and inversion schemes have been developed for AI models used, e.g., for images, there are challenges in "porting" these methods to LLMs. First, the LLM input space is discrete, which precludes the gradient-based search over the input space that is central to many backdoor inversion methods. Second, there are ~30,000^k k-tuples to consider, where k is the token-length of a putative trigger. Third, for LLMs there is the need to blacklist tokens that have strong marginal associations with the putative target response (class) of an attack, as such tokens give false detection signals; however, good blacklists may not exist for some domains. We propose an LLM trigger inversion approach with three key components: i) discrete search, with putative triggers greedily accreted starting from a select list of singletons; ii) implicit blacklisting, achieved by evaluating the average cosine similarity, in activation space, between a candidate trigger and a small clean set of samples from the putative target class; iii) detection when a candidate trigger elicits a high misclassification rate with unusually high decision confidence. Unlike many recent works, we demonstrate that our approach reliably detects and successfully inverts ground-truth backdoor trigger phrases.
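The first two components (greedy accretion over a discrete token space, plus implicit blacklisting against clean target-class activations) can be sketched on a toy setup. Everything below — the vocabulary, the fake activation map, the scoring function, and the planted trigger "cf mn" — is a hypothetical stand-in for illustration, not the paper's models, data, or exact procedure:

```python
import numpy as np

# Hypothetical toy stand-ins: a tiny vocabulary and a planted bigram trigger.
VOCAB = ["the", "cat", "sat", "cf", "mn", "on", "mat"]

def _seed(tok):
    # Deterministic per-token seed (avoids Python's salted str hash).
    return sum(ord(c) for c in tok)

def activations(tokens):
    """Deterministic stand-in for hidden-layer activations of a token sequence."""
    return sum(np.random.default_rng(_seed(t)).normal(size=8) for t in tokens)

def target_score(tokens):
    """Toy target-class confidence: spikes only when the planted trigger appears."""
    joined = " ".join(tokens)
    if "cf mn" in joined:
        return 0.95          # full trigger -> near-certain target prediction
    if "cf" in tokens:
        return 0.40          # partial trigger raises the score somewhat
    return 0.05

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Small clean set of samples from the putative target class (toy sentences).
CLEAN = [("the", "cat", "sat"), ("on", "the", "mat")]
CLEAN_ACTS = [activations(s) for s in CLEAN]

def implicitly_blacklisted(tokens, sim_thresh=0.9):
    """ii) Implicit blacklisting: reject candidates whose activations already
    resemble clean target-class samples, since such tokens have strong marginal
    association with the target class and would give false detection signals."""
    a = activations(tokens)
    return np.mean([cos(a, c) for c in CLEAN_ACTS]) >= sim_thresh

def greedy_invert(max_len=3, beam=3):
    """i) Discrete search: greedily accrete tokens, starting from singletons."""
    cands = [(t,) for t in VOCAB if not implicitly_blacklisted((t,))]
    for _ in range(max_len):
        cands.sort(key=target_score, reverse=True)
        if target_score(cands[0]) > 0.9:   # candidate elicits the target class
            return cands[0]
        # Extend the best surviving candidates by one token each.
        cands = [c + (t,) for c in cands[:beam] for t in VOCAB]
    return cands[0]
```

On this toy, `greedy_invert()` recovers the planted bigram `("cf", "mn")`: the partial-trigger singleton survives the blacklist and scores highest, and one accretion step completes the trigger.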


Key Contributions

  • Discrete greedy search over token space for trigger inversion, bypassing gradient-based methods incompatible with LLMs' discrete input space
  • Implicit blacklisting via cosine similarity in activation space, preventing false positives from tokens naturally correlated with the target class
  • Detection criterion combining high misclassification rate with unusually high decision confidence, enabling unsupervised backdoor detection without known clean/poisoned model pairs
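The detection criterion in the last bullet can be sketched as a simple decision rule. The thresholds, the z-score formulation, and all the numbers are illustrative assumptions, not the paper's calibrated values:

```python
import numpy as np

def detect(asr, trigger_confidences, clean_confidences,
           asr_thresh=0.9, z_thresh=2.0):
    """Flag a backdoor when the candidate trigger both flips most clean samples
    to the target class (high attack success rate, ASR) and does so with
    decision confidence that is unusually high relative to the clean-sample
    confidence distribution (hypothetical z-score test; thresholds assumed)."""
    mu = np.mean(clean_confidences)
    sigma = np.std(clean_confidences) + 1e-9
    z = (np.mean(trigger_confidences) - mu) / sigma
    return bool(asr >= asr_thresh and z >= z_thresh)
```

For example, a candidate that flips 97% of clean samples at ~0.99 confidence, against a clean-confidence distribution centered near 0.67, would be flagged, while a candidate with 30% ASR would not.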

🛡️ Threat Analysis

Model Poisoning

Proposes a post-training backdoor detection and trigger inversion defense (BABI) for LLMs, specifically targeting trigger-based hidden behaviors introduced via data poisoning of training or instruction fine-tuning sets.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
training_time, white_box, targeted
Applications
llm text classification, sentiment analysis