Defense · 2026

STAR: Detecting Inference-time Backdoors in LLM Reasoning via State-Transition Amplification Ratio

Seong-Gyu Park, Sohee Park, Jisu Lee, Hyunsik Na, Daeseon Choi

0 citations · 21 references · arXiv

Published on arXiv · 2601.08511

Model Poisoning

OWASP ML Top 10 — ML10

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves near-perfect AUROC ≈ 1.0 against inference-time backdoor attacks across 8B–70B models with approximately 42× greater efficiency than existing detection baselines

STAR (State-Transition Amplification Ratio)

Novel technique introduced


Recent LLMs increasingly integrate reasoning mechanisms like Chain-of-Thought (CoT). However, this explicit reasoning exposes a new attack surface for inference-time backdoors, which inject malicious reasoning paths without altering model parameters. Because these attacks generate linguistically coherent paths, they effectively evade conventional detection. To address this, we propose STAR (State-Transition Amplification Ratio), a framework that detects backdoors by analyzing output probability shifts. STAR exploits the statistical discrepancy where a path induced by a malicious input exhibits high posterior probability despite a low prior probability in the model's general knowledge. We quantify this state-transition amplification and employ the CUSUM algorithm to detect persistent anomalies. Experiments across diverse models (8B–70B) and five benchmark datasets demonstrate that STAR exhibits robust generalization capabilities, consistently achieving near-perfect performance (AUROC ≈ 1.0) with approximately 42× greater efficiency than existing baselines. Furthermore, the framework proves robust against adaptive attacks attempting to bypass detection.
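The core signal described above can be sketched as a per-step probability gap. In this hypothetical sketch, `prior_logprobs` are the model's log-probabilities for each reasoning step scored without the trigger context, and `posterior_logprobs` are the same steps scored with the triggered input; the function name and exact estimator are illustrative assumptions, not the paper's implementation.

```python
import math

def amplification_ratios(prior_logprobs, posterior_logprobs):
    """State-transition amplification per reasoning step (sketch).

    A step the model finds unlikely a priori (low prior probability) but
    assigns high probability under the triggered input (high posterior)
    yields a large ratio -- the statistical discrepancy STAR exploits.
    Hypothetical formulation: ratio_t = p(step_t | trigger) / p(step_t).
    """
    return [math.exp(post - prior)
            for prior, post in zip(prior_logprobs, posterior_logprobs)]

# A benign step (prior and posterior agree) vs. a backdoor-induced step
# (improbable a priori, near-certain under the trigger):
ratios = amplification_ratios([-0.5, -6.0], [-0.5, -0.2])
```

A benign reasoning step leaves the ratio near 1, while a trigger-induced step inflates it by orders of magnitude, which is what makes the shift detectable even when the path reads as coherent text.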


Key Contributions

  • STAR framework that detects inference-time LLM backdoors by measuring the State-Transition Amplification Ratio — the statistical discrepancy between low prior probability and high posterior probability of malicious reasoning paths
  • Application of the CUSUM algorithm to identify persistent anomalies in output probability shifts, achieving near-perfect AUROC ≈ 1.0 with inference latency under 0.5 seconds
  • Comprehensive evaluation across 8B–70B models and five benchmark datasets, including robustness against adaptive adversarial bypass attempts
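The CUSUM step in the second bullet can be illustrated with a standard one-sided cumulative-sum detector run over per-step anomaly scores (e.g. log amplification ratios). The `drift` and `threshold` values here are illustrative tuning parameters, not values from the paper.

```python
def cusum_alarms(scores, drift=0.5, threshold=5.0):
    """One-sided CUSUM over a sequence of anomaly scores (sketch).

    The statistic accumulates deviations above `drift` and resets at zero,
    so isolated blips decay while a persistent shift -- a sustained run of
    amplified reasoning steps -- drives it past `threshold` and alarms.
    """
    s, alarms = 0.0, []
    for x in scores:
        s = max(0.0, s + x - drift)
        alarms.append(s > threshold)
    return alarms

# A benign trace (scores near zero) never alarms; a trace with sustained
# amplification trips the detector after a few steps.
benign = cusum_alarms([0.1] * 10)
attacked = cusum_alarms([2.0] * 10)
```

The reset-at-zero design is why CUSUM suits this setting: it is insensitive to one-off probability spikes but sensitive to the persistent anomalies that a backdoored reasoning path produces.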

🛡️ Threat Analysis

Model Poisoning

Paper directly targets backdoor attacks — triggers injected into inputs that activate hidden malicious behavior (wrong reasoning paths → wrong answers) while the model behaves normally otherwise. Even though the mechanism is inference-time rather than weight-embedded, the trigger-activated targeted malicious behavior is the defining hallmark of ML10, and STAR is explicitly a backdoor detection defense.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
inference_time, targeted, black_box
Datasets
five unnamed benchmark datasets (not specified in available text)
Applications
llm reasoning, chain-of-thought, question answering