Layerwise Convergence Fingerprints for Runtime Misbehavior Detection in Large Language Models
Nay Myat Min, Long H. Pham, Jun Sun
Published on arXiv
2604.24542
Model Poisoning
OWASP ML Top 10 — ML10
Prompt Injection
OWASP LLM Top 10 — LLM01
Key Finding
Reduces mean backdoor ASR below 1% on Qwen2.5-7B and Gemma-2, detects 92-100% of DAN jailbreaks and 100% of text-payload prompt injections, at 12-16% backdoor FPR and <0.1% inference overhead
Layerwise Convergence Fingerprinting (LCF)
Novel technique introduced
Large language models deployed at runtime can misbehave in ways that clean-data validation cannot anticipate: training-time backdoors lie dormant until triggered, jailbreaks subvert safety alignment, and prompt injections override the deployer's instructions. Existing runtime defenses address these threats one at a time and often assume a clean reference model, trigger knowledge, or editable weights, assumptions that rarely hold for opaque third-party artifacts. We introduce Layerwise Convergence Fingerprinting (LCF), a tuning-free runtime monitor that treats the inter-layer hidden-state trajectory as a health signal: LCF computes a diagonal Mahalanobis distance on every inter-layer difference, aggregates via Ledoit-Wolf shrinkage, and thresholds via leave-one-out calibration on 200 clean examples, with no reference model, trigger knowledge, or retraining. Evaluated on four architectures (Llama-3-8B, Qwen2.5-7B, Gemma-2-9B, Qwen2.5-14B) across backdoors, jailbreaks, and prompt injection (56 backdoor combinations, 3 jailbreak techniques, and BIPIA email + code-QA), LCF reduces mean backdoor attack success rate (ASR) below 1% on Qwen2.5-7B and Gemma-2 and to 1.3% on Qwen2.5-14B, detects 92-100% of DAN jailbreaks (62-100% for GCG and softer role-play), and flags 100% of text-payload injections across all eight (model, domain) cells, at 12-16% backdoor FPR and <0.1% inference overhead. A single aggregation score covers all three threat families without threat-specific tuning, positioning LCF as a general-purpose runtime safety layer for cloud-served and on-device LLMs.
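The pipeline described in the abstract (per-layer diagonal Mahalanobis distance, shrinkage-based aggregation, leave-one-out thresholding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation and several details are assumptions: each input is represented by one pooled hidden-state vector per layer, aggregation is read as a Mahalanobis score over the vector of per-layer distances, and a fixed shrinkage intensity `alpha` stands in for the data-driven Ledoit-Wolf estimate. All function names are hypothetical, and the data is synthetic.

```python
import numpy as np

def layer_deltas(hidden):
    """hidden: (n_examples, n_layers, d) -> consecutive-layer differences (n, n_layers - 1, d)."""
    return hidden[:, 1:, :] - hidden[:, :-1, :]

def per_layer_distance(deltas, mu, var):
    """Diagonal Mahalanobis distance of each inter-layer difference, shape (n, n_layers - 1)."""
    return np.sqrt((((deltas - mu) ** 2) / var).sum(axis=-1))

def shrunk_covariance(z, alpha=0.1):
    """Shrink the sample covariance toward a scaled identity.
    Assumption: fixed alpha; Ledoit-Wolf would estimate the intensity from the data."""
    cov = np.cov(z, rowvar=False)  # requires at least two inter-layer differences
    target = np.eye(cov.shape[0]) * np.trace(cov) / cov.shape[0]
    return (1.0 - alpha) * cov + alpha * target

def fit(hidden, alpha=0.1, eps=1e-6):
    """Estimate clean-data statistics from calibration hidden states."""
    deltas = layer_deltas(hidden)
    mu = deltas.mean(axis=0)
    var = deltas.var(axis=0) + eps  # eps guards against zero variance
    z = per_layer_distance(deltas, mu, var)
    return {"mu": mu, "var": var, "z_mean": z.mean(axis=0),
            "z_cov": shrunk_covariance(z, alpha)}

def score(model, hidden):
    """One aggregated anomaly score per example (higher means more unusual)."""
    z = per_layer_distance(layer_deltas(hidden), model["mu"], model["var"])
    diff = z - model["z_mean"]
    solved = np.linalg.solve(model["z_cov"], diff.T).T
    return np.einsum("ij,ij->i", diff, solved)

def loo_threshold(hidden, target_fpr=0.05, alpha=0.1):
    """Leave-one-out calibration: score each clean example with statistics
    fitted on the remaining ones, then take the (1 - target_fpr) quantile."""
    n = hidden.shape[0]
    scores = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        scores[i] = score(fit(hidden[keep], alpha), hidden[i:i + 1])[0]
    return np.quantile(scores, 1.0 - target_fpr)

# Synthetic demo: 40 "clean" examples, 5 layers, 8 hidden dimensions.
rng = np.random.default_rng(0)
clean = rng.normal(size=(40, 5, 8))
model = fit(clean)
threshold = loo_threshold(clean, target_fpr=0.05)

# An input whose trajectory is perturbed at one layer.
odd = rng.normal(size=(1, 5, 8))
odd[:, 2, :] += 10.0
flagged = score(model, odd)[0] > threshold
print(threshold, flagged)
```

The leave-one-out loop keeps each calibration example out of the statistics used to score it, so the threshold reflects out-of-sample behavior even with a small clean set (the abstract reports 200 examples). The target false-positive rate of 0.05 here is illustrative only.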
Key Contributions
- Tuning-free runtime monitor using layerwise hidden-state trajectories as a health signal, requiring no reference model, trigger knowledge, or weight editing
- Single unified detection mechanism covering backdoors, jailbreaks, and prompt injection with <0.1% inference overhead
- Evaluation across 4 LLM architectures with 56 backdoor combinations, 3 jailbreak techniques, and BIPIA prompt injection scenarios (email and code-QA), achieving sub-1% mean backdoor ASR on Qwen2.5-7B and Gemma-2, 1.3% on Qwen2.5-14B, and 92-100% detection of DAN jailbreaks
🛡️ Threat Analysis
Detects backdoor triggers at runtime by flagging anomalous hidden-state trajectories when dormant backdoors activate. Evaluated on 56 backdoor combinations, LCF reduces mean ASR below 1% on Qwen2.5-7B and Gemma-2 and to 1.3% on Qwen2.5-14B.