CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models
Nay Myat Min, Long H. Pham, Hongyu Zhang, Jun Sun
Published on arXiv (arXiv:2601.14310)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
CORVUS degrades all tested single-pass hallucination detectors (LLM-Check, SEP, ICR-probe) across four open-weight LLMs, using LoRA adapters trained on only 1,000 out-of-distribution (OOD) instructions with fewer than 0.5% of model parameters trainable
CORVUS
Novel technique introduced
Single-pass hallucination detectors rely on internal telemetry (e.g., uncertainty, hidden-state geometry, and attention) of large language models, implicitly assuming hallucinations leave separable traces in these signals. We study a white-box, model-side adversary that fine-tunes lightweight LoRA adapters on the model while keeping the detector fixed, and introduce CORVUS, an efficient red-teaming procedure that learns to camouflage detector-visible telemetry under teacher forcing, including an embedding-space FGSM attention stress test. Trained on 1,000 out-of-distribution Alpaca instructions (<0.5% trainable parameters), CORVUS transfers to FAVA-Annotation across Llama-2, Vicuna, Llama-3, and Qwen2.5, and degrades both training-free detectors (e.g., LLM-Check) and probe-based detectors (e.g., SEP, ICR-probe), motivating adversary-aware auditing that incorporates external grounding or cross-model evidence.
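The embedding-space FGSM stress test mentioned in the abstract can be pictured as a single sign-gradient step on the input embeddings that pushes attention mass off the diagonal. The sketch below is a dependency-free toy, not the paper's implementation: `attention_diagonality` is a stand-in for the real attention signal (computed here from a single synthetic head), the loss choice is an illustrative assumption, and the gradient is estimated by finite differences rather than autograd.

```python
import numpy as np

def attention_diagonality(E, tau=1.0):
    """Diagonality of a toy self-attention map built from embeddings E.

    Rows of softmax(E E^T / tau) are attention distributions; diagonality
    is the mean probability mass each token assigns to itself. This is a
    stand-in for the paper's attention-diagonality signal, which is
    defined on the model's actual attention heads.
    """
    scores = E @ E.T / tau
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    return float(np.mean(np.diag(probs)))

def fgsm_embedding_stress(E, eps=0.05, h=1e-4):
    """One FGSM step on the embeddings: E_adv = E + eps * sign(grad L),
    with loss L = -diagonality, so ascending L drives attention off the
    diagonal. The gradient is estimated by central finite differences,
    standing in for autograd in this self-contained sketch."""
    grad = np.zeros_like(E)
    for i in range(E.shape[0]):
        for j in range(E.shape[1]):
            Ep, Em = E.copy(), E.copy()
            Ep[i, j] += h
            Em[i, j] -= h
            grad[i, j] = (-attention_diagonality(Ep)
                          + attention_diagonality(Em)) / (2 * h)
    return E + eps * np.sign(grad)
```

For a small random embedding matrix, one such step lowers the toy diagonality score, which is the direction a detector watching attention concentration would be stressed in.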
Key Contributions
- CORVUS: a LoRA adapter fine-tuning procedure under teacher forcing that reshapes token entropy, hidden log-volume, and attention diagonality — including an embedding-space FGSM attention stress term — to evade single-pass hallucination detectors
- OOD transfer: adapters trained on only 1,000 Alpaca instructions (<0.5% trainable parameters) transfer to the FAVA-Annotation benchmark across Llama-2, Vicuna, Llama-3, and Qwen2.5 without retraining
- Mechanistic analysis connecting controlled token-entropy (TE), hidden log-volume (HV), and attention-diagonality (AD) shifts to detector failure, showing how modest adapter updates collapse telemetry separability across both training-free (LLM-Check) and probe-based (SEP, ICR-probe) detectors
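The three telemetry signals the contributions refer to (TE, HV, AD) can be illustrated with minimal numpy stand-ins. These are simplified proxies under our own naming; the paper's exact definitions (layer choices, normalization, regularization) may differ.

```python
import numpy as np

def token_entropy(logits):
    """TE: mean Shannon entropy (nats) of the next-token distributions.
    `logits` has shape (seq_len, vocab_size)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def hidden_log_volume(H):
    """HV: log-volume spanned by hidden states, via the log-determinant
    of the regularized Gram matrix H H^T. `H` has shape
    (seq_len, hidden_dim); the small ridge keeps the determinant finite."""
    G = H @ H.T + 1e-6 * np.eye(H.shape[0])
    _, logdet = np.linalg.slogdet(G)
    return float(logdet)

def attention_diagonality(A):
    """AD: mean attention mass tokens place on themselves, averaged over
    heads. `A` has shape (heads, seq_len, seq_len) with row-stochastic
    attention maps."""
    return float(np.mean(np.diagonal(A, axis1=1, axis2=2)))
```

Sanity checks: uniform logits give TE = log(vocab_size), and identity attention maps give AD = 1. CORVUS's training objective, as described above, nudges these quantities toward the ranges a detector associates with grounded generations.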
🛡️ Threat Analysis
CORVUS attacks single-pass hallucination detectors (output integrity and reliability verification systems that assess whether LLM outputs are grounded) by reshaping the internal telemetry they read: token entropy, hidden log-volume, and attention diagonality. This is a direct attack on verifiable-inference mechanisms that check LLM output trustworthiness, fitting ML09's coverage of tampering with output integrity and defeating output-verification schemes.