CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models
Nay Myat Min, Long H. Pham, Hongyu Zhang, Jun Sun
Published on arXiv (arXiv:2601.14310)
Output Integrity Attack
OWASP ML Top 10 — ML09
Key Finding
CORVUS degrades all tested single-pass hallucination detectors (LLM-Check, SEP, ICR-probe) across four open-weight LLMs, using LoRA adapters trained on only 1,000 out-of-distribution (OOD) instructions with fewer than 0.5% of model parameters trainable
CORVUS
Novel technique introduced
Single-pass hallucination detectors rely on internal telemetry (e.g., uncertainty, hidden-state geometry, and attention) of large language models, implicitly assuming hallucinations leave separable traces in these signals. We study a white-box, model-side adversary that fine-tunes lightweight LoRA adapters on the model while keeping the detector fixed, and introduce CORVUS, an efficient red-teaming procedure that learns to camouflage detector-visible telemetry under teacher forcing, including an embedding-space FGSM attention stress test. Trained on 1,000 out-of-distribution Alpaca instructions (<0.5% trainable parameters), CORVUS transfers to FAVA-Annotation across Llama-2, Vicuna, Llama-3, and Qwen2.5, and degrades both training-free detectors (e.g., LLM-Check) and probe-based detectors (e.g., SEP, ICR-probe), motivating adversary-aware auditing that incorporates external grounding or cross-model evidence.
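The embedding-space FGSM stress test mentioned in the abstract can be pictured as a single sign-gradient step on the input embeddings that pushes attention mass off the diagonal. The sketch below is a dependency-free toy, not the paper's implementation: `attention_diagonality` is a stand-in for the real attention signal (computed here from a single synthetic head), the loss choice is an illustrative assumption, and the gradient is estimated by finite differences rather than autograd.

```python
import numpy as np

def attention_diagonality(E, tau=1.0):
    """Diagonality of a toy self-attention map built from embeddings E.

    Rows of softmax(E E^T / tau) are attention distributions; diagonality
    is the mean probability mass each token assigns to itself. This is a
    stand-in for the paper's attention-diagonality signal, which is
    defined on the model's actual attention heads.
    """
    scores = E @ E.T / tau
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    return float(np.mean(np.diag(probs)))

def fgsm_embedding_stress(E, eps=0.05, h=1e-4):
    """One FGSM step on the embeddings: E_adv = E + eps * sign(grad L),
    with loss L = -diagonality, so ascending L drives attention off the
    diagonal. The gradient is estimated by central finite differences,
    standing in for autograd in this self-contained sketch."""
    grad = np.zeros_like(E)
    for i in range(E.shape[0]):
        for j in range(E.shape[1]):
            Ep, Em = E.copy(), E.copy()
            Ep[i, j] += h
            Em[i, j] -= h
            grad[i, j] = (-attention_diagonality(Ep)
                          + attention_diagonality(Em)) / (2 * h)
    return E + eps * np.sign(grad)
```

For a small random embedding matrix, one such step lowers the toy diagonality score, which is the direction a detector watching attention concentration would be stressed in.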
Key Contributions
- CORVUS: a LoRA adapter fine-tuning procedure under teacher forcing that reshapes token entropy, hidden log-volume, and attention diagonality — including an embedding-space FGSM attention stress term — to evade single-pass hallucination detectors
- OOD transfer: adapters trained on only 1,000 Alpaca instructions (<0.5% trainable parameters) transfer to the FAVA-Annotation benchmark across Llama-2, Vicuna, Llama-3, and Qwen2.5 without retraining
- Mechanistic analysis connecting controlled token-entropy (TE), hidden log-volume (HV), and attention-diagonality (AD) shifts to detector failure, showing how modest adapter updates collapse telemetry separability across both training-free (LLM-Check) and probe-based (SEP, ICR-probe) detectors
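The three telemetry signals the contributions refer to (TE, HV, AD) can be illustrated with minimal numpy stand-ins. These are simplified proxies under our own naming; the paper's exact definitions (layer choices, normalization, regularization) may differ.

```python
import numpy as np

def token_entropy(logits):
    """TE: mean Shannon entropy (nats) of the next-token distributions.
    `logits` has shape (seq_len, vocab_size)."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=1).mean())

def hidden_log_volume(H):
    """HV: log-volume spanned by hidden states, via the log-determinant
    of the regularized Gram matrix H H^T. `H` has shape
    (seq_len, hidden_dim); the small ridge keeps the determinant finite."""
    G = H @ H.T + 1e-6 * np.eye(H.shape[0])
    _, logdet = np.linalg.slogdet(G)
    return float(logdet)

def attention_diagonality(A):
    """AD: mean attention mass tokens place on themselves, averaged over
    heads. `A` has shape (heads, seq_len, seq_len) with row-stochastic
    attention maps."""
    return float(np.mean(np.diagonal(A, axis1=1, axis2=2)))
```

Sanity checks: uniform logits give TE = log(vocab_size), and identity attention maps give AD = 1. CORVUS's training objective, as described above, nudges these quantities toward the ranges a detector associates with grounded generations.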
🛡️ Threat Analysis
CORVUS attacks single-pass hallucination detectors (output integrity and reliability verification systems that assess whether LLM outputs are grounded) by reshaping the internal telemetry they read: token entropy, hidden log-volume, and attention diagonality. This is a direct attack on verifiable-inference mechanisms that check LLM output trustworthiness, fitting ML09's coverage of tampering with output integrity and defeating output-verification schemes.