
CORVUS: Red-Teaming Hallucination Detectors via Internal Signal Camouflage in Large Language Models

Nay Myat Min¹, Long H. Pham¹, Hongyu Zhang², Jun Sun¹



Published on arXiv · 2601.14310

Output Integrity Attack

OWASP ML Top 10 — ML09

Key Finding

CORVUS degrades all tested single-pass hallucination detectors (LLM-Check, SEP, ICR-probe) across four open-weight LLMs using adapters trained on only 1,000 OOD instructions with fewer than 0.5% trainable parameters

CORVUS

Novel technique introduced


Single-pass hallucination detectors rely on internal telemetry (e.g., uncertainty, hidden-state geometry, and attention) of large language models, implicitly assuming hallucinations leave separable traces in these signals. We study a white-box, model-side adversary that fine-tunes lightweight LoRA adapters on the model while keeping the detector fixed, and introduce CORVUS, an efficient red-teaming procedure that learns to camouflage detector-visible telemetry under teacher forcing, including an embedding-space FGSM attention stress test. Trained on 1,000 out-of-distribution Alpaca instructions (<0.5% trainable parameters), CORVUS transfers to FAVA-Annotation across Llama-2, Vicuna, Llama-3, and Qwen2.5, and degrades both training-free detectors (e.g., LLM-Check) and probe-based detectors (e.g., SEP, ICR-probe), motivating adversary-aware auditing that incorporates external grounding or cross-model evidence.
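The abstract mentions an embedding-space FGSM attention stress test. The paper's actual objective is not reproduced here; the sketch below only illustrates the generic FGSM step it builds on (perturb each embedding coordinate by ε in the sign of the loss gradient), using a hypothetical quadratic surrogate loss so the gradient can be written analytically:

```python
import numpy as np

def fgsm_step(emb, grad, eps=0.01):
    """One FGSM step in embedding space: move each coordinate
    by eps in the direction that increases the loss."""
    return emb + eps * np.sign(grad)

# Toy surrogate loss L(e) = ||W e - t||^2, standing in for the
# paper's attention stress objective (an assumption, not the
# actual CORVUS loss). Its gradient is dL/de = 2 W^T (W e - t).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
t = rng.normal(size=4)
e = rng.normal(size=8)            # a single token embedding
grad = 2.0 * W.T @ (W @ e - t)
e_adv = fgsm_step(e, grad, eps=0.05)

# FGSM bounds the per-coordinate perturbation by eps.
assert np.max(np.abs(e_adv - e)) <= 0.05 + 1e-12
```

In the paper's setting this perturbation is applied to input embeddings during adapter training to stress-test attention patterns; here it is shown on a standalone vector for clarity.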


Key Contributions

  • CORVUS: a LoRA adapter fine-tuning procedure under teacher forcing that reshapes token entropy, hidden log-volume, and attention diagonality — including an embedding-space FGSM attention stress term — to evade single-pass hallucination detectors
  • OOD transfer: adapters trained on only 1,000 Alpaca instructions (<0.5% trainable parameters) transfer to the FAVA-Annotation benchmark across Llama-2, Vicuna, Llama-3, and Qwen2.5 without retraining
  • Mechanistic analysis connecting controlled shifts in token entropy (TE), hidden log-volume (HV), and attention diagonality (AD) to detector failure, showing how modest adapter updates collapse telemetry separability across both training-free (LLM-Check) and probe-based (SEP, ICR-probe) detectors
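The three telemetry signals named above can be made concrete with minimal numpy stand-ins. These are illustrative definitions under plausible assumptions (mean Shannon entropy over next-token distributions, log-determinant of the hidden-state Gram matrix as a volume proxy, and diagonal mass of an attention matrix), not the exact formulas used by the detectors:

```python
import numpy as np

def token_entropy(probs):
    # probs: (T, V) next-token distributions; mean Shannon entropy in nats
    p = np.clip(probs, 1e-12, 1.0)
    return float(np.mean(-(p * np.log(p)).sum(axis=-1)))

def hidden_log_volume(H):
    # H: (T, d) hidden states; log-det of the (regularized) Gram matrix,
    # a proxy for how spread out the hidden-state geometry is
    G = H @ H.T + 1e-6 * np.eye(H.shape[0])
    _, logdet = np.linalg.slogdet(G)
    return float(logdet)

def attention_diagonality(A):
    # A: (T, T) row-stochastic attention matrix; fraction of mass
    # that stays on the diagonal (self-attention to the same position)
    return float(np.trace(A) / A.shape[0])

# Sanity checks: uniform distributions maximize entropy at log(V),
# and a perfectly diagonal attention matrix scores 1.0.
assert abs(token_entropy(np.full((3, 4), 0.25)) - np.log(4)) < 1e-9
assert attention_diagonality(np.eye(5)) == 1.0
```

An adapter that suppresses TE, HV, and AD shifts on hallucinated spans would, under this framing, remove exactly the statistics these detectors separate on.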

🛡️ Threat Analysis

Output Integrity Attack

CORVUS attacks single-pass hallucination detectors — output integrity/reliability verification systems that assess whether LLM outputs are grounded — by reshaping internal telemetry (token entropy, hidden log-volume, attention diagonality) to defeat detector-visible signals. This is a direct attack on verifiable inference mechanisms that check LLM output trustworthiness, fitting ML09's coverage of tampering with output integrity and defeating output verification schemes.


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, training_time, inference_time, targeted, digital
Datasets
FAVA-Annotation, Alpaca
Applications
hallucination detection, llm reliability assessment