
H-Node Attack and Defense in Large Language Models

Eric Yocam 1, Varghese Vaidyan 2, Yong Wang 3


Published on arXiv (2603.26045)

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Attack achieves 3.02× hallucination selectivity; adaptive defense reduces grounded drift by 33-42% with <5% perplexity impact and ≤3% MMLU degradation

H-Node ANC

Novel technique introduced


We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends against hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes the hallucination signal to a small set of high-variance dimensions -- termed Hallucination Nodes (H-Nodes) -- with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving 3.02× selectivity with less than 10% visibility to the defender. The adaptive ANC defense suppresses H-Node excess during the forward pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% relative to static cancellation. A dynamic iterative extension that re-ranks cancellation targets across successive passes recovers robustness of up to 0.69 from a single-pass baseline of 0.08. All contributions are validated on OPT-125M, Phi-3-mini-4k-instruct, LLaMA-3-8B-Instruct, and Mistral-7B-Instruct-v0.3 (125M-8B parameters). The perplexity impact is surgical (<5%) and MMLU degradation is at most 3%, confirming that the defense does not impair general reasoning capability.
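The localization step described above can be sketched as follows. This is a minimal illustration, not the paper's code: synthetic vectors stand in for last-token hidden states, a few planted dimensions stand in for H-Nodes, and ranking dimensions by probe coefficient magnitude is an assumed (simplest plausible) selection heuristic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for last-token hidden states: 400 examples, 64 dims.
# A few dimensions (the would-be H-Nodes) carry the hallucination label;
# everything else is noise.
n, d = 400, 64
h_nodes_true = [3, 17, 42]               # planted signal dims (illustrative)
y = rng.integers(0, 2, size=n)           # 1 = hallucinated, 0 = grounded
X = rng.normal(size=(n, d))
X[:, h_nodes_true] += 3.0 * y[:, None]   # shift those dims on hallucinated rows

# Linear probe on hidden states, as in the paper's localization step.
probe = LogisticRegression(max_iter=1000).fit(X, y)
auc = roc_auc_score(y, probe.decision_function(X))

# Rank dimensions by |coefficient| to read off candidate H-Nodes.
ranked = np.argsort(-np.abs(probe.coef_[0]))
candidate_h_nodes = ranked[:5]
print(f"probe AUC = {auc:.2f}")
print("top dims:", sorted(candidate_h_nodes.tolist()))
```

With a strong planted signal, the planted dimensions dominate the probe's coefficients, so the top-ranked dimensions recover them.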


Key Contributions

  • Identifies Hallucination Nodes (H-Nodes) in transformer hidden states that localize hallucination signal with 0.90 AUC via logistic regression probes
  • White-box attack amplifying H-Node activations via real-time forward hooks, achieving 3.02× selectivity with <10% defender visibility
  • Adaptive ANC defense with confidence-weighted cancellation reducing grounded activation drift by 33-42%, and dynamic iterative re-ranking recovering 0.69 robustness
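The confidence-weighted cancellation in the third contribution can be illustrated with a simple functional form. This is a sketch under assumptions, not the paper's implementation: the update `h' = h - c * (h - mu)` on H-Node dimensions, where `c` is a probe-derived confidence and `mu` a grounded-activation baseline, is one plausible reading of "confidence-weighted cancellation"; the function name and arguments are invented for illustration.

```python
import numpy as np

def anc_cancel(hidden, h_nodes, grounded_mean, confidence):
    """Confidence-weighted cancellation of H-Node excess (illustrative form).

    hidden        : (d,) hidden-state vector at the hooked layer
    h_nodes       : indices of hallucination-linked dimensions
    grounded_mean : per-dimension mean activation estimated on grounded text
    confidence    : probe-derived hallucination probability in [0, 1]
    """
    out = hidden.copy()
    excess = out[h_nodes] - grounded_mean[h_nodes]
    out[h_nodes] -= confidence * excess   # pull H-Nodes back toward grounded levels
    return out

# Toy check: full confidence restores the grounded mean on H-Node dims,
# zero confidence leaves the state untouched.
d = 8
h = np.arange(d, dtype=float)
mu = np.zeros(d)
nodes = np.array([2, 5])
restored = anc_cancel(h, nodes, mu, confidence=1.0)
print(restored[nodes])                # -> [0. 0.]
untouched = anc_cancel(h, nodes, mu, confidence=0.0)
print(np.allclose(untouched, h))      # -> True
```

Scaling the cancellation by confidence is what limits grounded activation drift: on text the probe considers grounded, the correction shrinks toward zero instead of being applied at full strength as in static cancellation.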

🛡️ Threat Analysis

Input Manipulation Attack

White-box adversarial attack manipulating hidden-state activations at inference time via forward hooks to amplify hallucination behavior. This is input manipulation at the representation level (inference-time perturbation of internal activations), not training-time poisoning.
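The mechanics of the attack described above can be sketched with a PyTorch forward hook. This is an illustrative toy, not the paper's attack: a plain `nn.Linear` stands in for a transformer block's hidden-state output, and the H-Node indices and amplification gain are made-up values.

```python
import torch
import torch.nn as nn

# Stand-in for a layer whose output is the hidden state being perturbed.
torch.manual_seed(0)
layer = nn.Linear(16, 16)

H_NODES = [3, 7, 11]   # probe-selected dimensions (illustrative indices)
GAMMA = 4.0            # amplification gain (assumed value)

def amplify_h_nodes(module, inputs, output):
    # Scale only the H-Node dimensions; leaving the rest untouched is what
    # keeps the perturbation hard to spot (low defender visibility).
    out = output.clone()
    out[..., H_NODES] *= GAMMA
    return out            # returning a tensor replaces the layer's output

handle = layer.register_forward_hook(amplify_h_nodes)

x = torch.randn(1, 16)
attacked = layer(x)       # forward pass with the hook active
handle.remove()
clean = layer(x)          # same input, hook removed

# Only the targeted dimensions differ between the two passes.
changed = (attacked != clean).nonzero(as_tuple=True)[-1].unique().tolist()
print(changed)
```

Because the hook runs inside the forward pass, the perturbation applies in real time at inference without touching weights or training data, matching the inference-time, representation-level framing above.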


Details

Domains
nlp
Model Types
llm, transformer
Threat Tags
white_box, inference_time, targeted
Datasets
TruthfulQA
Applications
language generation, factual question answering