Attack · 2025

NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation of LLM via Genetic Optimization of External Knowledge

Hanyu Zhu 1, Lance Fiondella 1, Jiawei Yuan 1, Kai Zeng 2, Long Jiao 1

1 citation · 52 references · arXiv


Published on arXiv · 2510.21144

Input Manipulation Attack

OWASP ML Top 10 — ML01

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Achieves over 90% Population Overwrite Success Rate (POSR) on SQuAD 2.0 with LLaMA-2-7b-chat, up from a ~40–50% baseline, including on knowledge-conflict queries where the model has strong parametric memory of the correct answer.

NeuroGenPoisoning

Novel technique introduced


Retrieval-Augmented Generation (RAG) empowers Large Language Models (LLMs) to dynamically integrate external knowledge during inference, improving their factual accuracy and adaptability. However, adversaries can inject poisoned external knowledge to override the model's internal memory. While existing attacks iteratively manipulate the retrieval content or prompt structure of RAG, they largely ignore the model's internal representation dynamics and neuron-level sensitivities. The underlying mechanism of RAG poisoning has not been fully studied, and the effect of knowledge conflict with strong parametric knowledge in RAG has not been considered. In this work, we propose NeuroGenPoisoning, a novel attack framework that generates adversarial external knowledge for RAG, guided by LLM internal neuron attribution and genetic optimization. Our method first identifies a set of Poison-Responsive Neurons whose activation strongly correlates with contextual poisoning knowledge. We then employ a genetic algorithm to evolve adversarial passages that maximally activate these neurons. Crucially, our framework enables massive-scale generation of effective poisoned RAG knowledge by identifying and reusing promising but initially unsuccessful external knowledge variants via observed attribution signals. At the same time, poisoning guided by Poison-Responsive Neurons effectively resolves knowledge conflict. Experimental results across models and datasets demonstrate that our method consistently achieves a high Population Overwrite Success Rate (POSR) of over 90% while preserving fluency. Empirical evidence further shows that our method effectively resolves knowledge conflict.
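The neuron-attribution step described above can be illustrated with a minimal, self-contained sketch: Integrated Gradients attributions of a few toy "neuron" activations are computed for a poisoned-context input against a clean-context baseline, and neurons are ranked by attribution magnitude to find candidate Poison-Responsive Neurons. The toy model, finite-difference gradients, and all variable names are illustrative assumptions, not the paper's implementation, which attributes over real LLM hidden states.

```python
import numpy as np

def integrated_gradients(f, x, baseline, steps=64):
    """Midpoint-rule approximation of Integrated Gradients attributions
    of scalar function f at x, relative to a baseline input."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.zeros_like(x)
    eps = 1e-5
    for a in alphas:
        point = baseline + a * (x - baseline)
        # central finite-difference gradient of f at `point`
        for i in range(x.size):
            e = np.zeros_like(x)
            e[i] = eps
            grads[i] += (f(point + e) - f(point - e)) / (2 * eps)
    return (x - baseline) * grads / steps

# Toy "neurons": tanh of a weighted sum of context features (assumption).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))       # 4 neurons, 8 context features
x_poison = rng.normal(size=8)     # features with the poisoned passage present
x_clean = np.zeros(8)             # baseline: clean context

# Total attribution per neuron of the poisoned input vs. the clean baseline.
scores = np.array([
    integrated_gradients(lambda v, w=W[k]: np.tanh(w @ v), x_poison, x_clean).sum()
    for k in range(4)
])
responsive = np.argsort(-np.abs(scores))  # ranked candidate Poison-Responsive Neurons
```

A useful sanity check is the IG completeness property: the attributions for a neuron sum (up to discretization error) to that neuron's activation change between the poisoned and clean contexts.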


Key Contributions

  • Introduces Poison-Responsive Neurons — neurons identified via Integrated Gradients whose activation strongly correlates with contextual knowledge overriding — as an optimization signal for adversarial passage generation.
  • Proposes a genetic algorithm that evolves adversarial RAG passages guided by neuron attribution scores, enabling massive-scale generation and reuse of promising failed candidates.
  • Demonstrates effective resolution of knowledge conflict in RAG, achieving >90% Population Overwrite Success Rate (POSR) across SQuAD 2.0, TriviaQA, and WikiQA on LLaMA-2, Vicuna, and Gemma models.
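The genetic search over adversarial passages can be sketched with a toy GA: bit-string genomes stand in for passage token choices, the fitness function is a surrogate for summed Poison-Responsive Neuron activation, and an archive retains promising-but-unsuccessful variants for reuse, mirroring the contributions above. The surrogate fitness, hyperparameters, and all names are assumptions for illustration only.

```python
import random

random.seed(0)
TARGET = [1, 0, 1, 1, 0, 1, 0, 0]  # toy token pattern that maximizes "activation"

def fitness(genome):
    """Surrogate for summed Poison-Responsive Neuron activation (assumption:
    the real attack scores passages by attribution-weighted activations)."""
    return sum(g == t for g, t in zip(genome, TARGET))

def evolve(pop_size=20, generations=40, mut_rate=0.1):
    pop = [[random.randint(0, 1) for _ in TARGET] for _ in range(pop_size)]
    archive = []  # reuse of promising but initially unsuccessful variants
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        archive.extend(pop[:2])                    # keep high-attribution candidates
        parents = pop[:pop_size // 2] + archive[-2:]
        children = []
        while len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(TARGET))
            child = a[:cut] + b[cut:]              # one-point crossover
            child = [1 - g if random.random() < mut_rate else g for g in child]
            children.append(child)                 # bit-flip mutation applied above
        pop = children
    return max(pop, key=fitness)

best = evolve()
```

In the paper's setting the genome would encode an actual text passage and fitness would come from attribution signals on the target LLM; the archive is what enables the massive-scale reuse of near-miss candidates.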

🛡️ Threat Analysis

Input Manipulation Attack

Adversarial document injection against RAG: passages are strategically crafted using white-box gradient signals (Integrated Gradients) and genetic optimization to manipulate LLM outputs. This falls under the dual ML01 + LLM01 case: adversarial content manipulation of an LLM-integrated system via RAG injection.


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
white_box · inference_time · targeted · digital
Datasets
SQuAD 2.0 · TriviaQA · WikiQA
Applications
retrieval-augmented generationopen-domain question answeringllm chatbots