MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Large Language Model (LLM) agents increasingly rely on long-term memory and Retrieval-Augmented Generation (RAG) to persist experiences and refine future performance. While this experience learning capability enhances agentic autonomy, it introduces a critical, unexplored attack surface, i.e., the trust boundary between an agent's reasoning core and its own past. In this paper, we introduce MemoryGraft. It is a novel indirect injection attack that compromises agent behavior not through immediate jailbreaks, but by implanting malicious successful experiences into the agent's long-term memory. Unlike traditional prompt injections that are transient, or standard RAG poisoning that targets factual knowledge, MemoryGraft exploits the agent's semantic imitation heuristic which is the tendency to replicate patterns from retrieved successful tasks. We demonstrate that an attacker who can supply benign ingestion-level artifacts that the agent reads during execution can induce it to construct a poisoned RAG store where a small set of malicious procedure templates is persisted alongside benign experiences. When the agent later encounters semantically similar tasks, union retrieval over lexical and embedding similarity reliably surfaces these grafted memories, and the agent adopts the embedded unsafe patterns, leading to persistent behavioral drift across sessions. We validate MemoryGraft on MetaGPT's DataInterpreter agent with GPT-4o and find that a small number of poisoned records can account for a large fraction of retrieved experiences on benign workloads, turning experience-based self-improvement into a vector for stealthy and durable compromise. To facilitate reproducibility and future research, our code and evaluation data are available at https://github.com/Jacobhhy/Agent-Memory-Poisoning.

Key Contributions

MemoryGraft: a single-shot, trigger-free indirect memory poisoning attack that exploits the agent's semantic imitation heuristic to induce persistent behavioral drift across sessions.
Demonstrates that benign ingestion-level artifacts (documents the agent reads during execution) can cause the agent itself to write malicious experience templates into its own long-term memory/RAG store.
Empirically validates on MetaGPT's DataInterpreter (GPT-4o) that a small number of poisoned records dominate semantic retrieval on benign workloads, achieving durable, covert compromise.

🛡️ Threat Analysis

Data Poisoning Attack

The core attack vector is corrupting the agent's long-term memory/RAG store by implanting malicious 'successful experience' records, which is data poisoning of the agent's knowledge base — the attacker corrupts the data store that drives future agent behavior.

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_timetargeted

Datasets

MetaGPT DataInterpreter evaluation tasks

Applications

2025 1 cit.

Data Poisoning Attack

80%

MemoryGraft: Persistent Compromise of LLM Agents via Poisoned Experience Retrieval

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System

RIPRAG: Hack a Black-box Retrieval-Augmented Generation Question-Answering System with Reinforcement Learning

When Safety Becomes a Vulnerability: Exploiting LLM Alignment Homogeneity for Transferable Blocking in RAG

ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

Confundo: Learning to Generate Robust Poison for Practical RAG Systems

KEPo: Knowledge Evolution Poison on Graph-based Retrieval-Augmented Generation

MIRAGE: Misleading Retrieval-Augmented Generation via Black-box and Query-agnostic Poisoning Attacks

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers