defense 2026

Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Balachandra Devarangadi Sunil , Isheeta Sinha , Piyush Maheshwari , Shantanu Todmal , Shreyan Mallik , Shuchi Mishra

University of Massachusetts Amherst

1 citations · 28 references · arXiv

Published on arXiv

2601.05504

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Pre-existing legitimate memories dramatically reduce MINJA attack effectiveness in realistic deployments, and effective memory sanitization requires careful trust threshold calibration to avoid both over-rejection and under-filtering.

Memory Sanitization with Trust-Aware Retrieval

Novel technique introduced

Large language model agents equipped with persistent memory are vulnerable to memory poisoning attacks, where adversaries inject malicious instructions through query only interactions that corrupt the agents long term memory and influence future responses. Recent work demonstrated that the MINJA (Memory Injection Attack) achieves over 95 % injection success rate and 70 % attack success rate under idealized conditions. However, the robustness of these attacks in realistic deployments and effective defensive mechanisms remain understudied. This work addresses these gaps through systematic empirical evaluation of memory poisoning attacks and defenses in Electronic Health Record (EHR) agents. We investigate attack robustness by varying three critical dimensions: initial memory state, number of indication prompts, and retrieval parameters. Our experiments on GPT-4o-mini, Gemini-2.0-Flash and Llama-3.1-8B-Instruct models using MIMIC-III clinical data reveal that realistic conditions with pre-existing legitimate memories dramatically reduce attack effectiveness. We then propose and evaluate two novel defense mechanisms: (1) Input/Output Moderation using composite trust scoring across multiple orthogonal signals, and (2) Memory Sanitization with trust-aware retrieval employing temporal decay and pattern-based filtering. Our defense evaluation reveals that effective memory sanitization requires careful trust threshold calibration to prevent both overly conservative rejection (blocking all entries) and insufficient filtering (missing subtle attacks), establishing important baselines for future adaptive defense mechanisms. These findings provide crucial insights for securing memory-augmented LLM agents in production environments.

Key Contributions

Systematic empirical evaluation of MINJA attack robustness under realistic conditions (pre-existing memories, varied retrieval parameters) showing significant reduction in attack effectiveness compared to idealized settings
Input/Output Moderation defense using composite trust scoring across multiple orthogonal signals to detect poisoned memory interactions
Memory Sanitization defense with trust-aware retrieval employing temporal decay and pattern-based filtering, with analysis of trust threshold calibration tradeoffs

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_timetargeted

Datasets

MIMIC-III

Applications

electronic health record (ehr) agentsmemory-augmented llm agentsclinical decision support

Read PDF arXiv DOI Code

Memory Poisoning Attack and Defense on Memory Based LLM-Agents

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Rescuing the Unpoisoned: Efficient Defense against Knowledge Corruption Attacks on RAG Systems

RAJ-PGA: Reasoning-Activated Jailbreak and Principle-Guided Alignment Framework for Large Reasoning Models

Bias Injection Attacks on RAG Databases and Sanitization Defenses

A Causal Perspective for Enhancing Jailbreak Attack and Defense

Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

Robust Safety Monitoring of Language Models via Activation Watermarking

RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning