Memory poisoning and secure multi-agent systems

Memory poisoning attacks for Agentic AI and multi-agent systems (MAS) have recently caught attention. It is partially due to the fact that Large Language Models (LLMs) facilitate the construction and deployment of agents. Different memory systems are being used nowadays in this context, including semantic, episodic, and short-term memory. This distinction between the different types of memory systems focuses mostly on their duration but also on their origin and their localization. It ranges from the short-term memory originated at the user's end localized in the different agents to the long-term consolidated memory localized in well established knowledge databases. In this paper, we first present the main types of memory systems, we then discuss the feasibility of memory poisoning attacks in these different types of memory systems, and we propose mitigation strategies. We review the already existing security solutions to mitigate some of the alleged attacks, and we discuss adapted solutions based on cryptography. We propose to implement local inference based on private knowledge retrieval as an example of mitigation strategy for memory poisoning for semantic memory. We also emphasize actual risks in relation to interactions between agents, which can cause memory poisoning. These latter risks are not so much studied in the literature and are difficult to formalize and solve. Thus, we contribute to the construction of agents that are secure by design.

Key Contributions

Taxonomy of memory poisoning attacks across semantic, episodic, and short-term memory systems in LLM-based agents
Cryptographic mitigation strategies including private knowledge retrieval for semantic memory protection
Analysis of inter-agent interaction risks that cause memory poisoning in multi-agent systems

🛡️ Threat Analysis

Data Poisoning Attack

Memory poisoning is a form of data poisoning where malicious agents corrupt the memory/knowledge bases that LLM-based agents use for decision-making. The paper discusses attacks where adversaries inject malicious information into agent memory systems (semantic, episodic, short-term) to corrupt behavior.