defense 2025

BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

Rui Miao ¹,², Yixin Liu ², Yili Wang ¹, Xu Shen ¹, Yue Tan ³, Yiwei Dai ¹, Shirui Pan ², Xin Wang ¹

¹ Jilin University

² Griffith University

³ University of New South Wales

0 citations

Published on arXiv

2508.08127

Excessive Agency

OWASP LLM Top 10 — LLM08

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

BlindGuard detects prompt injection, memory poisoning, and tool attacks across MAS with diverse communication patterns, outperforming supervised baselines in generalizability without requiring any labeled malicious agent data.

BlindGuard

Novel technique introduced

The security of LLM-based multi-agent systems (MAS) is critically threatened by propagation vulnerability, where malicious agents can distort collective decision-making through inter-agent message interactions. While existing supervised defense methods demonstrate promising performance, they may be impractical in real-world scenarios due to their heavy reliance on labeled malicious agents to train a supervised malicious detection model. To enable practical and generalizable MAS defenses, in this paper, we propose BlindGuard, an unsupervised defense method that learns without requiring any attack-specific labels or prior knowledge of malicious behaviors. To this end, we establish a hierarchical agent encoder to capture individual, neighborhood, and global interaction patterns of each agent, providing a comprehensive understanding for malicious agent detection. Meanwhile, we design a corruption-guided detector that consists of directional noise injection and contrastive learning, allowing effective detection model training solely on normal agent behaviors. Extensive experiments show that BlindGuard effectively detects diverse attack types (i.e., prompt injection, memory poisoning, and tool attack) across MAS with various communication patterns while maintaining superior generalizability compared to supervised baselines. The code is available at: https://github.com/MR9812/BlindGuard.
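To make the idea of combining individual, neighborhood, and global views concrete, here is a minimal sketch in plain Python. It is not the paper's encoder: the mean-pooling aggregation, the concatenation, and the names `mean_vector` and `hierarchical_features` are all assumptions chosen for illustration, and the per-agent feature vectors are toy values.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector], dim: int) -> Vector:
    """Element-wise mean of a list of vectors; zeros if the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def hierarchical_features(
    features: Dict[str, Vector],
    neighbors: Dict[str, List[str]],
) -> Dict[str, Vector]:
    """Concatenate three views per agent (hypothetical aggregation scheme):
    its own features, the mean of its neighbors, and the mean of all agents."""
    dim = len(next(iter(features.values())))
    global_view = mean_vector(list(features.values()), dim)
    encoded = {}
    for agent, own in features.items():
        neighbor_view = mean_vector(
            [features[n] for n in neighbors.get(agent, [])], dim
        )
        # List concatenation: [individual | neighborhood | global]
        encoded[agent] = own + neighbor_view + global_view
    return encoded


# Toy communication graph with three agents (illustrative values only).
features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
encoded = hierarchical_features(features, neighbors)
```

Each agent ends up with a vector three times the input dimension. A learned encoder would replace the fixed mean pooling with trainable layers, but the three-level structure is the same.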

Key Contributions

Unsupervised malicious agent detection for LLM-based MAS requiring no attack-specific labels or prior knowledge of malicious behaviors
Hierarchical agent encoder capturing individual, neighborhood, and global interaction patterns across the agent communication graph
Corruption-guided detector combining directional noise injection and contrastive learning, trained solely on normal agent behavior
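The corruption-guided idea in the last contribution can be illustrated with a small sketch: start from normal agent embeddings only, push copies of them along a chosen direction to create pseudo-anomalies, and check that a similarity-based score separates the two groups. Everything below is an assumption made for illustration: the direction (orthogonal to the centroid), the noise scale, the cosine-based score, and the helper names are hypothetical, and no contrastive objective is actually trained here.

```python
import math
import random
from typing import List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Element-wise mean of the given vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)


def corrupt(vec: Vector, direction: Vector, scale: float,
            rng: random.Random) -> Vector:
    """Shift vec along the unit-normalized direction by a random step
    drawn from [0.5 * scale, scale] (hypothetical corruption scheme)."""
    norm = math.sqrt(sum(d * d for d in direction)) or 1.0
    step = scale * rng.uniform(0.5, 1.0)
    return [x + step * d / norm for x, d in zip(vec, direction)]


def anomaly_score(vec: Vector, reference: Vector) -> float:
    """Higher means less similar to the reference of normal behavior."""
    return 1.0 - cosine(vec, reference)


rng = random.Random(0)
normal = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]  # toy normal embeddings
center = centroid(normal)
direction = [-center[1], center[0]]  # assumed: orthogonal to the centroid
pseudo = [corrupt(v, direction, scale=2.0, rng=rng) for v in normal]

normal_scores = [anomaly_score(v, center) for v in normal]
pseudo_scores = [anomaly_score(v, center) for v in pseudo]
```

In a contrastive setup, pairs like `normal` and `pseudo` would serve as positives and negatives when training the detector, so that no labeled malicious agents are needed. This sketch only shows that the corrupted copies score as more anomalous than the originals.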

🛡️ Threat Analysis

Details

Domains

nlp, graph

Model Types

llm, transformer, gnn

Threat Tags

inference_time, black_box

Applications

llm multi-agent systems, agent communication security

Similar Papers

GuardNet: Graph-Attention Filtering for Jailbreak Defense in Large Language Models

Structural Representations for Cross-Attack Generalization in AI Agent Threat Detection

Monitoring LLM-based Multi-Agent Systems Against Corruptions via Node Evaluation

Secure and Efficient Access Control for Computer-Use Agents via Context Space

AI Kill Switch for malicious web-based LLM agent

ceLLMate: Sandboxing Browser AI Agents

AgenTRIM: Tool Risk Mitigation for Agentic AI

Building Browser Agents: Architecture, Security, and Practical Solutions