Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems
Ziyuan Yang 1, Wenxuan Ding 2, Shangbin Feng 1, Yulia Tsvetkov 1
Published on arXiv
2602.05176
AI Supply Chain Attacks
OWASP ML Top 10 — ML06
Model Poisoning
OWASP ML Top 10 — ML10
Key Finding
Supervisor-based external oversight recovers 96.8% of initial performance in multi-LLM systems compromised by malicious models, though full resistance to malicious contributors remains unsolved.
AmongUs
Novel technique introduced
Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models severely impact multi-LLM systems, especially in the reasoning and safety domains, where performance drops by 7.12% and 7.94% on average. We then propose mitigation strategies that alleviate the impact of malicious components by employing external supervisors that oversee model collaboration and disable or mask out malicious models to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.
Key Contributions
- Engineers four categories of malicious LMs (prompting, SFT, inverse-reward RL, activation steering) and quantifies their impact across four multi-LLM collaboration paradigms and 10 datasets
- Demonstrates severe worst-case performance drops of 34.9% from malicious models, with routing systems and safety/reasoning domains most affected
- Proposes supervisor-free and supervisor-based mitigation strategies that recover 95.31% of baseline collaboration performance on average
🛡️ Threat Analysis
The core threat model is malicious actors contributing compromised models to decentralized collaborative AI systems (routing, multi-agent debate, model merging) — a supply chain attack where untrusted third-party model contributors can degrade or subvert system behavior.
The parametric maliciousness methods (SFT on wrong/malicious outputs, RL with inverse reward functions) embed persistent malicious behavior directly into model weights, constituting model poisoning. Distributing these poisoned models through the collaboration supply chain warrants co-tagging ML06 and ML10 under the OWASP ML Top 10 taxonomy.