Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy
Jairo Gudiño-Rosero 1,2, Clément Contet 3,4, Umberto Grandi 3,4, César A. Hidalgo 2,5,6
2 Center for Collective Learning
3 Université Toulouse Capitole
4 IRIT
Published on arXiv: 2508.04281
Topics: Prompt Injection · OWASP LLM Top 10 — LLM01
Key Finding
Default LLaMA 3.1 8B, GPT-4.1 Nano, and Apertus 8B show high attack success rates under prompt injection; a combined detection, structured representation, and GSPO-RL pipeline reduces ASR to near zero on non-ambiguous consensus outcomes.
Novel technique introduced: GSPO-based robustness pipeline
Abstract
Large Language Models (LLMs) are gaining traction as a method to generate consensus statements and aggregate preferences in digital democracy experiments. Yet, LLMs could introduce critical vulnerabilities into these systems. Here, we examine the vulnerability and robustness of off-the-shelf consensus-generating LLMs to prompt-injection attacks, in which texts are injected to amplify particular viewpoints, erase certain opinions, or divert consensus toward unrelated or irrelevant topics. We construct attack-free and adversarial variants of prompts containing public policy questions and opinion texts, classify opinion and consensus valences with a fine-tuned BERT model, and estimate Attack Success Rates (ASR) from $3\times3$ confusion matrices conditional on matching human majorities. Across topics, default LLaMA 3.1 8B Instruct, GPT-4.1 Nano, and Apertus 8B exhibit widespread vulnerability, with especially high ASR for economically and socially conservative parties and for rational, instruction-like rhetorical strategies. A robustness pipeline combining GPT-OSS-SafeGuard injection detection, structured opinion representations, and GSPO-based reinforcement learning reduces ASR to near zero across parties and policy clusters when restricting attention to non-ambiguous consensus outcomes. These findings advance our understanding of both the vulnerabilities and the potential defenses of consensus-generating LLMs in digital democracy applications.
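The ASR estimation described above can be sketched as follows. The paper does not spell out the exact formula, so this minimal sketch assumes a natural reading: rows of the $3\times3$ matrix index the consensus valence without injection, columns index the valence under attack, and the off-diagonal mass (the cases where the injection flipped the classified valence) gives the ASR. The function name `attack_success_rate`, the valence ordering, and the counts are all hypothetical.

```python
def attack_success_rate(confusion):
    """Estimate ASR from a 3x3 confusion matrix.

    Rows: consensus valence on the attack-free prompt
    (e.g. support / neutral / oppose, as labeled by a classifier).
    Columns: consensus valence on the adversarial prompt.
    ASR = fraction of cases whose valence changed under attack,
    i.e. the off-diagonal mass. This reading of the metric is an
    assumption; conditioning on matching human majorities is assumed
    to happen upstream, when selecting which cases enter the matrix.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        return 0.0
    unchanged = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - unchanged) / total

# Hypothetical counts for one topic: most cases keep their valence,
# but 42 of 150 are flipped by the injection.
cm = [
    [40,  5, 15],
    [ 3, 20,  7],
    [10,  2, 48],
]
print(round(attack_success_rate(cm), 3))  # -> 0.28
```

The conditioning step matters: by only counting cases where the attack-free consensus already matched the human majority, the metric isolates attacks that overturn an otherwise correct consensus rather than cases where the model was wrong to begin with.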
Key Contributions
- A taxonomy of prompt injection strategies for consensus-generating LLMs, categorized by readability (manual vs. machine-readable), injection type (ignore vs. completion), framing (support vs. criticism), and rhetorical strategy
- Empirical evaluation showing widespread vulnerability in LLaMA 3.1 8B Instruct, GPT-4.1 Nano, and Apertus 8B — particularly for instruction-like rhetorical strategies and economically/socially conservative political positions
- A robustness pipeline combining GPT-OSS-SafeGuard injection detection, structured opinion representations, and GSPO-based reinforcement learning that reduces ASR to near zero for non-ambiguous consensus outcomes
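The attack taxonomy in the first contribution is naturally a small product space, which can be made concrete as a data structure. The dimension names come from the bullet above; treating each dimension as a fixed enumeration, and the class name `InjectionAttack`, are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import product

# Taxonomy axes named in the paper's contribution list; the specific
# value sets beyond those named there are assumptions.
READABILITY = ("manual", "machine-readable")
INJECTION_TYPE = ("ignore", "completion")
FRAMING = ("support", "criticism")

@dataclass(frozen=True)
class InjectionAttack:
    readability: str
    injection_type: str
    framing: str
    rhetoric: str  # e.g. a rational, instruction-like strategy

# Enumerate every attack variant for one hypothetical rhetorical strategy.
variants = [
    InjectionAttack(r, t, f, "instruction-like")
    for r, t, f in product(READABILITY, INJECTION_TYPE, FRAMING)
]
print(len(variants))  # 2 * 2 * 2 = 8 variants per rhetorical strategy
```

Enumerating the taxonomy this way makes the evaluation grid explicit: each rhetorical strategy multiplies the variant count, which is how a compact taxonomy yields a broad attack surface to test against.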