defense 2025

Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy

Jairo Gudiño-Rosero 1,2, Clément Contet 3,4, Umberto Grandi 3,4, César A. Hidalgo 2,5,6

0 citations

α

Published on arXiv

2508.04281

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Default LLaMA 3.1 8B, GPT-4.1 Nano, and Apertus 8B show high attack success rates under prompt injection; a combined detection, structured representation, and GSPO-RL pipeline reduces ASR to near zero on non-ambiguous consensus outcomes.

GSPO-based robustness pipeline

Novel technique introduced


Large Language Models (LLMs) are gaining traction as a method to generate consensus statements and aggregate preferences in digital democracy experiments. Yet, LLMs could introduce critical vulnerabilities in these systems. Here, we examine the vulnerability and robustness of off-the-shelf consensus-generating LLMs to prompt-injection attacks, in which texts are injected to amplify particular viewpoints, erase certain opinions, or divert consensus toward unrelated or irrelevant topics. We construct attack-free and adversarial variants of prompts containing public policy questions and opinion texts, classify opinion and consensus valences with a fine-tuned BERT model, and estimate Attack Success Rates (ASR) from $3\times3$ confusion matrices conditional on matching human majorities. Across topics, default LLaMA 3.1 8B Instruct, GPT-4.1 Nano, and Apertus 8B exhibit widespread vulnerability, with especially high ASR for economically and socially conservative parties and for rational, instruction-like rhetorical strategies. A robustness pipeline combining GPT-OSS-SafeGuard injection detection, structured opinion representations, and GSPO-based reinforcement learning reduces ASR to near zero across parties and policy clusters when restricting attention to non-ambiguous consensus outcomes. These findings advance our understanding of both the vulnerabilities and the potential defenses of consensus-generating LLMs in digital democracy applications.


Key Contributions

  • A taxonomy of prompt injection strategies for consensus-generating LLMs, categorized by readability (manual vs. machine-readable), injection type (ignore vs. completion), framing (support vs. criticism), and rhetorical strategy
  • Empirical evaluation showing widespread vulnerability in LLaMA 3.1 8B Instruct, GPT-4.1 Nano, and Apertus 8B — particularly for instruction-like rhetorical strategies and economically/socially conservative political positions
  • A robustness pipeline combining GPT-OSS-SafeGuard injection detection, structured opinion representations, and GSPO-based reinforcement learning that reduces ASR to near zero for non-ambiguous consensus outcomes

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llmtransformer
Threat Tags
inference_timeblack_boxtargeted
Datasets
Custom public policy opinion datasetDigital democracy experiment prompts
Applications
digital democracyllm-based consensus generationpreference aggregation