defense 2025

Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy

Jairo Gudiño-Rosero ^1,2, Clément Contet ^3,4, Umberto Grandi ^3,4, César A. Hidalgo ^2,5,6

¹ Université de Toulouse

² Center for Collective Learning

³ Université Toulouse Capitole

⁴ IRIT

⁵ Corvinus University of Budapest

⁶ University of Manchester

0 citations

Published on arXiv

2508.04281

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Default LLaMA 3.1 8B, GPT-4.1 Nano, and Apertus 8B show high attack success rates under prompt injection; a combined detection, structured representation, and GSPO-RL pipeline reduces ASR to near zero on non-ambiguous consensus outcomes.

GSPO-based robustness pipeline

Novel technique introduced

Large Language Models (LLMs) are gaining traction as a method to generate consensus statements and aggregate preferences in digital democracy experiments. Yet, LLMs could introduce critical vulnerabilities in these systems. Here, we examine the vulnerability and robustness of off-the-shelf consensus-generating LLMs to prompt-injection attacks, in which texts are injected to amplify particular viewpoints, erase certain opinions, or divert consensus toward unrelated or irrelevant topics. We construct attack-free and adversarial variants of prompts containing public policy questions and opinion texts, classify opinion and consensus valences with a fine-tuned BERT model, and estimate Attack Success Rates (ASR) from $3\times3$ confusion matrices conditional on matching human majorities. Across topics, default LLaMA 3.1 8B Instruct, GPT-4.1 Nano, and Apertus 8B exhibit widespread vulnerability, with especially high ASR for economically and socially conservative parties and for rational, instruction-like rhetorical strategies. A robustness pipeline combining GPT-OSS-SafeGuard injection detection, structured opinion representations, and GSPO-based reinforcement learning reduces ASR to near zero across parties and policy clusters when restricting attention to non-ambiguous consensus outcomes. These findings advance our understanding of both the vulnerabilities and the potential defenses of consensus-generating LLMs in digital democracy applications.

Key Contributions

A taxonomy of prompt injection strategies for consensus-generating LLMs, categorized by readability (manual vs. machine-readable), injection type (ignore vs. completion), framing (support vs. criticism), and rhetorical strategy
Empirical evaluation showing widespread vulnerability in LLaMA 3.1 8B Instruct, GPT-4.1 Nano, and Apertus 8B — particularly for instruction-like rhetorical strategies and economically/socially conservative political positions
A robustness pipeline combining GPT-OSS-SafeGuard injection detection, structured opinion representations, and GSPO-based reinforcement learning that reduces ASR to near zero for non-ambiguous consensus outcomes

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

inference_timeblack_boxtargeted

Datasets

Custom public policy opinion datasetDigital democracy experiment prompts

Applications

digital democracyllm-based consensus generationpreference aggregation

Read PDF arXiv

Prompt Injection Vulnerability of Consensus Generating Applications in Digital Democracy

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Robust Safety Monitoring of Language Models via Activation Watermarking

RAPO: Risk-Aware Preference Optimization for Generalizable Safe Reasoning

Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

Bias Injection Attacks on RAG Databases and Sanitization Defenses

Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks

Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs

CodeGuard: Improving LLM Guardrails in CS Education

TRYLOCK: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering