Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection

As open-weight large language models (LLMs) increase in capabilities, safeguarding them against malicious prompts and understanding possible attack vectors becomes ever more important. While automated jailbreaking methods like GCG [Zou et al., 2023] remain effective, they often require substantial computational resources and specific expertise. We introduce "sockpuppetting'', a simple method for jailbreaking open-weight LLMs by inserting an acceptance sequence (e.g., "Sure, here is how to...'') at the start of a model's output and allowing it to complete the response. Requiring only a single line of code and no optimization, sockpuppetting achieves up to 80% higher attack success rate (ASR) than GCG on Qwen3-8B in per-prompt comparisons. We also explore a hybrid approach that optimizes the adversarial suffix within the assistant message block rather than the user prompt, increasing ASR by 64% over GCG on Llama-3.1-8B in a prompt-agnostic setting. The results establish sockpuppetting as an effective low-cost attack accessible to unsophisticated adversaries, highlighting the need for defences against output-prefix injection in open-weight models.

Key Contributions

Sockpuppetting: a zero-optimization jailbreak that injects an acceptance prefix (e.g., 'Sure, here is how to...') into the model's output context, exploiting autoregressive self-consistency to force harmful completions
Hybrid attack combining sockpuppetting with GCG-style gradient optimization applied within the assistant message block rather than the user prompt, yielding prompt-agnostic adversarial suffixes
Demonstrates up to 80% higher ASR than GCG on Qwen3-8B (per-prompt) and 64% higher ASR on Llama-3.1-8B (prompt-agnostic), establishing output-prefix injection as a critical low-cost attack vector

🛡️ Threat Analysis

Input Manipulation Attack

The hybrid contribution explicitly optimizes adversarial suffixes within the assistant message block using gradient-based methods (GCG variant), qualifying as adversarial suffix optimization on LLMs — tag ML01 alongside LLM01 per dual-tagging rules for gradient-based adversarial perturbations on LLMs.

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

white_boxinference_timetargeteddigital

Applications

2025 0 cit.

Input Manipulation Attack

100%

Sockpuppetting: Jailbreaking LLMs Without Optimization Through Output Prefix Injection

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Mask-GCG: Are All Tokens in Adversarial Suffixes Necessary for Jailbreak Attacks?

Steering in the Shadows: Causal Amplification for Activation Space Attacks in Large Language Models

Hidden State Poisoning Attacks against Mamba-based Language Models

Universal Adversarial Suffixes Using Calibrated Gumbel-Softmax Relaxation

Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

NeuroGenPoisoning: Neuron-Guided Attacks on Retrieval-Augmented Generation of LLM via Genetic Optimization of External Knowledge

UniC-RAG: Universal Knowledge Corruption Attacks to Retrieval-Augmented Generation

Universal and Transferable Adversarial Attack on Large Language Models Using Exponentiated Gradient Descent