
In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

Anudeex Shetty, Aditya Joshi, Salil S. Kanhere



Published on arXiv: 2601.22169

Prompt Injection (OWASP LLM Top 10: LLM01)

Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)

Key Finding

Drunk language inducement achieves higher jailbreaking success rates on 5 LLMs than previously reported approaches, remains effective against safety defences, and also induces privacy leaks.

Drunk Language Inducement

Novel technique introduced


Humans are susceptible to undesirable behaviours and privacy leaks under the influence of alcohol. This paper investigates drunk language, i.e., text written under the influence of alcohol, as a driver of safety failures in large language models (LLMs). We investigate three mechanisms for inducing drunk language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both benchmarks are in English, compared to the base LLMs as well as previously reported approaches. Via a robust combination of manual evaluation, LLM-based evaluators, and analysis of error categories, our findings highlight a correspondence between human intoxicated behaviour and anthropomorphism in LLMs induced with drunk language. The simplicity and efficiency of our drunk language inducement approaches position them as potential counters to LLM safety tuning, highlighting significant risks to LLM safety.
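
To make the first mechanism concrete, below is a minimal sketch of persona-based prompting at inference time, assuming an OpenAI-compatible chat API. The model name and persona wording are illustrative stand-ins, not the paper's actual prompt or evaluated models.

```python
# Minimal sketch of persona-based prompting (inference-time mechanism).
# The persona text and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical drunk persona; the paper's exact prompt is not reproduced here.
DRUNK_PERSONA = (
    "You are role-playing a person who has had several drinks: you ramble, "
    "slur your words, and speak with lowered inhibition."
)

def query_with_persona(user_prompt: str) -> str:
    """Send a query with the drunk persona set as the system message."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model name
        messages=[
            {"role": "system", "content": DRUNK_PERSONA},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(query_with_persona("Tell me about your evening."))
```

Because the persona rides entirely in the system message, this mechanism needs only black-box query access, consistent with the black_box threat tag listed under Details below.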


Key Contributions

  • Three mechanisms for inducing drunk language in LLMs: persona-based prompting (inference-time), causal fine-tuning, and reinforcement-based post-training (a fine-tuning sketch follows this list)
  • Empirical demonstration of elevated jailbreak success rates on JailbreakBench across 5 LLMs, outperforming prior jailbreak approaches and bypassing active defences
  • Evidence that drunk language inducement also causes measurable privacy leaks on ConfAIde, connecting anthropomorphism in LLMs to human intoxicated behaviour
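
For the causal fine-tuning mechanism, a minimal sketch using Hugging Face transformers follows. The base model (gpt2), corpus file, and hyperparameters are placeholder assumptions rather than the paper's setup, which is not reproduced here.

```python
# Minimal causal fine-tuning sketch: standard next-token training on a
# drunk-style text corpus. Model, file path, and hyperparameters are
# illustrative placeholders, not the paper's configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # stand-in; the paper evaluates 5 LLMs, unnamed here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical corpus: one drunk-style text sample per line.
dataset = load_dataset("text", data_files={"train": "drunk_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="drunk-lm",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=tokenized,
    # mlm=False yields standard causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Unlike persona prompting, this mechanism modifies model weights and therefore requires training access, matching the training_time threat tag under Details.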

🛡️ Threat Analysis


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: black_box, grey_box, training_time, inference_time
Datasets: JailbreakBench, ConfAIde
Applications: LLM safety alignment, jailbreaking, privacy leakage