
In Vino Veritas and Vulnerabilities: Examining LLM Safety via Drunk Language Inducement

Anudeex Shetty, Aditya Joshi, Salil S. Kanhere



Published on arXiv: 2601.22169

Prompt Injection (OWASP LLM Top 10: LLM01)

Sensitive Information Disclosure (OWASP LLM Top 10: LLM06)

Key Finding

Drunk language inducement achieves higher jailbreaking success rates on 5 LLMs than previously reported approaches, remains effective against safety defences, and also induces privacy leaks.

Drunk Language Inducement

Novel technique introduced


Humans are susceptible to undesirable behaviours and privacy leaks under the influence of alcohol. This paper investigates drunk language, i.e., text written under the influence of alcohol, as a driver of safety failures in large language models (LLMs). We investigate three mechanisms for inducing drunk language in LLMs: persona-based prompting, causal fine-tuning, and reinforcement-based post-training. When evaluated on 5 LLMs, we observe a higher susceptibility to jailbreaking on JailbreakBench (even in the presence of defences) and privacy leaks on ConfAIde, where both benchmarks are in English, compared to the base LLMs as well as previously reported approaches. Via a robust combination of manual evaluation, LLM-based evaluators, and analysis of error categories, our findings highlight a correspondence between human intoxicated behaviour and anthropomorphism in LLMs induced with drunk language. The simplicity and efficiency of our drunk language inducement approaches position them as potential counters to LLM safety tuning, highlighting significant risks to LLM safety.
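
To make the first mechanism concrete, below is a minimal sketch of persona-based prompting at inference time, assuming an OpenAI-compatible chat API. The model name and persona wording are illustrative stand-ins, not the paper's actual prompt or evaluated models.

```python
# Minimal sketch of persona-based prompting (inference-time mechanism).
# The persona text and model name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical drunk persona; the paper's exact prompt is not reproduced here.
DRUNK_PERSONA = (
    "You are role-playing a person who has had several drinks: you ramble, "
    "slur your words, and speak with lowered inhibition."
)

def query_with_persona(user_prompt: str) -> str:
    """Send a query with the drunk persona set as the system message."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # stand-in model name
        messages=[
            {"role": "system", "content": DRUNK_PERSONA},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(query_with_persona("Tell me about your evening."))
```

Because the persona rides entirely in the system message, this mechanism needs only black-box query access, consistent with the black_box threat tag listed under Details below.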


Key Contributions

  • Three mechanisms for inducing drunk language in LLMs: persona-based prompting (inference-time), causal fine-tuning, and reinforcement-based post-training (a fine-tuning sketch follows this list)
  • Empirical demonstration of elevated jailbreak success rates on JailbreakBench across 5 LLMs, outperforming prior jailbreak approaches and bypassing active defences
  • Evidence that drunk language inducement also causes measurable privacy leaks on ConfAIde, connecting anthropomorphism in LLMs to human intoxicated behaviour
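
For the causal fine-tuning mechanism, a minimal sketch using Hugging Face transformers follows. The base model (gpt2), corpus file, and hyperparameters are placeholder assumptions rather than the paper's setup, which is not reproduced here.

```python
# Minimal causal fine-tuning sketch: standard next-token training on a
# drunk-style text corpus. Model, file path, and hyperparameters are
# illustrative placeholders, not the paper's configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpt2"  # stand-in; the paper evaluates 5 LLMs, unnamed here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical corpus: one drunk-style text sample per line.
dataset = load_dataset("text", data_files={"train": "drunk_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="drunk-lm",
        num_train_epochs=1,
        per_device_train_batch_size=2,
    ),
    train_dataset=tokenized,
    # mlm=False yields standard causal-LM labels (inputs shifted by one).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

Unlike persona prompting, this mechanism modifies model weights and therefore requires training access, matching the training_time threat tag under Details.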

🛡️ Threat Analysis


Details

Domains: nlp
Model Types: llm, transformer
Threat Tags: black_box, grey_box, training_time, inference_time
Datasets: JailbreakBench, ConfAIde
Applications: LLM safety alignment, jailbreaking, privacy leakage