benchmark 2025

In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Nils Durner

0 citations · 36 references · arXiv

α

Published on arXiv

2510.01259

Prompt Injection

OWASP LLM Top 10 — LLM01

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Composite prompts combining educator persona, safety-pretext framing, and step-cue phrasing flip guardrail bypass rates from 0% to 97.5% on ZIP-bomb construction; formal German and French registers are consistently leakier than matched English prompts.

Sociopragmatic Composite Jailbreak

Novel technique introduced


We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run .


Key Contributions

  • Systematic quantification of composite multilingual sociopragmatic jailbreaks across multiple harm domains (N=80 seeded iterations per scenario), flipping assistance rates from 0% to 97.5% on a ZIP-bomb task
  • Paired-track evaluation harness measuring evaluation-awareness and frame-conditioned inconsistency (13% of pairs show inconsistent assistance) alongside comparison of OpenAI Moderation API vs. semantic LLM grader
  • Empirical characterization of reproducibility gaps: refusal rates differ 5–10 percentage points across inference stacks (H100+vLLM vs RTX 5090+Transformers) with identical seeds, plus an AI-assisted RAG prompt hardening method that reduces leakage to 0%

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_boxinference_timetargeted
Datasets
Custom 80-seed experimental design across ZIP-bomb, card-number, driving-advice, drug-precursor, and RAG-exfiltration scenarios
Applications
llm safety systemschatbot guardrailsrag systems