benchmark 2025

In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Nils Durner

0 citations · 36 references · arXiv

Published on arXiv

2510.01259

Prompt Injection

OWASP LLM Top 10 — LLM01

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Composite prompts combining educator persona, safety-pretext framing, and step-cue phrasing flip guardrail bypass rates from 0% to 97.5% on ZIP-bomb construction; formal German and French registers are consistently leakier than matched English prompts.

Sociopragmatic Composite Jailbreak

Novel technique introduced

We probe OpenAI's open-weights 20-billion-parameter model gpt-oss-20b to study how sociopragmatic framing, language choice, and instruction hierarchy affect refusal behavior. Across 80 seeded iterations per scenario, we test several harm domains including ZIP-bomb construction (cyber threat), synthetic card-number generation, minor-unsafe driving advice, drug-precursor indicators, and RAG context exfiltration. Composite prompts that combine an educator persona, a safety-pretext ("what to avoid"), and step-cue phrasing flip assistance rates from 0% to 97.5% on a ZIP-bomb task. On our grid, formal registers in German and French are often leakier than matched English prompts. A "Linux terminal" role-play overrides a developer rule not to reveal context in a majority of runs with a naive developer prompt, and we introduce an AI-assisted hardening method that reduces leakage to 0% in several user-prompt variants. We further test evaluation awareness with a paired-track design and measure frame-conditioned differences between matched "helpfulness" and "harmfulness" evaluation prompts; we observe inconsistent assistance in 13% of pairs. Finally, we find that the OpenAI Moderation API under-captures materially helpful outputs relative to a semantic grader, and that refusal rates differ by 5 to 10 percentage points across inference stacks, raising reproducibility concerns. We release prompts, seeds, outputs, and code for reproducible auditing at https://github.com/ndurner/gpt-oss-rt-run .

Key Contributions

Systematic quantification of composite multilingual sociopragmatic jailbreaks across multiple harm domains (N=80 seeded iterations per scenario), flipping assistance rates from 0% to 97.5% on a ZIP-bomb task
Paired-track evaluation harness measuring evaluation-awareness and frame-conditioned inconsistency (13% of pairs show inconsistent assistance) alongside comparison of OpenAI Moderation API vs. semantic LLM grader
Empirical characterization of reproducibility gaps: refusal rates differ 5–10 percentage points across inference stacks (H100+vLLM vs RTX 5090+Transformers) with identical seeds, plus an AI-assisted RAG prompt hardening method that reduces leakage to 0%

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_boxinference_timetargeted

Datasets

Custom 80-seed experimental design across ZIP-bomb, card-number, driving-advice, drug-precursor, and RAG-exfiltration scenarios

Applications

llm safety systemschatbot guardrailsrag systems

Read PDF arXiv DOI Code

In AI Sweet Harmony: Sociopragmatic Guardrail Bypasses and Evaluation-Awareness in OpenAI gpt-oss-20b

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Exploiting Web Search Tools of AI Agents for Data Exfiltration

ConVerse: Benchmarking Contextual Safety in Agent-to-Agent Conversations

Quantifying Return on Security Controls in LLM Systems

A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction

Bits Leaked per Query: Information-Theoretic Bounds on Adversarial Attacks against LLMs

A Practical Framework for Evaluating Medical AI Security: Reproducible Assessment of Jailbreaking and Privacy Vulnerabilities Across Clinical Specialties

Evaluating Language Model Reasoning about Confidential Information

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks