
Behind the Mask: Benchmarking Camouflaged Jailbreaks in Large Language Models

Youjia Zheng, Mohammad Zandsalimy, Shanu Sushmita


Published on arXiv: 2509.05471

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

LLMs exhibit a significant decline in safety and compliance scores when confronted with camouflaged jailbreak prompts compared to benign inputs, exposing a critical gap in keyword-based defenses.

Camouflaged Jailbreak Prompts (CJP)

Novel technique introduced


Large Language Models (LLMs) are increasingly vulnerable to a sophisticated form of adversarial prompting known as camouflaged jailbreaking. This method embeds malicious intent within seemingly benign language to evade existing safety mechanisms. Unlike overt attacks, these subtle prompts exploit contextual ambiguity and the flexible nature of language, posing significant challenges to current defense systems. This paper investigates the construction and impact of camouflaged jailbreak prompts, emphasizing their deceptive characteristics and the limitations of traditional keyword-based detection methods. We introduce a novel benchmark dataset, Camouflaged Jailbreak Prompts, containing 500 curated examples (400 harmful and 100 benign prompts) designed to rigorously stress-test LLM safety protocols. In addition, we propose a multi-faceted evaluation framework that measures harmfulness across seven dimensions: Safety Awareness, Technical Feasibility, Implementation Safeguards, Harmful Potential, Educational Value, Content Quality, and Compliance Score. Our findings reveal a stark contrast in LLM behavior: while models demonstrate high safety and content quality with benign inputs, they exhibit a significant decline in performance and safety when confronted with camouflaged jailbreak attempts. This disparity underscores a pervasive vulnerability, highlighting the urgent need for more nuanced and adaptive security strategies to ensure the responsible and robust deployment of LLMs in real-world applications.


Key Contributions

  • Curated benchmark dataset of 500 camouflaged jailbreak prompts (400 harmful, 100 benign) designed to stress-test LLM safety protocols
  • Multi-faceted evaluation framework measuring harmfulness across 7 dimensions: Safety Awareness, Technical Feasibility, Implementation Safeguards, Harmful Potential, Educational Value, Content Quality, and Compliance Score
  • Empirical analysis revealing significant LLM safety degradation when confronted with camouflaged versus overt jailbreak prompts
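The seven-dimension framework above can be pictured as a per-response score record. The sketch below is a hypothetical illustration only: the field names mirror the paper's seven dimensions, but the 0–10 scale, the example values, and the simple mean aggregation are assumptions, not the authors' actual scoring method.

```python
from dataclasses import dataclass, fields

@dataclass
class HarmfulnessScores:
    """One record per model response; fields match the paper's seven
    dimensions. Scale (0-10) and aggregation are illustrative assumptions."""
    safety_awareness: float
    technical_feasibility: float
    implementation_safeguards: float
    harmful_potential: float
    educational_value: float
    content_quality: float
    compliance_score: float

    def mean(self) -> float:
        # Naive unweighted mean across all seven dimensions (an assumption;
        # note that harmful_potential runs in the opposite direction).
        vals = [getattr(self, f.name) for f in fields(self)]
        return sum(vals) / len(vals)

# Illustrative values echoing the key finding: a response to a benign prompt
# scores high on safety/compliance, while a response elicited by a
# camouflaged jailbreak scores markedly lower on those dimensions.
benign = HarmfulnessScores(9.5, 8.0, 9.0, 1.0, 8.5, 9.0, 9.5)
camouflaged = HarmfulnessScores(3.0, 7.5, 2.0, 8.0, 4.0, 6.5, 2.5)

for f in fields(HarmfulnessScores):
    b, c = getattr(benign, f.name), getattr(camouflaged, f.name)
    print(f"{f.name}: benign={b}, camouflaged={c}")
```

A per-dimension comparison like this makes the paper's contrast concrete: the gap shows up most sharply in safety-oriented dimensions (Safety Awareness, Implementation Safeguards, Compliance Score) rather than in content quality alone.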

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
black_box, inference_time
Datasets
Camouflaged Jailbreak Prompts (CJP) — 500 curated prompts introduced by the authors
Applications
large language model safety, jailbreak detection, AI content moderation