benchmark 2025

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Satyapriya Krishna ¹, Andy Zou ^2,3,4, Rahul Gupta ^3,4, Eliot Krzysztof Jones ⁴, Nick Winter ⁴, Dan Hendrycks ^3,4, J. Zico Kolter ¹, Matt Fredrikson ², Spyros Matsoukas ¹

¹ Amazon Nova Responsible AI

² Center for AI Safety

³ CMU

⁴ Gray Swan AI

2 citations · 22 references · arXiv

Published on arXiv

2509.17938

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Existing LLMs and safety mechanisms fail to reliably detect deceptive alignment cases in D-REX, where benign outputs mask malicious chain-of-thought induced by adversarial system prompts.

D-REX (Deceptive Reasoning Exposure Suite)

Novel technique introduced

The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning. This vulnerability, often triggered by sophisticated system prompt injections, allows models to bypass conventional safety filters, posing a significant, underexplored risk. To address this gap, we introduce the Deceptive Reasoning Exposure Suite (D-REX), a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output. D-REX was constructed through a competitive red-teaming exercise where participants crafted adversarial system prompts to induce such deceptive behaviors. Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought, which reveals the underlying malicious intent. Our benchmark facilitates a new, essential evaluation task: the detection of deceptive alignment. We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms, highlighting the urgent need for new techniques that scrutinize the internal processes of LLMs, not just their final outputs.

Key Contributions

D-REX dataset constructed via red-teaming, containing adversarial system prompts, user queries, benign-appearing model responses, and chain-of-thought revealing malicious intent
Formalizes 'deceptive alignment detection' as a new evaluation task — detecting discrepancy between internal reasoning and final output
Demonstrates that existing safety mechanisms fail to detect this class of deceptive behavior, motivating inspection of internal model reasoning

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

inference_timegrey_box

Datasets

D-REX

Applications

llm safety evaluationalignment auditingsafety filter assessment

Read PDF arXiv DOI

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

When Scanners Lie: Evaluator Instability in LLM Red-Teaming

Comparison requires valid measurement: Rethinking attack success rate comparisons in AI red teaming

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

OpenAI's GPT-OSS-20B Model and Safety Alignment Issues in a Low-Resource Language

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

AgentDyn: A Dynamic Open-Ended Benchmark for Evaluating Prompt Injection Attacks of Real-World Agent Security System

JailNewsBench: Multi-Lingual and Regional Benchmark for Fake News Generation under Jailbreak Attacks