Dan Hendrycks

benchmark · arXiv · Sep 22, 2025

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Satyapriya Krishna, Andy Zou, Rahul Gupta et al. · Amazon Nova Responsible AI · Center for AI Safety +2 more

Benchmark dataset for detecting LLMs that, under adversarial system prompt injections, hide malicious chain-of-thought reasoning behind benign outputs

Prompt Injection · nlp

2 citations

attack · arXiv · Jan 3, 2026

Aggressive Compression Enables LLM Weight Theft

Davis Brown, Juan-Pablo Rivera, Dan Hendrycks et al. · University of Pennsylvania · Georgia Institute of Technology +1 more

Aggressive compression of LLM weights reduces datacenter exfiltration time from months to days, enabling practical weight theft attacks

Model Theft · nlp

benchmark · arXiv · Oct 31, 2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Boyi Wei, Zora Che, Nathaniel Li et al. · Scale AI · Princeton University +3 more

Benchmark framework reveals bio-foundation model safety filtering is bypassable via fine-tuning, with dual-use signals persisting in pretrained representations

Transfer Learning Attack · generative
