Latest papers

5 papers
benchmark arXiv Mar 16, 2026

How Vulnerable Are AI Agents to Indirect Prompt Injections? Insights from a Large-Scale Public Competition

Mateusz Dziemian, Maxwell Lin, Xiaohan Fu et al. · Gray Swan AI · OpenAI +6 more

Large-scale red-teaming competition finds all frontier LLM agents vulnerable to concealed indirect prompt injection attacks, with success rates of 0.5-8.5%

Prompt Injection Excessive Agency nlp multimodal
PDF
attack arXiv Jan 3, 2026

Aggressive Compression Enables LLM Weight Theft

Davis Brown, Juan-Pablo Rivera, Dan Hendrycks et al. · University of Pennsylvania · Georgia Institute of Technology +1 more

Aggressive compression of LLM weights reduces datacenter exfiltration time from months to days, enabling practical weight theft attacks

Model Theft nlp
PDF
benchmark arXiv Oct 31, 2025

Best Practices for Biorisk Evaluations on Open-Weight Bio-Foundation Models

Boyi Wei, Zora Che, Nathaniel Li et al. · Scale AI · Princeton University +3 more

Benchmark framework reveals that bio-foundation model safety filtering is bypassable via fine-tuning, with dual-use signals persisting in pretrained representations

Transfer Learning Attack generative
PDF
benchmark arXiv Sep 22, 2025

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Satyapriya Krishna, Andy Zou, Rahul Gupta et al. · Amazon Nova Responsible AI · Center for AI Safety +2 more

Benchmark dataset for detecting LLMs that hide malicious chain-of-thought reasoning behind benign outputs, elicited via adversarial system prompts

Prompt Injection nlp
2 citations PDF
benchmark arXiv Aug 27, 2025

Evaluating Language Model Reasoning about Confidential Information

Dylan Sam, Alexander Robey, Andy Zou et al. · Carnegie Mellon University · Gray Swan AI +1 more

Benchmarks LLMs' ability to guard confidential information, finding that reasoning traces leak secrets and jailbreaks bypass access controls

Sensitive Information Disclosure Prompt Injection nlp
PDF Code