Matt Fredrikson

benchmark arXiv Sep 22, 2025 · Sep 2025

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Satyapriya Krishna, Andy Zou, Rahul Gupta et al. · Amazon Nova Responsible AI · Center for AI Safety +2 more

Benchmark dataset for detecting LLMs that hide malicious chain-of-thought behind benign outputs via adversarial system prompt injections

Prompt Injection nlp

2 citations PDF

attack arXiv Dec 31, 2025 · Dec 2025

The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

Xiaoze Liu, Weichen Yu, Matt Fredrikson et al. · Purdue University · Carnegie Mellon University

Engineers a stealthy breaker token that lies dormant in donor LLMs but activates as a trojan after tokenizer transplant into a base model

AI Supply Chain Attacks Model Poisoning nlp

1 citations PDF Code

defense arXiv Jan 26, 2026 · 10w ago

LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models

Kai Hu, Haoqi Hu, Matt Fredrikson · Carnegie Mellon University

Scales 1-Lipschitz certified robustness to billion-parameter vision models via manifold optimization and convolution-free architecture

Input Manipulation Attack vision

PDF

attack arXiv Dec 18, 2025 · Dec 2025

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh et al. · Meta Superintelligence Labs · Carnegie Mellon University

Policy-based red teaming framework fine-tunes an attack LLM to generate diverse, human-readable jailbreak prompts achieving SOTA ASR against GPT-4o and Claude 3.5

Prompt Injection nlp

PDF

Papers in Database (4)

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models