Matt Fredrikson

h-index: 5 · 1,071 citations · 17 papers (total)

Papers in Database (4)

benchmark arXiv Sep 22, 2025 · Sep 2025

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

Satyapriya Krishna, Andy Zou, Rahul Gupta et al. · Amazon Nova Responsible AI · Center for AI Safety +2 more

Benchmark dataset for detecting LLMs that hide malicious chain-of-thought behind benign outputs via adversarial system prompt injections

Prompt Injection nlp
2 citations PDF

attack arXiv Dec 31, 2025 · Dec 2025

The Trojan in the Vocabulary: Stealthy Sabotage of LLM Composition

Xiaoze Liu, Weichen Yu, Matt Fredrikson et al. · Purdue University · Carnegie Mellon University

Engineers a stealthy breaker token that lies dormant in donor LLMs but activates as a trojan after tokenizer transplant into a base model

AI Supply Chain Attacks Model Poisoning nlp
1 citation PDF Code

defense arXiv Jan 26, 2026 · Jan 2026

LipNeXt: Scaling up Lipschitz-based Certified Robustness to Billion-parameter Models

Kai Hu, Haoqi Hu, Matt Fredrikson · Carnegie Mellon University

Scales 1-Lipschitz certified robustness to billion-parameter vision models via manifold optimization and a convolution-free architecture

Input Manipulation Attack vision
PDF

attack arXiv Dec 18, 2025 · Dec 2025

Jailbreak-Zero: A Path to Pareto Optimal Red Teaming for Large Language Models

Kai Hu, Abhinav Aggarwal, Mehran Khodabandeh et al. · Meta Superintelligence Labs · Carnegie Mellon University

Policy-based red-teaming framework that fine-tunes an attack LLM to generate diverse, human-readable jailbreak prompts, achieving a state-of-the-art attack success rate (ASR) against GPT-4o and Claude 3.5

Prompt Injection nlp
PDF