benchmark 2025

How Catastrophic is Your LLM? Certifying Risk in Conversation

Chengxiao Wang ¹, Isha Chaudhary ¹, Qian Hu ², Weitong Ruan ², Rahul Gupta ², Gagandeep Singh ¹

¹ University of Illinois, Urbana-Champaign

² Amazon

1 citation · 42 references · arXiv (Cornell University)

Published on arXiv

2510.03969

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Certified lower bounds on catastrophic response probability reach as high as 70% for the worst-performing frontier model, exposing critical gaps in current safety training strategies.

C³LLM

Novel technique introduced

Large Language Models (LLMs) can produce catastrophic responses in conversational settings that pose serious risks to public safety and security. Existing evaluations often fail to fully reveal these vulnerabilities because they rely on fixed attack prompt sequences, lack statistical guarantees, and do not scale to the vast space of multi-turn conversations. In this work, we propose C³LLM, a novel, principled statistical Certification framework for Catastrophic risks in multi-turn Conversation for LLMs that bounds the probability of an LLM generating catastrophic responses under multi-turn conversation distributions with statistical guarantees. We model multi-turn conversations as probability distributions over query sequences, represented by a Markov process on a query graph whose edges encode semantic similarity to capture realistic conversational flow, and quantify catastrophic risks using confidence intervals. We define several inexpensive and practical distributions: random node, graph path, and adaptive with rejection. Our results demonstrate that these distributions can reveal substantial catastrophic risks in frontier models, with certified lower bounds as high as 70% for the worst model, highlighting the urgent need for improved safety training strategies in frontier LLMs.
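To make the idea of a certified bound concrete, here is a minimal sketch of how a confidence interval on the catastrophic-response probability could be computed from sampled conversations. The abstract does not say which interval is used, so the exact (Clopper-Pearson) binomial interval below is an assumption, and the function names and the counts in the example are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _bisect_root(f, lo=0.0, hi=1.0, iters=100):
    """Root of a function that is increasing on [lo, hi], found by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for a binomial proportion.

    k: number of sampled conversations judged catastrophic (assumed 0/1 labels)
    n: total number of sampled conversations
    """
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    if k == 0:
        lower = 0.0
    else:
        lower = _bisect_root(lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    if k == n:
        upper = 1.0
    else:
        upper = _bisect_root(lambda p: alpha / 2 - binom_cdf(k, n, p))
    return lower, upper

# Hypothetical counts, for illustration only (not results from the paper).
low, high = clopper_pearson(k=40, n=50)
print(f"95% interval on catastrophic-response probability: [{low:.3f}, {high:.3f}]")
```

Under this reading, the lower end of the interval is the quantity reported as a certified lower bound: with confidence 1 - alpha, the true probability under the sampled distribution is at least that value.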

Key Contributions

C³LLM: a principled statistical certification framework that bounds the probability of catastrophic LLM responses under multi-turn conversation distributions with formal statistical guarantees
Modeling multi-turn conversations as Markov processes on query graphs encoding semantic similarity to capture realistic conversational flow
Three practical sampling distributions (random node, graph path, adaptive with rejection) enabling scalable risk quantification via confidence intervals
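The difference between the random node and graph path distributions can be sketched on a toy query graph. This is an illustrative reading of the summary above, not the authors' implementation: the graph, the query IDs, the uniform transition probabilities, and the function names are all assumptions, and the adaptive-with-rejection distribution is omitted because the summary does not describe it in enough detail.

```python
import random

# Hypothetical query graph: nodes are query IDs, and an edge means the two
# queries are semantically similar (how similarity is computed is not shown).
QUERY_GRAPH = {
    "q0": ["q1", "q2"],
    "q1": ["q0", "q3"],
    "q2": ["q0", "q3", "q4"],
    "q3": ["q1", "q2"],
    "q4": ["q2"],
}

def sample_random_node(graph, length, rng):
    """Each turn's query is drawn independently and uniformly from all nodes."""
    nodes = sorted(graph)
    return [rng.choice(nodes) for _ in range(length)]

def sample_graph_path(graph, length, rng):
    """Markov walk: uniform start node, then a uniformly chosen neighbor each turn.

    Uniform transitions are an assumption made for this sketch.
    """
    current = rng.choice(sorted(graph))
    path = [current]
    while len(path) < length:
        neighbors = graph[current]
        if not neighbors:
            break  # dead end: return a shorter conversation
        current = rng.choice(neighbors)
        path.append(current)
    return path

rng = random.Random(0)  # fixed seed so the sketch is reproducible
print("random node:", sample_random_node(QUERY_GRAPH, 4, rng))
print("graph path: ", sample_graph_path(QUERY_GRAPH, 4, rng))
```

Each sampled sequence would then be sent to the model turn by turn, and the resulting conversation labeled catastrophic or not; those labels are what a confidence interval is computed over.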

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llm

Threat Tags

black_box, inference_time

Applications

llm safety evaluation, multi-turn conversational ai, frontier model red-teaming

Similar Papers

RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models

Toxicity Red-Teaming: Benchmarking LLM Safety in Singapore's Low-Resource Languages

Quantifying CBRN Risk in Frontier Models

Read the Scene, Not the Script: Outcome-Aware Safety for LLMs

Understanding LLM Behavior When Encountering User-Supplied Harmful Content in Harmless Tasks

When Your Reviewer is an LLM: Biases, Divergence, and Prompt Injection Risks in Peer Review

MalURLBench: A Benchmark Evaluating Agents' Vulnerabilities When Processing Web URLs

SocialHarmBench: Revealing LLM Vulnerabilities to Socially Harmful Requests