benchmark · 2026

CanaryBench: Stress Testing Privacy Leakage in Cluster-Level Conversation Summaries

Deep Mehta

0 citations · 9 references · arXiv


Published on arXiv · 2601.18834

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Extractive cluster summarization leaks canary strings in 50 of 52 canary-containing clusters (96.2%); a k-min=25 threshold combined with regex redaction reduces leakage to zero while maintaining comparable cluster coherence (0.662 vs 0.653).

CanaryBench

Novel technique introduced


Aggregate analytics over conversational data are increasingly used for safety monitoring, governance, and product analysis in large language model systems. A common practice is to embed conversations, cluster them, and publish short textual summaries describing each cluster. While raw conversations may never be exposed, these derived summaries can still pose privacy risks if they contain personally identifying information (PII) or uniquely traceable strings copied from individual conversations. We introduce CanaryBench, a simple and reproducible stress test for privacy leakage in cluster-level conversation summaries. CanaryBench generates synthetic conversations with planted secret strings ("canaries") that simulate sensitive identifiers. Because canaries are known a priori, any appearance of these strings in published summaries constitutes a measurable leak. Using TF-IDF embeddings and k-means clustering on 3,000 synthetic conversations (24 topics) with a canary injection rate of 0.60, we evaluate an intentionally extractive example snippet summarizer that models quote-like reporting. In this configuration, we observe canary leakage in 50 of 52 canary-containing clusters (cluster-level leakage rate 0.961538), along with nonzero regex-based PII indicator counts. A minimal defense combining a minimum cluster-size publication threshold (k-min = 25) and regex-based redaction eliminates measured canary leakage and PII indicator hits in the reported run while maintaining a similar cluster-coherence proxy. We position this work as a societal impacts contribution centered on privacy risk measurement for published analytics artifacts rather than raw user data.
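The two leakage metrics can be sketched in a few lines. The function below is a minimal illustration, not the paper's implementation: it takes published cluster summaries and the set of canaries planted in each cluster, and reports the fraction of canaries that appear verbatim (per-canary leak rate) and the fraction of canary-containing clusters whose summary leaks at least one canary (cluster-level leak rate). The data shapes and example canary strings are assumptions for the sketch.

```python
def leak_rates(cluster_summaries, planted_canaries):
    """Compute per-canary and cluster-level leak rates.

    cluster_summaries: dict cluster_id -> published summary text
    planted_canaries:  dict cluster_id -> set of canary strings planted
                       in that cluster's conversations
    Returns (per_canary_leak_rate, cluster_level_leak_rate).
    """
    all_canaries = set().union(*planted_canaries.values()) if planted_canaries else set()
    leaked_canaries = set()
    leaked_clusters = 0
    # Only clusters that actually contain a canary count toward the
    # cluster-level denominator (52 such clusters in the reported run).
    canary_clusters = [cid for cid, c in planted_canaries.items() if c]
    for cid in canary_clusters:
        summary = cluster_summaries.get(cid, "")
        hits = {c for c in planted_canaries[cid] if c in summary}
        if hits:
            leaked_clusters += 1
            leaked_canaries |= hits
    per_canary = len(leaked_canaries) / len(all_canaries) if all_canaries else 0.0
    cluster_level = leaked_clusters / len(canary_clusters) if canary_clusters else 0.0
    return per_canary, cluster_level
```

Because the canaries are known a priori, leakage detection reduces to exact substring matching, which is what makes the benchmark cheap and reproducible.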


Key Contributions

  • Introduces CanaryBench, a reproducible stress test that injects known canary strings into synthetic LLM conversations, clusters them via TF-IDF + k-means, and measures verbatim canary appearance in published cluster summaries as a proxy for PII leakage.
  • Defines two leakage metrics — per-canary leak rate and cluster-level leak rate — and demonstrates 96.2% cluster-level leakage with extractive summarization on 3,000 synthetic conversations.
  • Shows that a minimal defense combining a minimum cluster-size publication threshold (k-min=25) and regex-based redaction eliminates all measured canary and PII-indicator leakage in the reported run while preserving topical coherence (0.662 vs 0.653 on the cluster-coherence proxy).
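The minimal defense from the last contribution can be sketched as a publication gate. The regex patterns below are illustrative stand-ins (an assumed `CANARY-XXXXXX` token shape and generic email/SSN-like PII indicators), not the paper's actual patterns: clusters smaller than k-min are suppressed entirely, and anything published is first regex-redacted.

```python
import re

K_MIN = 25  # minimum cluster size required before a summary is published

# Assumed patterns for illustration only.
REDACTION_PATTERNS = [
    re.compile(r"CANARY-[A-Z0-9]{6}"),           # assumed canary token shape
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email-like PII indicator
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like PII indicator
]

def publish(cluster_size, summary, k_min=K_MIN):
    """Return a redacted summary, or None if the cluster is too small to publish."""
    if cluster_size < k_min:
        return None  # k-min threshold: suppress small clusters entirely
    for pat in REDACTION_PATTERNS:
        summary = pat.sub("[REDACTED]", summary)
    return summary
```

The two mechanisms are complementary: the size threshold limits how traceable any summary is to an individual conversation, while redaction removes the verbatim strings that extractive summarization tends to copy.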

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
synthetic conversations (3,000 across 24 topics)
Applications
llm conversation analytics, conversational ai safety monitoring, cluster-level summarization pipelines