benchmark · 2026

CanaryBench: Stress Testing Privacy Leakage in Cluster-Level Conversation Summaries

Deep Mehta

0 citations · 9 references · arXiv


Published on arXiv · 2601.18834

Sensitive Information Disclosure

OWASP LLM Top 10 — LLM06

Key Finding

Extractive cluster summarization leaks canary strings in 50 of 52 canary-containing clusters (96.2%); a k-min=25 threshold combined with regex redaction reduces leakage to zero while maintaining comparable cluster coherence (0.662 vs 0.653).

CanaryBench

Novel technique introduced


Aggregate analytics over conversational data are increasingly used for safety monitoring, governance, and product analysis in large language model systems. A common practice is to embed conversations, cluster them, and publish short textual summaries describing each cluster. While raw conversations may never be exposed, these derived summaries can still pose privacy risks if they contain personally identifying information (PII) or uniquely traceable strings copied from individual conversations. We introduce CanaryBench, a simple and reproducible stress test for privacy leakage in cluster-level conversation summaries. CanaryBench generates synthetic conversations with planted secret strings ("canaries") that simulate sensitive identifiers. Because canaries are known a priori, any appearance of these strings in published summaries constitutes a measurable leak. Using TF-IDF embeddings and k-means clustering on 3,000 synthetic conversations (24 topics) with a canary injection rate of 0.60, we evaluate an intentionally extractive example snippet summarizer that models quote-like reporting. In this configuration, we observe canary leakage in 50 of 52 canary-containing clusters (cluster-level leakage rate 0.961538), along with nonzero regex-based PII indicator counts. A minimal defense combining a minimum cluster-size publication threshold (k-min = 25) and regex-based redaction eliminates measured canary leakage and PII indicator hits in the reported run while maintaining a similar cluster-coherence proxy. We position this work as a societal impacts contribution centered on privacy risk measurement for published analytics artifacts rather than raw user data.
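The two leakage metrics can be sketched in a few lines. The function below is a minimal illustration, not the paper's implementation: it takes published cluster summaries and the set of canaries planted in each cluster, and reports the fraction of canaries that appear verbatim (per-canary leak rate) and the fraction of canary-containing clusters whose summary leaks at least one canary (cluster-level leak rate). The data shapes and example canary strings are assumptions for the sketch.

```python
def leak_rates(cluster_summaries, planted_canaries):
    """Compute per-canary and cluster-level leak rates.

    cluster_summaries: dict cluster_id -> published summary text
    planted_canaries:  dict cluster_id -> set of canary strings planted
                       in that cluster's conversations
    Returns (per_canary_leak_rate, cluster_level_leak_rate).
    """
    all_canaries = set().union(*planted_canaries.values()) if planted_canaries else set()
    leaked_canaries = set()
    leaked_clusters = 0
    # Only clusters that actually contain a canary count toward the
    # cluster-level denominator (52 such clusters in the reported run).
    canary_clusters = [cid for cid, c in planted_canaries.items() if c]
    for cid in canary_clusters:
        summary = cluster_summaries.get(cid, "")
        hits = {c for c in planted_canaries[cid] if c in summary}
        if hits:
            leaked_clusters += 1
            leaked_canaries |= hits
    per_canary = len(leaked_canaries) / len(all_canaries) if all_canaries else 0.0
    cluster_level = leaked_clusters / len(canary_clusters) if canary_clusters else 0.0
    return per_canary, cluster_level
```

Because the canaries are known a priori, leakage detection reduces to exact substring matching, which is what makes the benchmark cheap and reproducible.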


Key Contributions

  • Introduces CanaryBench, a reproducible stress test that injects known canary strings into synthetic LLM conversations, clusters them via TF-IDF + k-means, and measures verbatim canary appearance in published cluster summaries as a proxy for PII leakage.
  • Defines two leakage metrics — per-canary leak rate and cluster-level leak rate — and demonstrates 96.2% cluster-level leakage with extractive summarization on 3,000 synthetic conversations.
  • Shows that a minimal defense combining a minimum cluster-size publication threshold (k-min=25) and regex-based redaction eliminates all measured canary and PII-indicator leakage in the reported run while preserving topical coherence (0.662 vs 0.653 on the cluster-coherence proxy).
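The minimal defense from the last contribution can be sketched as a publication gate. The regex patterns below are illustrative stand-ins (an assumed `CANARY-XXXXXX` token shape and generic email/SSN-like PII indicators), not the paper's actual patterns: clusters smaller than k-min are suppressed entirely, and anything published is first regex-redacted.

```python
import re

K_MIN = 25  # minimum cluster size required before a summary is published

# Assumed patterns for illustration only.
REDACTION_PATTERNS = [
    re.compile(r"CANARY-[A-Z0-9]{6}"),           # assumed canary token shape
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email-like PII indicator
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # SSN-like PII indicator
]

def publish(cluster_size, summary, k_min=K_MIN):
    """Return a redacted summary, or None if the cluster is too small to publish."""
    if cluster_size < k_min:
        return None  # k-min threshold: suppress small clusters entirely
    for pat in REDACTION_PATTERNS:
        summary = pat.sub("[REDACTED]", summary)
    return summary
```

The two mechanisms are complementary: the size threshold limits how traceable any summary is to an individual conversation, while redaction removes the verbatim strings that extractive summarization tends to copy.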

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
synthetic conversations (3,000 across 24 topics)
Applications
llm conversation analytics, conversational ai safety monitoring, cluster-level summarization pipelines