benchmark 2025

Towards mitigating information leakage when evaluating safety monitors

Gerard Boxo ¹, Aman Neelappa ¹, Shivam Raval ²

¹ Independent

² Harvard University

0 citations · 37 references · arXiv (Cornell University)

Published on arXiv

2509.21344

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Linear probes for detecting harmful LLM behaviors show 10–40% AUROC reduction when textual leakage is removed, with model organism evaluations dropping from 0.94 to 0.57 AUROC for sandbagging, indicating probes detect surface-level text artifacts rather than genuine misaligned internal states.

Leakage Mitigation Framework

Novel technique introduced

White box monitors that analyze model internals offer promising advantages for detecting potentially harmful behaviors in large language models, including lower computational costs and integration into layered defense systems.However, training and evaluating these monitors requires response exemplars that exhibit the target behaviors, typically elicited through prompting or fine-tuning. This presents a challenge when the information used to elicit behaviors inevitably leaks into the data that monitors ingest, inflating their effectiveness. We present a systematic framework for evaluating a monitor's performance in terms of its ability to detect genuine model behavior rather than superficial elicitation artifacts. Furthermore, we propose three novel strategies to evaluate the monitor: content filtering (removing deception-related text from inputs), score filtering (aggregating only over task-relevant tokens), and prompt distilled fine-tuned model organisms (models trained to exhibit deceptive behavior without explicit prompting). Using deception detection as a representative case study, we identify two forms of leakage that inflate monitor performance: elicitation leakage from prompts that explicitly request harmful behavior, and reasoning leakage from models that verbalize their deceptive actions. Through experiments on multiple deception benchmarks, we apply our proposed mitigation strategies and measure performance retention. Our evaluation of the monitors reveal three crucial findings: (1) Content filtering is a good mitigation strategy that allows for a smooth removal of elicitation signal and can decrease probe AUROC by 30\% (2) Score filtering was found to reduce AUROC by 15\% but is not as straightforward to attribute to (3) A finetuned model organism improves monitor evaluations but reduces their performance by upto 40\%, even when re-trained.

Key Contributions

Systematic framework for identifying and quantifying two forms of leakage (elicitation leakage and reasoning leakage) that inflate safety probe performance
Three mitigation strategies — content filtering, score filtering, and prompt-distilled model organisms — to more accurately evaluate probe monitors
Empirical evidence across deception, sandbagging, sycophancy, and bias benchmarks showing probe AUROC drops by 10–40% once textual artifacts are removed, suggesting linear probes rely on surface-level cues rather than genuine internal representations

🛡️ Threat Analysis

Details

Domains

nlp

Model Types

llmtransformer

Threat Tags

white_boxinference_time

Datasets

Sandbagging benchmarksSycophancy benchmarksBias benchmarks

Applications

llm safety monitoringharmful behavior detectionai alignment evaluation

Read PDF arXiv DOI

Towards mitigating information leakage when evaluating safety monitors

Key Contributions

🛡️ Threat Analysis

Details

Similar Papers

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

Unveiling the Latent Directions of Reflection in Large Language Models

RACA: Representation-Aware Coverage Criteria for LLM Safety Testing

Exposing the Systematic Vulnerability of Open-Weight Models to Prefill Attacks

What Matters For Safety Alignment?

Defenses Against Prompt Attacks Learn Surface Heuristics

Analysing the Safety Pitfalls of Steering Vectors

The Struggle Between Continuation and Refusal: A Mechanistic Analysis of the Continuation-Triggered Jailbreak in LLMs