Red-teaming Activation Probes using Prompted LLMs
Phil Blandfort, Robert Graham
Published on arXiv (arXiv:2511.00554)
- Input Manipulation Attack (OWASP ML Top 10 — ML01)
- Prompt Injection (OWASP LLM Top 10 — LLM01)
Key Finding
A lightweight prompted LLM scaffold with iterative in-context learning surfaces interpretable evasion patterns (legalese-induced false positives; bland procedural-tone false negatives) against a probe that achieves 0.91 AUROC, revealing real-world brittleness without any gradient access.
Activation probes are attractive monitors for AI systems due to low cost and latency, but their real-world robustness remains underexplored. We ask: What failure modes arise under realistic, black-box adversarial pressure, and how can we surface them with minimal effort? We present a lightweight black-box red-teaming procedure that wraps an off-the-shelf LLM with iterative feedback and in-context learning (ICL), and requires no fine-tuning, gradients, or architectural access. Running a case study with probes for high-stakes interactions, we show that our approach can help discover valuable insights about a SOTA probe. Our analysis uncovers interpretable brittleness patterns (e.g., legalese-induced FPs; bland procedural tone FNs) and reduced but persistent vulnerabilities under scenario-constraint attacks. These results suggest that simple prompted red-teaming scaffolding can anticipate failure patterns before deployment and might yield promising, actionable insights to harden future probes.
Key Contributions
- A training-free, gradient-free black-box red-teaming scaffold that wraps an LLM with iterative feedback and in-context learning to generate adversarial inputs against activation probe monitors.
- Discovery of interpretable failure patterns in a SOTA high-stakes activation probe (e.g., legalese-induced false positives; bland procedural tone causing false negatives), despite the probe achieving 0.91 AUROC on OOD evaluation data.
- Demonstration that scenario-constrained adversarial attacks (e.g., inputs must describe a medical chatbot interaction) remain feasible with reduced but persistent probe vulnerabilities.
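The scenario-constrained variant in the last contribution can be sketched as a filter on candidates before probe feedback counts. The constraint wording and the keyword check below are our assumptions for illustration; the paper likely enforces the scenario with an LLM judge rather than a keyword proxy.

```python
# Hedged sketch of a scenario-constrained attack: candidates must plausibly
# describe a medical-chatbot interaction before a probe verdict is recorded.
# Both functions are illustrative stand-ins, not the paper's implementation.

def satisfies_scenario(text: str) -> bool:
    """Toy keyword proxy for the scenario check (assumed, not from the paper)."""
    t = text.lower()
    return "chatbot" in t and any(w in t for w in ("patient", "symptom", "medical"))


def constrained_prompt(feedback: str) -> str:
    """Build the attacker-LLM prompt with the scenario as a hard constraint."""
    return (
        "Write a message from a medical-chatbot conversation that the monitor "
        "should treat as low-stakes.\n"
        "Hard constraint: it must read as a patient-chatbot exchange.\n"
        f"Previous attempts:\n{feedback or '(none yet)'}"
    )
```

Candidates failing `satisfies_scenario` would be discarded (or returned to the LLM with a constraint-violation note) before the probe is queried, which is why constrained attacks are harder but, per the paper, still feasible.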
🛡️ Threat Analysis
The paper crafts natural-language inputs at inference time to cause misclassification (false positives and false negatives) in an activation probe classifier — a direct evasion attack against an ML safety monitor.