defense 2026

Building Production-Ready Probes For Gemini

János Kramár, Joshua Engels, Zheng Wang, Bilal Chughtai, Rohin Shah, Neel Nanda, Arthur Conmy

3 citations · 60 references · arXiv

Published on arXiv: 2601.11516

Prompt Injection

OWASP LLM Top 10 — LLM01

Key Finding

Novel probe architectures combined with diverse training distributions generalize across production shifts; pairing probes with prompted classifiers achieves optimal accuracy efficiently, enabling live deployment in Gemini.

Production-ready activation probes

Novel technique introduced


Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant distribution shifts, including multi-turn conversations, long context prompts, and adaptive red teaming. Our results demonstrate that while our novel architectures address context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
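To make the core idea concrete, the sketch below trains a minimal linear activation probe: a logistic-regression classifier on mean-pooled activation vectors. This is an illustrative baseline only, using synthetic stand-in activations; the paper's novel architectures for long-context robustness are more elaborate, and the "misuse direction", dimensions, and learning rate here are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for residual-stream activations: one d-dimensional
# vector per token, mean-pooled over the sequence (a common probe input
# choice; not the paper's actual architecture).
d, n = 16, 200
direction = rng.normal(size=d)  # hypothetical "misuse" direction in activation space

def sample_activations(label, seq_len=10):
    base = rng.normal(size=(seq_len, d))
    if label:
        base += 0.8 * direction  # shift harmful examples along the direction
    return base.mean(axis=0)     # mean-pool over tokens

X = np.stack([sample_activations(i % 2) for i in range(n)])
y = np.array([i % 2 for i in range(n)], dtype=float)

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    w -= 0.5 * (X.T @ (p - y) / n)
    b -= 0.5 * (p - y).mean()

preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(float)
accuracy = (preds == y).mean()
```

On this cleanly separable synthetic data the probe reaches near-perfect training accuracy; the paper's point is precisely that such probes can fail once the evaluation distribution shifts (e.g. to long contexts), which this toy setup does not capture.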


Key Contributions

  • Novel probe architectures that generalize across the short-to-long-context production distribution shift, a key failure mode of prior activation probes
  • Comprehensive evaluation in the cyber-offensive domain against multi-turn conversations, long-context prompts, and adaptive red-teaming, showing that architecture choice plus diverse training is required for broad generalization
  • Demonstration that pairing probes with prompted classifiers achieves optimal accuracy at low compute cost, with findings informing successful deployment in production Gemini
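One way to read the probe-plus-prompted-classifier pairing is as a cascade: the cheap probe scores every request, and only inputs it is uncertain about are escalated to the expensive prompted classifier. The sketch below illustrates that control flow under stated assumptions; the thresholds, `probe_score`, and `prompted_classifier` stand-ins are hypothetical and not from the paper.

```python
# Illustrative thresholds; the paper does not specify these values.
PROBE_LOW, PROBE_HIGH = 0.2, 0.8

def probe_score(text: str) -> float:
    # Stand-in for a real activation probe's score; here a trivial
    # keyword heuristic purely so the example runs end to end.
    if "exploit" in text:
        return 0.9
    if "weather" in text:
        return 0.1
    return 0.5

def prompted_classifier(text: str) -> bool:
    # Stand-in for querying the LLM itself as a zero-shot safety
    # classifier (far more expensive than the probe in practice).
    return "exploit" in text or "payload" in text

def classify(text: str) -> tuple[bool, str]:
    """Return (is_flagged, which_component_decided)."""
    s = probe_score(text)
    if s >= PROBE_HIGH:
        return True, "probe"    # confidently flagged by the cheap probe
    if s <= PROBE_LOW:
        return False, "probe"   # confidently cleared by the cheap probe
    # Ambiguous region: escalate to the prompted classifier.
    return prompted_classifier(text), "prompted"
```

The design intuition, consistent with the abstract's cost argument: because the probe is nearly free at inference time, routing only its ambiguous middle band to the prompted classifier buys most of the accuracy at a small fraction of the compute.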

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm · transformer
Threat Tags
black_box · grey_box · inference_time
Applications
llm misuse detection · cyber-offensive content filtering · llm safety guardrails