GAVEL: Towards rule-based safety through activation monitoring
Shir Rozenfeld 1, Rahul Pankajakshan 2, Itay Zloczower 1, Eyal Lenga 1, Gilad Gressel 2, Yisroel Mirsky 1
Published on arXiv (2601.19768)
Prompt Injection
OWASP LLM Top 10: LLM01
Key Finding
Compared to probes trained on broad misuse datasets, compositional rule-based activation safety improves detection precision and supports domain customization, while enabling real-time, inference-time enforcement.
GAVEL (Governance via Activation-based Verification and Extensible Logic)
Novel technique introduced
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, suffer from poor precision, limited flexibility, and a lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs): fine-grained, interpretable factors, such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.
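To make the idea concrete, the following is a minimal, hypothetical sketch of the paradigm the abstract describes: each CE is scored by a simple linear probe over an activation vector, and a rule is a boolean predicate composed over the set of active CEs. All class names, probe directions, and the example rule here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class CEProbe:
    """Hypothetical cognitive-element detector: a linear probe
    (direction vector + threshold) over a model activation vector."""

    def __init__(self, name, direction, threshold):
        self.name = name
        self.direction = np.asarray(direction, dtype=float)
        self.threshold = threshold

    def fires(self, activation):
        # CE is "active" when the activation's projection onto the
        # probe direction exceeds the threshold.
        return float(self.direction @ activation) > self.threshold

def check_rules(activation, probes, rules):
    """Evaluate predicate rules over the set of active CEs and
    return the names of any violated rules."""
    active = {p.name for p in probes if p.fires(activation)}
    return [name for name, predicate in rules.items() if predicate(active)]

# Toy 4-d activation space; real probes would be fit on labeled activations.
probes = [
    CEProbe("making_a_threat",    [1, 0, 0, 0], 0.5),
    CEProbe("payment_processing", [0, 1, 0, 0], 0.5),
]

# Illustrative rule: flag a threat co-occurring with payment handling.
rules = {
    "threat_during_payment":
        lambda ce: {"making_a_threat", "payment_processing"} <= ce,
}

activation = np.array([0.9, 0.8, 0.0, 0.1])
print(check_rules(activation, probes, rules))  # ['threat_during_payment']
```

Because rules are plain predicates over named CEs, a practitioner could add or retire a rule by editing this table, with no retraining of the model or the probes, which is the configurability and auditability the framework claims.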
Key Contributions
- Cognitive Elements (CEs): fine-grained, interpretable activation-level primitives (e.g., 'making a threat', 'payment processing') that compositionally capture nuanced LLM behaviors
- GAVEL framework: defines predicate rules over CEs to detect policy violations in real time without retraining the model or detector, enabling configurable and auditable LLM governance
- Open-source release including tools for CE construction, activation collection, rule composition, violation detection, and an automated rule creation tool