GAVEL: Towards rule-based safety through activation monitoring
Shir Rozenfeld 1, Rahul Pankajakshan 2, Itay Zloczower 1, Eyal Lenga 1, Gilad Gressel 2, Yisroel Mirsky 1
Published on arXiv (2601.19768)
Prompt Injection
OWASP LLM Top 10: LLM01
Key Finding
Compared to probes trained on broad misuse datasets, compositional rule-based activation safety improves detection precision and supports domain customization, while enabling real-time, inference-time enforcement.
GAVEL (Governance via Activation-based Verification and Extensible Logic)
Novel technique introduced
Large language models (LLMs) are increasingly paired with activation-based monitoring to detect and prevent harmful behaviors that may not be apparent at the surface-text level. However, existing activation safety approaches, trained on broad misuse datasets, suffer from poor precision, limited flexibility, and a lack of interpretability. This paper introduces a new paradigm: rule-based activation safety, inspired by rule-sharing practices in cybersecurity. We propose modeling activations as cognitive elements (CEs): fine-grained, interpretable factors, such as "making a threat" and "payment processing", that can be composed to capture nuanced, domain-specific behaviors with higher precision. Building on this representation, we present a practical framework that defines predicate rules over CEs and detects violations in real time. This enables practitioners to configure and update safeguards without retraining models or detectors, while supporting transparency and auditability. Our results show that compositional rule-based activation safety improves precision, supports domain customization, and lays the groundwork for scalable, interpretable, and auditable AI governance. We will release GAVEL as an open-source framework and provide an accompanying automated rule creation tool.
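To make the idea concrete, the following is a minimal, hypothetical sketch of the paradigm the abstract describes: each CE is scored by a simple linear probe over an activation vector, and a rule is a boolean predicate composed over the set of active CEs. All class names, probe directions, and the example rule here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class CEProbe:
    """Hypothetical cognitive-element detector: a linear probe
    (direction vector + threshold) over a model activation vector."""

    def __init__(self, name, direction, threshold):
        self.name = name
        self.direction = np.asarray(direction, dtype=float)
        self.threshold = threshold

    def fires(self, activation):
        # CE is "active" when the activation's projection onto the
        # probe direction exceeds the threshold.
        return float(self.direction @ activation) > self.threshold

def check_rules(activation, probes, rules):
    """Evaluate predicate rules over the set of active CEs and
    return the names of any violated rules."""
    active = {p.name for p in probes if p.fires(activation)}
    return [name for name, predicate in rules.items() if predicate(active)]

# Toy 4-d activation space; real probes would be fit on labeled activations.
probes = [
    CEProbe("making_a_threat",    [1, 0, 0, 0], 0.5),
    CEProbe("payment_processing", [0, 1, 0, 0], 0.5),
]

# Illustrative rule: flag a threat co-occurring with payment handling.
rules = {
    "threat_during_payment":
        lambda ce: {"making_a_threat", "payment_processing"} <= ce,
}

activation = np.array([0.9, 0.8, 0.0, 0.1])
print(check_rules(activation, probes, rules))  # ['threat_during_payment']
```

Because rules are plain predicates over named CEs, a practitioner could add or retire a rule by editing this table, with no retraining of the model or the probes, which is the configurability and auditability the framework claims.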
Key Contributions
- Cognitive Elements (CEs): fine-grained, interpretable activation-level primitives (e.g., 'making a threat', 'payment processing') that compositionally capture nuanced LLM behaviors
- GAVEL framework: defines predicate rules over CEs to detect policy violations in real time without retraining the model or detector, enabling configurable and auditable LLM governance
- Open-source release including tools for CE construction, activation collection, rule composition, violation detection, and an automated rule creation tool