defense 2026

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin 1,2, Michael Duan 2, Qin Lin 1, Aaron Chan 1, Zhenglun Chen 1, Junyi Du 1, Xiang Ren 1,2


Published on arXiv: 2603.05786

Output Integrity Attack (OWASP ML Top 10 — ML09)

Excessive Agency (OWASP LLM Top 10 — LLM08)

Key Finding

TEE-signed attestation enables offline user verification of guardrail execution while keeping the developer's agent private, though jailbreaking of the attested guardrail by a malicious developer remains an unaddressed residual risk.

proof-of-guardrail (Verifiable-ClawGuard)

Novel technique introduced


As AI agents become widely deployed as online services, users often rely on an agent developer's claims about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address this threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response was generated after a specific open-source guardrail had been applied. To generate the proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution that any user can verify offline. We implement proof-of-guardrail for OpenClaw agents and evaluate its latency overhead and deployment cost. Proof-of-guardrail ensures the integrity of guardrail execution while keeping the developer's agent private, but we also highlight a residual risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
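The attestation flow described above can be sketched as a sign-then-verify protocol: inside the TEE, the guardrail code is measured (hashed), run on the agent's response, and the resulting transcript is signed; any user can later check the signature and the guardrail hash offline. The sketch below is illustrative only — all names are hypothetical, and an HMAC with a shared key stands in for the TEE's asymmetric attestation key (real TEEs sign with a hardware-rooted private key and users verify against the vendor's public key).

```python
import hashlib
import hmac
import json

# Stand-in for the TEE's hardware attestation key (a real TEE would use an
# asymmetric key pair rooted in the hardware vendor's certificate chain).
TEE_ATTESTATION_KEY = b"simulated-tee-attestation-key"


def attest_guardrail_run(guardrail_code: bytes, response: str) -> dict:
    """Runs inside the TEE: measure the guardrail, apply it, sign the transcript."""
    guardrail_hash = hashlib.sha256(guardrail_code).hexdigest()
    # Toy guardrail stand-in: block responses containing a forbidden marker.
    passed = "FORBIDDEN" not in response
    transcript = json.dumps(
        {"guardrail_sha256": guardrail_hash, "response": response, "passed": passed},
        sort_keys=True,
    )
    signature = hmac.new(
        TEE_ATTESTATION_KEY, transcript.encode(), hashlib.sha256
    ).hexdigest()
    return {"transcript": transcript, "signature": signature}


def verify_offline(proof: dict, expected_guardrail: bytes) -> bool:
    """Run by any user offline: check the signature, then the guardrail hash."""
    expected_sig = hmac.new(
        TEE_ATTESTATION_KEY, proof["transcript"].encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected_sig, proof["signature"]):
        return False  # transcript was tampered with or not produced in the TEE
    claims = json.loads(proof["transcript"])
    return (
        claims["guardrail_sha256"] == hashlib.sha256(expected_guardrail).hexdigest()
        and claims["passed"]
    )


guardrail = b"def check(text): return 'FORBIDDEN' not in text"
proof = attest_guardrail_run(guardrail, "Here is a safe answer.")
assert verify_offline(proof, guardrail)  # genuine proof verifies
tampered = dict(proof, transcript=proof["transcript"].replace("safe", "unsafe"))
assert not verify_offline(tampered, guardrail)  # tampering breaks the proof
```

Note that the binding between the guardrail hash and the response is what prevents the output-integrity attack: a developer who skips the guardrail cannot produce a valid signature over a transcript claiming it ran.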


Key Contributions

  • Proposes proof-of-guardrail, a TEE-based system generating cryptographic attestations that a specific open-source guardrail was executed on an agent response
  • Implements the system for OpenClaw agents and evaluates latency overhead and deployment cost
  • Identifies residual deception risks even with proof (e.g., malicious developers jailbreaking the guardrail itself)

🛡️ Threat Analysis

Output Integrity Attack

Proposes a verifiable inference scheme that uses TEE attestation to prove guardrail code was executed on agent outputs, directly addressing output integrity and tamper-evident safety enforcement. Without such proof, a developer could bypass the guardrail while still claiming outputs were safety-checked; this scheme makes that misrepresentation cryptographically detectable.


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, white_box
Applications
ai agents, llm safety enforcement, online ai services