defense 2026

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin 1,2, Michael Duan 2, Qin Lin 1, Aaron Chan 1, Zhenglun Chen 1, Junyi Du 1, Xiang Ren 1,2


Published on arXiv: 2603.05786

Output Integrity Attack (OWASP ML Top 10 — ML09)

Excessive Agency (OWASP LLM Top 10 — LLM08)

Key Finding

TEE-signed attestation enables offline user verification of guardrail execution while keeping the developer's agent private, though jailbreaking of the attested guardrail by a malicious developer remains an unaddressed residual risk.

proof-of-guardrail (Verifiable-ClawGuard)

Novel technique introduced


As AI agents become widely deployed as online services, users often rely on an agent developer's claims about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address this threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response was generated after a specific open-source guardrail had been applied. To generate the proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution that any user can verify offline. We implement proof-of-guardrail for OpenClaw agents and evaluate its latency overhead and deployment cost. Proof-of-guardrail ensures the integrity of guardrail execution while keeping the developer's agent private, but we also highlight a residual risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
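The attestation flow described above can be sketched as a sign-then-verify protocol: inside the TEE, the guardrail code is measured (hashed), run on the agent's response, and the resulting transcript is signed; any user can later check the signature and the guardrail hash offline. The sketch below is illustrative only — all names are hypothetical, and an HMAC with a shared key stands in for the TEE's asymmetric attestation key (real TEEs sign with a hardware-rooted private key and users verify against the vendor's public key).

```python
import hashlib
import hmac
import json

# Stand-in for the TEE's hardware attestation key (a real TEE would use an
# asymmetric key pair rooted in the hardware vendor's certificate chain).
TEE_ATTESTATION_KEY = b"simulated-tee-attestation-key"


def attest_guardrail_run(guardrail_code: bytes, response: str) -> dict:
    """Runs inside the TEE: measure the guardrail, apply it, sign the transcript."""
    guardrail_hash = hashlib.sha256(guardrail_code).hexdigest()
    # Toy guardrail stand-in: block responses containing a forbidden marker.
    passed = "FORBIDDEN" not in response
    transcript = json.dumps(
        {"guardrail_sha256": guardrail_hash, "response": response, "passed": passed},
        sort_keys=True,
    )
    signature = hmac.new(
        TEE_ATTESTATION_KEY, transcript.encode(), hashlib.sha256
    ).hexdigest()
    return {"transcript": transcript, "signature": signature}


def verify_offline(proof: dict, expected_guardrail: bytes) -> bool:
    """Run by any user offline: check the signature, then the guardrail hash."""
    expected_sig = hmac.new(
        TEE_ATTESTATION_KEY, proof["transcript"].encode(), hashlib.sha256
    ).hexdigest()
    if not hmac.compare_digest(expected_sig, proof["signature"]):
        return False  # transcript was tampered with or not produced in the TEE
    claims = json.loads(proof["transcript"])
    return (
        claims["guardrail_sha256"] == hashlib.sha256(expected_guardrail).hexdigest()
        and claims["passed"]
    )


guardrail = b"def check(text): return 'FORBIDDEN' not in text"
proof = attest_guardrail_run(guardrail, "Here is a safe answer.")
assert verify_offline(proof, guardrail)  # genuine proof verifies
tampered = dict(proof, transcript=proof["transcript"].replace("safe", "unsafe"))
assert not verify_offline(tampered, guardrail)  # tampering breaks the proof
```

Note that the binding between the guardrail hash and the response is what prevents the output-integrity attack: a developer who skips the guardrail cannot produce a valid signature over a transcript claiming it ran.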


Key Contributions

  • Proposes proof-of-guardrail, a TEE-based system generating cryptographic attestations that a specific open-source guardrail was executed on an agent response
  • Implements the system for OpenClaw agents and evaluates latency overhead and deployment cost
  • Identifies residual deception risks even with proof (e.g., malicious developers jailbreaking the guardrail itself)

🛡️ Threat Analysis

Output Integrity Attack

Proposes a verifiable inference scheme that uses TEE attestation to prove guardrail code was executed on agent outputs, directly addressing output integrity and tamper-evident safety enforcement. Without such proof, a developer could bypass the guardrail while still claiming outputs were safety-checked; this scheme makes that misrepresentation cryptographically detectable.


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, white_box
Applications
ai agents, llm safety enforcement, online ai services