Published on arXiv

2604.04759

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74% across four backbone models, with even the most robust model (Opus 4.6) exhibiting a more than threefold increase over its baseline.

CIK-Bench

Novel technique introduced


OpenClaw, the most widely deployed personal AI agent in early 2026, operates with full local system access and integrates with sensitive services such as Gmail, Stripe, and the filesystem. While these broad privileges enable high levels of automation and powerful personalization, they also expose a substantial attack surface that existing sandboxed evaluations fail to capture. To address this gap, we present the first real-world safety evaluation of OpenClaw and introduce the CIK taxonomy, which unifies an agent's persistent state into three dimensions, i.e., Capability, Identity, and Knowledge, for safety analysis. Our evaluations cover 12 attack scenarios on a live OpenClaw instance across four backbone models (Claude Sonnet 4.5, Opus 4.6, Gemini 3.1 Pro, and GPT-5.4). The results show that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74%, with even the most robust model exhibiting more than a threefold increase over its baseline vulnerability. We further assess three CIK-aligned defense strategies alongside a file-protection mechanism; however, the strongest defense still yields a 63.8% success rate under Capability-targeted attacks, while file protection blocks 97% of malicious injections but also prevents legitimate updates. Taken together, these findings show that the vulnerabilities are inherent to the agent architecture, necessitating more systematic safeguards to secure personal AI agents. Our project page is https://ucsc-vlaa.github.io/CIK-Bench.
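To put the headline numbers in perspective, the short sketch below works out the relative and absolute increase implied by the averages quoted in the abstract (24.6% baseline, 64-74% after poisoning). The function names are illustrative and not taken from the paper. The "more than threefold" claim refers to the most robust model against its own baseline, which this summary does not state, so only the averaged figures are used here.

```python
def relative_increase(baseline_pct: float, attacked_pct: float) -> float:
    """Ratio of the attacked success rate to the baseline success rate."""
    return attacked_pct / baseline_pct


def point_increase(baseline_pct: float, attacked_pct: float) -> float:
    """Absolute increase in percentage points."""
    return attacked_pct - baseline_pct


# Averages quoted in the abstract (percent).
BASELINE_ASR = 24.6
POISONED_ASR_RANGE = (64.0, 74.0)

for attacked in POISONED_ASR_RANGE:
    ratio = relative_increase(BASELINE_ASR, attacked)
    points = point_increase(BASELINE_ASR, attacked)
    print(f"{BASELINE_ASR}% -> {attacked}%: x{ratio:.2f}, +{points:.1f} points")
```

On the averaged figures, the increase is roughly 2.6x to 3.0x, or about 39 to 49 percentage points.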


Key Contributions

  • First unified CIK taxonomy (Capability, Identity, Knowledge) for analyzing persistent state vulnerabilities in AI agents
  • Real-world safety evaluation of a deployed OpenClaw agent with live Gmail/Stripe/filesystem integration across 12 attack scenarios
  • Demonstrates that poisoning any single CIK dimension increases the average attack success rate from 24.6% to 64-74%, with the strongest defense still yielding a 63.8% success rate under Capability-targeted attacks
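As a rough illustration of how per-dimension results like those above could be tallied, the sketch below models the three CIK dimensions as an enum and computes an attack success rate per dimension from a list of trial outcomes. All names (`CIKDimension`, `Trial`, `attack_success_rate`) and the toy outcomes are hypothetical; they are not the paper's code or data.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class CIKDimension(Enum):
    """The three persistent-state dimensions named by the CIK taxonomy."""
    CAPABILITY = "capability"
    IDENTITY = "identity"
    KNOWLEDGE = "knowledge"


@dataclass(frozen=True)
class Trial:
    """One attack attempt (hypothetical record layout)."""
    dimension: CIKDimension
    model: str
    attack_succeeded: bool


def attack_success_rate(trials):
    """Return {dimension: percent of trials in which the attack succeeded}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for trial in trials:
        totals[trial.dimension] += 1
        successes[trial.dimension] += int(trial.attack_succeeded)
    return {dim: 100.0 * successes[dim] / totals[dim] for dim in totals}


# Placeholder outcomes for illustration only; NOT results from the paper.
toy_trials = [
    Trial(CIKDimension.CAPABILITY, "model-a", True),
    Trial(CIKDimension.CAPABILITY, "model-a", False),
    Trial(CIKDimension.IDENTITY, "model-a", True),
    Trial(CIKDimension.KNOWLEDGE, "model-a", False),
]

rates = attack_success_rate(toy_trials)
for dim, rate in rates.items():
    print(f"{dim.value}: {rate:.1f}%")
```

Keeping the dimension as an explicit field on each trial makes it straightforward to compare a baseline run against runs where a single dimension has been poisoned.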

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, targeted
Applications
personal ai agents, autonomous agents, agent safety