Defense · 2026

CausalArmor: Efficient Indirect Prompt Injection Guardrails via Causal Attribution

Minbeom Kim 1,2, Mihir Parmar 1, Phillip Wallis 3, Lesly Miculicich 1, Kyomin Jung 2, Krishnamurthy Dj Dvijotham 4, Long T. Le 1, Tomas Pfister 1

0 citations · 39 references · arXiv (Cornell University)


Published on arXiv

2602.07918

Prompt Injection

OWASP LLM Top 10 — LLM01

Excessive Agency

OWASP LLM Top 10 — LLM08

Key Finding

Achieves near-zero Attack Success Rate comparable to always-on defenses while preserving benign utility and latency close to the no-defense baseline on AgentDojo and DoomArena

CausalArmor

Novel technique introduced


AI agents equipped with tool-calling capabilities are susceptible to Indirect Prompt Injection (IPI) attacks, in which malicious commands hidden within untrusted content trick the agent into performing unauthorized actions. Existing defenses can reduce attack success but often suffer from the over-defense dilemma: they deploy expensive, always-on sanitization regardless of the actual threat, thereby degrading utility and latency even in benign scenarios. We revisit IPI through a causal ablation perspective: a successful injection manifests as a dominance shift in which the user request no longer provides decisive support for the agent's privileged action, while a particular untrusted segment, such as a retrieved document or tool output, exerts disproportionate attributable influence. Based on this signature, we propose CausalArmor, a selective defense framework that (i) computes lightweight, leave-one-out ablation-based attributions at privileged decision points, and (ii) triggers targeted sanitization only when an untrusted segment dominates the user intent. Additionally, CausalArmor employs retroactive Chain-of-Thought masking to prevent the agent from acting on "poisoned" reasoning traces. We present a theoretical analysis showing that sanitization based on attribution margins conditionally yields an exponentially small upper bound on the probability of selecting malicious actions. Experiments on AgentDojo and DoomArena demonstrate that CausalArmor matches the security of aggressive defenses while improving explainability and preserving the utility and latency of AI agents.
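The leave-one-out attribution and dominance-shift test described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the `support` scorer (e.g., the agent's log-probability of the proposed privileged action given a context) and all function names are assumptions.

```python
# Hypothetical sketch of leave-one-out (LOO) attribution for detecting a
# dominance shift. `support(segments, action)` is an assumed scalar scorer
# measuring how strongly a context supports the proposed privileged action.

def loo_attributions(support, segments, action):
    """Attribute to each segment the drop in action support when it is ablated."""
    full = support(segments, action)
    attrib = {}
    for name in segments:
        ablated = {k: v for k, v in segments.items() if k != name}
        attrib[name] = full - support(ablated, action)
    return attrib

def dominance_shift(attrib, untrusted, margin=0.0):
    """Flag an injection when some untrusted segment's attribution exceeds
    the user request's attribution by more than `margin`; only then is the
    expensive sanitization step triggered."""
    user = attrib["user_request"]
    top_untrusted = max(attrib[name] for name in untrusted)
    return top_untrusted - user > margin
```

With a toy additive `support` (sum of per-segment weights), a context whose tool output carries most of the support for the action is flagged, while a benign context dominated by the user request is not — mirroring the selective-triggering behavior the paper claims.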


Key Contributions

  • Formalizes IPI as a measurable "dominance shift" where an untrusted span overtakes the user request in leave-one-out causal attribution at privileged decision points
  • CausalArmor: a selective defense that triggers targeted sanitization only when attribution margin exceeds a threshold, avoiding always-on overhead in benign scenarios
  • Retroactive Chain-of-Thought masking to remove poisoned reasoning traces from agent context history before action execution
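The retroactive Chain-of-Thought masking in the last bullet could look roughly like the sketch below. The message schema (`id`, `role`, `refs` fields) and the `MASK` placeholder are assumptions for illustration; the paper does not specify its context representation.

```python
# Hypothetical sketch of retroactive CoT masking: once an untrusted segment
# is flagged by the attribution check, replace both that segment and any
# reasoning turns that reference it in the agent's context history, so the
# agent cannot act on "poisoned" reasoning traces.

MASK = "[SANITIZED: untrusted content removed]"

def mask_poisoned_context(history, flagged_id):
    """Return a copy of the message history with the flagged untrusted
    segment and every reasoning trace referencing it replaced by MASK."""
    masked = []
    for msg in history:
        if msg.get("id") == flagged_id:
            masked.append({**msg, "content": MASK})
        elif msg.get("role") == "reasoning" and flagged_id in msg.get("refs", []):
            masked.append({**msg, "content": MASK, "refs": []})
        else:
            masked.append(msg)
    return masked
```

Masking retroactively, rather than only filtering future inputs, matters because the injected instruction may already have shaped an earlier reasoning step that would otherwise keep steering the agent toward the unauthorized action.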

🛡️ Threat Analysis


Details

Domains
nlp
Model Types
llm
Threat Tags
inference_time, black_box
Datasets
AgentDojo, DoomArena
Applications
llm agents, tool-calling ai agents, agentic ai systems