Tracking Capabilities for Safer Agents

AI agents that interact with the real world through tool calls pose fundamental safety challenges: they might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose placing the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language, Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, which prevents information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.

Key Contributions

Capability-safe agent safety harness (TACIT) using Scala 3 capture checking to statically track and enforce fine-grained agent permissions over tools and effects
Local purity enforcement that prevents information leakage by ensuring that sub-computations processing sensitive or classified data are provably side-effect-free
Empirical validation showing LLM agents generate capability-safe code with no significant degradation in task performance while the type system blocks unsafe behaviors
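
The idea of capabilities as program variables can be illustrated with a minimal Scala 3 sketch. It is not taken from TACIT: the names `Network`, `SecretFile`, `publish`, and `processClassified` are hypothetical, and the code uses plain Scala 3 without the experimental capture-checking extension, so the purity of the classified sub-computation is only a convention here. With capture checking enabled, the function parameter could presumably be given a pure function type (written `String -> A` rather than `String => A`), so that the compiler would reject any argument that captures a capability such as `Network`.

```scala
// Hypothetical sketch: capabilities modeled as ordinary values.
// None of these names come from TACIT; they are illustrative assumptions.

// Capability granting permission to send data over the network.
// Sent payloads are recorded in memory so the effect can be observed.
final class Network:
  private val sent = scala.collection.mutable.ListBuffer.empty[String]
  def send(payload: String): Unit = sent += payload
  def log: List[String] = sent.toList

// Capability granting read access to a classified document.
final class SecretFile(contents: String):
  def read(): String = contents

// Requires a Network capability, so it can only be called where one is in scope.
def publish(message: String)(using net: Network): Unit =
  net.send(message)

// Applies `f` to the classified contents. Plain Scala 3 does not stop `f`
// from capturing a Network; under capture checking, a pure function type
// for `f` is what would turn this convention into a static guarantee.
def processClassified[A](secret: SecretFile)(f: String => A): A =
  f(secret.read())

@main def demo(): Unit =
  val net = Network()
  publish("status: ok")(using net)   // allowed: the capability is passed explicitly

  val secret = SecretFile("launch code 1234")
  // The lambda below touches no capability; it only computes on its input.
  val wordCount = processClassified(secret)(text => text.split(" ").length)

  println(net.log)    // List(status: ok)
  println(wordCount)  // 3
```

The design choice worth noting is that `publish` cannot be called without a `Network` value in scope, which makes the permission visible in the signature. The sketch leaves the purity of `f` unchecked, and that remaining gap is what statically tracked capabilities are meant to close.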