ASTRA: Agentic Steerability and Risk Assessment Framework
Itay Hazan , Yael Mathov , Guy Shtar , Ron Bitton , Itsik Mantin
Published on arXiv — 2511.18114
Excessive Agency (OWASP LLM Top 10 — LLM08)
Prompt Injection (OWASP LLM Top 10 — LLM01)
Key Finding
Evaluation of 13 open-source LLMs revealed significant differences in their ability to enforce system-prompt-level security guardrails when operating as autonomous agents with tool access.
ASTRA — novel technique introduced
Securing AI agents powered by Large Language Models (LLMs) is one of the most critical challenges in AI security today. Unlike traditional software, AI agents use an LLM as their "brain" to autonomously perform actions via connected tools. This capability introduces risks that go far beyond the harmful text of a chatbot, which was until recently the main application of LLMs. A compromised AI agent can deliberately abuse powerful tools to perform malicious, often irreversible actions, limited only by the guardrails on the tools themselves and the LLM's ability to enforce them. This paper presents ASTRA, a first-of-its-kind framework for evaluating how effectively LLMs support the creation of secure agents that enforce custom guardrails defined at the system-prompt level (e.g., "Do not send an email outside the company domain," or "Never extend the robotic arm more than 2 meters"). Our holistic framework simulates 10 diverse autonomous agents, ranging from a coding assistant to a delivery drone, equipped with 37 unique tools. We test these agents against a suite of novel attacks developed specifically for agentic threats, inspired by the OWASP Top 10 but adapted to challenge the LLM's ability to enforce policy during multi-turn planning and strict tool activation. By evaluating 13 open-source, tool-calling LLMs, we uncovered surprising and significant differences in their ability to remain secure and operate within their boundaries. The purpose of this work is to provide the community with a robust and unified methodology for building and validating better LLMs, ultimately pushing toward more secure and reliable agentic AI systems.
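To make the two example guardrails concrete, the sketch below shows what enforcing them at the tool boundary might look like; the tool names, argument shapes, and thresholds here are hypothetical illustrations for this abstract, not ASTRA's actual agents or tools:

```python
# Hypothetical sketch: validating proposed tool calls against
# system-prompt-level guardrails before they execute. An LLM that
# enforces its guardrails should never propose a call that fails
# these checks; an attack succeeds if a non-compliant call gets through.

ALLOWED_DOMAIN = "company.com"   # "Do not send an email outside the company domain"
MAX_ARM_EXTENSION_M = 2.0        # "Never extend the robotic arm more than 2 meters"

def check_tool_call(tool: str, args: dict) -> bool:
    """Return True if a proposed tool call complies with the guardrails."""
    if tool == "send_email":
        return args.get("to", "").endswith("@" + ALLOWED_DOMAIN)
    if tool == "extend_arm":
        return args.get("meters", 0.0) <= MAX_ARM_EXTENSION_M
    return True  # tools with no guardrail pass through

# An attacker-influenced plan proposing an external email is rejected:
print(check_tool_call("send_email", {"to": "attacker@evil.com"}))   # False
print(check_tool_call("extend_arm", {"meters": 1.5}))               # True
```

In an agentic setting such checks can only cover guardrails expressible as tool-argument constraints; policies that depend on multi-turn context must still be enforced by the LLM itself, which is the ability ASTRA measures.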
Key Contributions
- ASTRA: a first-of-its-kind evaluation framework simulating 10 diverse autonomous agents with 37 unique tools to assess LLM security in agentic settings
- A suite of novel agentic attacks inspired by the OWASP Top 10, adapted to challenge policy enforcement during multi-turn planning and strict tool activation
- Comparative evaluation of 13 open-source tool-calling LLMs revealing significant and surprising differences in their ability to enforce custom guardrails