Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks
Viet K. Nguyen, Mohammad I. Husain
Published on arXiv (arXiv:2512.14860)
- Prompt Injection (OWASP LLM Top 10, LLM01)
- Insecure Plugin Design (OWASP LLM Top 10, LLM07)
- Excessive Agency (OWASP LLM Top 10, LLM08)
Key Finding
More than half of malicious prompts succeeded across all configurations (41.5% overall refusal rate), with Grok 2 on CrewAI rejecting only 2 of 13 attacks (15.4%), demonstrating that current enterprise-grade safety mechanisms are insufficient for agentic deployments.
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o, when acting as an agent, successfully executes attacks it refuses in chat mode, no comparative analysis across multiple models and frameworks exists. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) on two agentic AI frameworks (AutoGen and CrewAI), using a seven-agent architecture that mimics a university information management system and 13 distinct attack scenarios spanning prompt injection, Server-Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI's 30.8%, while model refusal rates range from Nova Pro's 46.2% down to Claude 3.5 Sonnet's and Grok 2's 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (a 15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy in which models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are included in the Appendix to enable reproducibility.
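The reported percentages can be cross-checked with simple arithmetic. The sketch below is not from the paper; the refusal counts (34, 20, 54) are back-solved from the stated rates over 65 per-framework cases (5 models × 13 scenarios) and 130 total cases, with only the Grok 2 on CrewAI count (2 of 13) stated explicitly in the text.

```python
def refusal_rate(refused: int, total: int) -> float:
    """Percentage of attack prompts the agent refused, rounded to one decimal."""
    return round(100 * refused / total, 1)

# 13 attack scenarios x 5 models x 2 frameworks = 130 test cases.
assert refusal_rate(2, 13) == 15.4    # Grok 2 on CrewAI (stated in the paper)
assert refusal_rate(34, 65) == 52.3   # AutoGen aggregate (count inferred from the rate)
assert refusal_rate(20, 65) == 30.8   # CrewAI aggregate (count inferred from the rate)
assert refusal_rate(54, 130) == 41.5  # overall: 34 + 20 refusals across 130 cases
```

Note that the inferred per-framework counts (34 and 20) sum to 54, which is consistent with the overall 41.5% figure.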
Key Contributions
- First systematic comparative penetration testing of five LLMs (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, Nova Pro) across two agentic frameworks (AutoGen and CrewAI) using 130 standardized test cases across 13 attack scenarios
- Quantifies significant security disparities: AutoGen shows a 52.3% refusal rate vs. CrewAI's 30.8%; model refusal rates range from 38.5% (Claude 3.5 Sonnet and Grok 2) to 46.2% (Nova Pro); Grok 2 on CrewAI refuses only 15.4% of attacks
- Identifies six defensive behavior patterns, including the novel "hallucinated compliance" strategy in which models fabricate outputs instead of executing or refusing attacks, and provides recommendations for secure agent deployment