Penetration Testing of Agentic AI: A Comparative Security Analysis Across Models and Frameworks
Viet K. Nguyen, Mohammad I. Husain
Published on arXiv (arXiv:2512.14860)
- Prompt Injection (OWASP LLM Top 10, LLM01)
- Insecure Plugin Design (OWASP LLM Top 10, LLM07)
- Excessive Agency (OWASP LLM Top 10, LLM08)
Key Finding
More than half of malicious prompts succeeded across all configurations (41.5% overall refusal rate), with Grok 2 on CrewAI rejecting only 2 of 13 attacks (15.4%), demonstrating that current enterprise-grade safety mechanisms are insufficient for agentic deployments.
Agentic AI introduces security vulnerabilities that traditional LLM safeguards fail to address. Although recent work by Unit 42 at Palo Alto Networks demonstrated that ChatGPT-4o, when acting as an agent, successfully executes attacks it refuses in chat mode, no comparative analysis across multiple models and frameworks exists. We conducted the first systematic penetration testing and comparative evaluation of agentic AI systems, testing five prominent models (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, and Nova Pro) on two agentic AI frameworks (AutoGen and CrewAI), using a seven-agent architecture that mimics a university information management system and 13 distinct attack scenarios spanning prompt injection, Server-Side Request Forgery (SSRF), SQL injection, and tool misuse. Our 130 total test cases reveal significant security disparities: AutoGen demonstrates a 52.3% refusal rate versus CrewAI's 30.8%, while model refusal rates range from Nova Pro's 46.2% down to Claude 3.5 Sonnet's and Grok 2's 38.5%. Most critically, Grok 2 on CrewAI rejected only 2 of 13 attacks (a 15.4% refusal rate), and the overall refusal rate of 41.5% across all configurations indicates that more than half of malicious prompts succeeded despite enterprise-grade safety mechanisms. We identify six distinct defensive behavior patterns, including a novel "hallucinated compliance" strategy in which models fabricate outputs rather than executing or refusing attacks, and provide actionable recommendations for secure agent deployment. Complete attack prompts are included in the Appendix to enable reproducibility.
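The reported percentages can be cross-checked with simple arithmetic. The sketch below is not from the paper; the refusal counts (34, 20, 54) are back-solved from the stated rates over 65 per-framework cases (5 models × 13 scenarios) and 130 total cases, with only the Grok 2 on CrewAI count (2 of 13) stated explicitly in the text.

```python
def refusal_rate(refused: int, total: int) -> float:
    """Percentage of attack prompts the agent refused, rounded to one decimal."""
    return round(100 * refused / total, 1)

# 13 attack scenarios x 5 models x 2 frameworks = 130 test cases.
assert refusal_rate(2, 13) == 15.4    # Grok 2 on CrewAI (stated in the paper)
assert refusal_rate(34, 65) == 52.3   # AutoGen aggregate (count inferred from the rate)
assert refusal_rate(20, 65) == 30.8   # CrewAI aggregate (count inferred from the rate)
assert refusal_rate(54, 130) == 41.5  # overall: 34 + 20 refusals across 130 cases
```

Note that the inferred per-framework counts (34 and 20) sum to 54, which is consistent with the overall 41.5% figure.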
Key Contributions
- First systematic comparative penetration testing of five LLMs (Claude 3.5 Sonnet, Gemini 2.5 Flash, GPT-4o, Grok 2, Nova Pro) across two agentic frameworks (AutoGen and CrewAI) using 130 standardized test cases across 13 attack scenarios
- Quantifies significant security disparities: AutoGen shows a 52.3% refusal rate vs. CrewAI's 30.8%; model refusal rates range from 38.5% (Claude 3.5 Sonnet and Grok 2) to 46.2% (Nova Pro); Grok 2 on CrewAI refuses only 15.4% of attacks
- Identifies six defensive behavior patterns, including the novel "hallucinated compliance" strategy in which models fabricate outputs instead of executing or refusing attacks, and provides recommendations for secure agent deployment