Benchmarking LLAMA Model Security Against OWASP Top 10 For LLM Applications
Published on arXiv
2601.19970
Prompt Injection
OWASP LLM Top 10 — LLM01
Sensitive Information Disclosure
OWASP LLM Top 10 — LLM06
Key Finding
Llama-Guard-3-1B achieves the highest threat detection rate of 76% at 0.165s latency, while the base Llama-3.1-8B detects 0% of adversarial prompts despite 4.6x longer inference time.
As large language models (LLMs) move from research prototypes to enterprise systems, their security vulnerabilities pose serious risks to data privacy and system integrity. This study benchmarks various Llama model variants against the OWASP Top 10 for LLM Applications framework, evaluating threat detection accuracy, response safety, and computational overhead. Using the FABRIC testbed with NVIDIA A30 GPUs, we tested five standard Llama models and five Llama Guard variants on 100 adversarial prompts covering ten vulnerability categories. Our results reveal significant differences in security performance: the compact Llama-Guard-3-1B model achieved the highest detection rate of 76% with minimal latency (0.165s per test), whereas base models such as Llama-3.1-8B failed to detect threats (0% accuracy) despite longer inference times (0.754s). We observe an inverse relationship between model size and security effectiveness, suggesting that smaller, specialized models often outperform larger general-purpose ones in security tasks. Additionally, we provide an open-source benchmark dataset including adversarial prompts, threat labels, and attack metadata to support reproducible research in AI security, [1].
Key Contributions
- Open-source benchmark dataset of 100 structured adversarial prompts with ground-truth safety labels and metadata covering all 10 OWASP LLM vulnerability categories, using 23 distinct injection techniques.
- Comparative evaluation of 5 base Llama models vs. 5 Llama Guard variants, quantifying detection accuracy, inference latency, and VRAM usage per OWASP category.
- Empirical finding of an inverse relationship between model size and security effectiveness, showing that specialized smaller models (Llama-Guard-3-1B, 76% detection) outperform larger general-purpose ones (Llama-3.1-8B, 0% detection).