
Safeguarding Efficacy in Large Language Models: Evaluating Resistance to Human-Written and Algorithmic Adversarial Prompts

Tiarnaigh Downey-Webb, Olamide Jogunola, Oluwaseun Ajao

0 citations · 23 references · arXiv


Published on arXiv: 2510.15973

Input Manipulation Attack (OWASP ML Top 10, ML01)

Prompt Injection (OWASP LLM Top 10, LLM01)

Key Finding

Llama-2 achieves the lowest attack success rate (3.4%), while Phi-2 is the most vulnerable (7.0%); GCG and TAP attacks transfer to GPT-4 with up to 17% success despite failing against Llama-2, their primary target.


This paper presents a systematic security assessment of four prominent Large Language Models (LLMs) against diverse adversarial attack vectors. We evaluate Phi-2, Llama-2-7B-Chat, GPT-3.5-Turbo, and GPT-4 across four distinct attack categories: human-written prompts, AutoDAN, Greedy Coordinate Gradient (GCG), and Tree-of-Attacks-with-pruning (TAP). Our comprehensive evaluation employs 1,200 carefully stratified prompts from the SALAD-Bench dataset, spanning six harm categories. Results demonstrate significant variations in model robustness, with Llama-2 achieving the highest overall security (3.4% average attack success rate) while Phi-2 exhibits the greatest vulnerability (7.0% average attack success rate). We identify critical transferability patterns where GCG and TAP attacks, though ineffective against their target model (Llama-2), achieve substantially higher success rates when transferred to other models (up to 17% for GPT-4). Statistical analysis using Friedman tests reveals significant differences in vulnerability across harm categories ($p < 0.001$), with malicious use prompts showing the highest attack success rates (10.71% average). Our findings contribute to understanding cross-model security vulnerabilities and provide actionable insights for developing targeted defense mechanisms.
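The headline numbers above are attack success rates (ASR): the fraction of prompts judged to elicit a harmful response, computed per harm category and averaged. A minimal sketch of that metric, using hypothetical verdicts rather than the paper's data, and an `asr_by_category` helper name of my own choosing:

```python
# Minimal ASR aggregation sketch; category names and verdicts are hypothetical.
from collections import defaultdict

def asr_by_category(results):
    """results: iterable of (harm_category, attack_succeeded) pairs.
    Returns per-category attack success rate in percent."""
    hits, totals = defaultdict(int), defaultdict(int)
    for cat, succeeded in results:
        totals[cat] += 1
        hits[cat] += int(succeeded)
    return {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}

# Hypothetical judged outcomes for four prompts:
results = [("malicious_use", True), ("malicious_use", False),
           ("misinformation", False), ("misinformation", False)]
print(asr_by_category(results))  # {'malicious_use': 50.0, 'misinformation': 0.0}
```

The paper's stratified design (1,200 prompts over six categories) simply fixes `totals[cat]` to be equal across categories, so the unweighted average of per-category ASRs matches the overall ASR.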


Key Contributions

  • Systematic comparative security evaluation of four prominent LLMs (Phi-2, Llama-2, GPT-3.5-Turbo, GPT-4) across four adversarial attack types using 1,200 stratified prompts
  • Discovery of critical cross-model transferability: GCG and TAP attacks that fail against Llama-2 achieve up to 17% success on GPT-4
  • Statistical analysis (Friedman tests) identifying significant vulnerability differences across six harm categories, with malicious-use prompts peaking at 10.71% average ASR
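The Friedman test mentioned above ranks the harm categories within each model and asks whether those rankings agree more than chance would allow. A small pure-Python sketch of the basic statistic (average ranks for ties, no tie-correction factor); the ASR matrix in the usage example is made up, not the paper's data:

```python
def friedman_statistic(blocks):
    """blocks: n observations (e.g. models), each a list of k treatment
    values (e.g. per-category ASRs). Returns the Friedman chi-square
    statistic (basic form, without the tie-correction factor)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for obs in blocks:
        order = sorted(range(k), key=lambda j: obs[j])
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            # extend j over a run of tied values, then assign the mean rank
            while j + 1 < k and obs[order[j + 1]] == obs[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1..j+1
            for m in range(i, j + 1):
                ranks[order[m]] = avg
            i = j + 1
        for j in range(k):
            rank_sums[j] += ranks[j]
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3 * n * (k + 1)

# Rows = models (blocks), columns = six harm categories; values are
# hypothetical ASR percentages for illustration only.
per_model_asr = [
    [4.1, 10.7, 3.2, 5.5, 2.8, 6.0],  # Phi-2 (made up)
    [1.0,  6.2, 1.5, 2.9, 1.1, 3.4],  # Llama-2 (made up)
    [2.2,  8.9, 2.7, 4.1, 2.0, 4.8],  # GPT-3.5-Turbo (made up)
    [2.5,  9.3, 2.4, 4.4, 1.9, 5.1],  # GPT-4 (made up)
]
print(friedman_statistic(per_model_asr))
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom; a small p-value (the paper reports $p < 0.001$) indicates that some harm categories are consistently more vulnerable than others across all four models.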

🛡️ Threat Analysis

Input Manipulation Attack

GCG (Greedy Coordinate Gradient) is a gradient-based adversarial-suffix optimization attack: a token-level perturbation technique that qualifies as an input manipulation attack. The paper evaluates its success rates and its transferability across models.
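To make the "coordinate-wise" part concrete, here is a deliberately toy sketch of the search loop at GCG's core. Real GCG uses gradients of the target model's loss with respect to one-hot token embeddings to shortlist candidate swaps; the synthetic `loss` below (Hamming distance to a fixed string) stands in for that so the loop is runnable. All names and parameters here are illustrative, not from the paper:

```python
# Toy coordinate-wise suffix search in the style of GCG. The real attack
# replaces the synthetic loss with -log p(target response | prompt + suffix)
# under the victim model, and picks candidates by gradient magnitude.
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")
TARGET = "jail"  # stand-in target the suffix should drive the loss toward

def loss(suffix):
    # Synthetic proxy objective: Hamming distance to TARGET.
    return sum(a != b for a, b in zip(suffix, TARGET))

def gcg_toy(length=4, iters=200, k=8, seed=0):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(length)]
    best = loss(suffix)
    for _ in range(iters):
        pos = rng.randrange(length)        # pick one coordinate (token slot)
        candidates = rng.sample(VOCAB, k)  # real GCG: top-k tokens by gradient
        for tok in candidates:
            trial = suffix[:pos] + [tok] + suffix[pos + 1:]
            trial_loss = loss(trial)
            if trial_loss < best:          # greedy: keep the best swap found
                best, suffix = trial_loss, trial
    return "".join(suffix), best

print(gcg_toy())
```

Because the search only needs a loss signal per candidate, the suffix it finds is tied to the model it was optimized against, which is why the paper's transferability result (suffixes tuned on Llama-2 succeeding up to 17% of the time on GPT-4) is notable.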


Details

Domains
nlp
Model Types
llm
Threat Tags
white_box · black_box · inference_time · targeted
Datasets
SALAD-Bench
Applications
large language model safety · jailbreak resistance evaluation