
$α^3$-SecBench: A Large-Scale Evaluation Suite of Security, Resilience, and Trust for LLM-based UAV Agents over 6G Networks

Mohamed Amine Ferrag 1, Abderrahmane Lakas 1, Merouane Debbah 2



Published on arXiv: 2601.18754

Threat categories: Prompt Injection (OWASP LLM Top 10 — LLM01); Excessive Agency (OWASP LLM Top 10 — LLM08)

Key Finding

Across 23 LLMs evaluated on 175 threat types, normalized security scores range from 12.9% to 57.1%, with anomaly detection far outperforming vulnerability attribution and adversarial mitigation.

Novel technique introduced: α³-SecBench


Abstract

Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents on reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings. We introduce $α^{3}$-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from $α^{3}$-Bench, the framework augments benign episodes with 20,000 validated security overlay attack scenarios targeting seven autonomy layers: sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. $α^{3}$-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage). We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs using thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release $α^{3}$-SecBench on GitHub: https://github.com/maferrag/AlphaSecBench
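The overlay mechanism the abstract describes — augmenting a benign multi-turn mission with a validated attack scenario targeting one of the seven autonomy layers — might be sketched as follows. The `Episode` and `AttackOverlay` structures, the layer names as Python identifiers, and the `inject_overlay` helper are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass, field

# The seven autonomy layers named in the abstract (identifiers are our own).
AUTONOMY_LAYERS = [
    "sensing", "perception", "planning", "control",
    "communication", "edge_cloud", "llm_reasoning",
]

@dataclass
class Episode:
    """A multi-turn conversational UAV mission (hypothetical structure)."""
    turns: list
    overlays: list = field(default_factory=list)

@dataclass
class AttackOverlay:
    """A validated attack scenario targeting one autonomy layer."""
    layer: str
    threat_type: str
    payload: str

def inject_overlay(episode: Episode, overlay: AttackOverlay, turn_idx: int) -> Episode:
    """Return a new episode with the overlay's payload attached at one turn.

    The benign episode is left untouched, mirroring the idea of layering
    attack scenarios on top of an existing mission corpus.
    """
    if overlay.layer not in AUTONOMY_LAYERS:
        raise ValueError(f"unknown autonomy layer: {overlay.layer}")
    augmented = Episode(turns=list(episode.turns), overlays=list(episode.overlays))
    augmented.turns[turn_idx] = augmented.turns[turn_idx] + "\n" + overlay.payload
    augmented.overlays.append(overlay)
    return augmented
```

An agent under test would then be run on the augmented episode, and its detection, attribution, and mitigation behavior scored against the overlay's ground-truth `layer` and `threat_type`.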


Key Contributions

  • First large-scale security evaluation suite (α³-SecBench) for LLM-based UAV agents, comprising 20,000 validated adversarial attack scenarios across 7 autonomy layers and 175 threat types drawn from 113,475 missions
  • Three-dimensional evaluation framework covering security (attack detection and vulnerability attribution), resilience (safe degradation), and trust (policy-compliant tool usage) for 23 state-of-the-art LLMs
  • Empirical finding that current LLMs reliably detect anomalies but exhibit major gaps in mitigation, vulnerability attribution, and trustworthy control, with normalized scores ranging from 12.9% to 57.1%
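The normalized overall scores quoted above (12.9%–57.1%) combine the three evaluation dimensions. The paper's exact aggregation is not given here, so the equal weighting below is an assumption for illustration only.

```python
def normalized_overall_score(security: float, resilience: float, trust: float) -> float:
    """Combine three dimension scores (each in [0, 1]) into a percentage.

    Equal weighting is an assumption; α³-SecBench's actual aggregation
    rule is not specified in this summary.
    """
    for s in (security, resilience, trust):
        if not 0.0 <= s <= 1.0:
            raise ValueError("dimension scores must lie in [0, 1]")
    return 100.0 * (security + resilience + trust) / 3.0
```

Under this toy scheme, a model strong at anomaly detection (high security-dimension detection) but weak at mitigation and trustworthy control would still land near the bottom of the reported range, consistent with the gap the benchmark highlights.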

🛡️ Threat Analysis


Details

Domains
multimodal · reinforcement-learning · nlp
Model Types
llm
Threat Tags
inference_time · black_box · targeted
Datasets
α³-Bench (113,475 UAV missions, 20,000 adversarial overlays)
Applications
autonomous UAV systems · LLM-based autonomous agents · 6G-enabled networked robotics