
$α^3$-SecBench: A Large-Scale Evaluation Suite of Security, Resilience, and Trust for LLM-based UAV Agents over 6G Networks

Mohamed Amine Ferrag 1, Abderrahmane Lakas 1, Merouane Debbah 2



Published on arXiv: 2601.18754

Threat categories: Prompt Injection (OWASP LLM Top 10 — LLM01); Excessive Agency (OWASP LLM Top 10 — LLM08)

Key Finding

Across 23 LLMs evaluated on 175 threat types, normalized security scores range from 12.9% to 57.1%, with anomaly detection far outperforming vulnerability attribution and adversarial mitigation.

Novel technique introduced: α³-SecBench


Abstract

Autonomous unmanned aerial vehicle (UAV) systems are increasingly deployed in safety-critical, networked environments where they must operate reliably in the presence of malicious adversaries. While recent benchmarks have evaluated large language model (LLM)-based UAV agents on reasoning, navigation, and efficiency, systematic assessment of security, resilience, and trust under adversarial conditions remains largely unexplored, particularly in emerging 6G-enabled settings. We introduce $α^{3}$-SecBench, the first large-scale evaluation suite for assessing the security-aware autonomy of LLM-based UAV agents under realistic adversarial interference. Building on multi-turn conversational UAV missions from $α^{3}$-Bench, the framework augments benign episodes with 20,000 validated security overlay attack scenarios targeting seven autonomy layers: sensing, perception, planning, control, communication, edge/cloud infrastructure, and LLM reasoning. $α^{3}$-SecBench evaluates agents across three orthogonal dimensions: security (attack detection and vulnerability attribution), resilience (safe degradation behavior), and trust (policy-compliant tool usage). We evaluate 23 state-of-the-art LLMs from major industrial providers and leading AI labs using thousands of adversarially augmented UAV episodes sampled from a corpus of 113,475 missions spanning 175 threat types. While many models reliably detect anomalous behavior, effective mitigation, vulnerability attribution, and trustworthy control actions remain inconsistent. Normalized overall scores range from 12.9% to 57.1%, highlighting a significant gap between anomaly detection and security-aware autonomous decision-making. We release $α^{3}$-SecBench on GitHub: https://github.com/maferrag/AlphaSecBench
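The overlay mechanism the abstract describes — augmenting a benign multi-turn mission with a validated attack scenario targeting one of the seven autonomy layers — might be sketched as follows. The `Episode` and `AttackOverlay` structures, the layer names as Python identifiers, and the `inject_overlay` helper are illustrative assumptions, not the benchmark's actual API.

```python
from dataclasses import dataclass, field

# The seven autonomy layers named in the abstract (identifiers are our own).
AUTONOMY_LAYERS = [
    "sensing", "perception", "planning", "control",
    "communication", "edge_cloud", "llm_reasoning",
]

@dataclass
class Episode:
    """A multi-turn conversational UAV mission (hypothetical structure)."""
    turns: list
    overlays: list = field(default_factory=list)

@dataclass
class AttackOverlay:
    """A validated attack scenario targeting one autonomy layer."""
    layer: str
    threat_type: str
    payload: str

def inject_overlay(episode: Episode, overlay: AttackOverlay, turn_idx: int) -> Episode:
    """Return a new episode with the overlay's payload attached at one turn.

    The benign episode is left untouched, mirroring the idea of layering
    attack scenarios on top of an existing mission corpus.
    """
    if overlay.layer not in AUTONOMY_LAYERS:
        raise ValueError(f"unknown autonomy layer: {overlay.layer}")
    augmented = Episode(turns=list(episode.turns), overlays=list(episode.overlays))
    augmented.turns[turn_idx] = augmented.turns[turn_idx] + "\n" + overlay.payload
    augmented.overlays.append(overlay)
    return augmented
```

An agent under test would then be run on the augmented episode, and its detection, attribution, and mitigation behavior scored against the overlay's ground-truth `layer` and `threat_type`.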


Key Contributions

  • First large-scale security evaluation suite (α³-SecBench) for LLM-based UAV agents, comprising 20,000 validated adversarial attack scenarios across 7 autonomy layers and 175 threat types drawn from 113,475 missions
  • Three-dimensional evaluation framework covering security (attack detection and vulnerability attribution), resilience (safe degradation), and trust (policy-compliant tool usage) for 23 state-of-the-art LLMs
  • Empirical finding that current LLMs reliably detect anomalies but exhibit major gaps in mitigation, vulnerability attribution, and trustworthy control, with normalized scores ranging from 12.9% to 57.1%
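The normalized overall scores quoted above (12.9%–57.1%) combine the three evaluation dimensions. The paper's exact aggregation is not given here, so the equal weighting below is an assumption for illustration only.

```python
def normalized_overall_score(security: float, resilience: float, trust: float) -> float:
    """Combine three dimension scores (each in [0, 1]) into a percentage.

    Equal weighting is an assumption; α³-SecBench's actual aggregation
    rule is not specified in this summary.
    """
    for s in (security, resilience, trust):
        if not 0.0 <= s <= 1.0:
            raise ValueError("dimension scores must lie in [0, 1]")
    return 100.0 * (security + resilience + trust) / 3.0
```

Under this toy scheme, a model strong at anomaly detection (high security-dimension detection) but weak at mitigation and trustworthy control would still land near the bottom of the reported range, consistent with the gap the benchmark highlights.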

🛡️ Threat Analysis


Details

Domains
multimodal · reinforcement-learning · nlp
Model Types
llm
Threat Tags
inference_time · black_box · targeted
Datasets
α³-Bench (113,475 UAV missions, 20,000 adversarial overlays)
Applications
autonomous UAV systems · LLM-based autonomous agents · 6G-enabled networked robotics