Benchmark · 2025

AVTrustBench: Assessing and Enhancing Reliability and Robustness in Audio-Visual LLMs

Sanjoy Chowdhury 1, Sayan Nag 2, Subhrajyoti Dasgupta 3,4,5, Yaoting Wang 5, Mohamed Elhoseiny 5, Ruohan Gao 1, Dinesh Manocha 1

12 citations · arXiv

Published on arXiv · 2501.02135

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Most existing AVLLMs fall significantly short of human-level performance on adversarial audio-visual inputs; CAVPref improves robustness by up to 30.19% across all 9 benchmark tasks.

CAVPref

Novel technique introduced


With the rapid advancement of Multi-modal Large Language Models (MLLMs), several diagnostic benchmarks have recently been developed to assess their multi-modal reasoning proficiency. However, these benchmarks primarily assess the visual modality and do not examine holistic audio-visual (AV) understanding. Moreover, no existing benchmark investigates the ability of Audio-Visual Large Language Models (AVLLMs) to calibrate their responses when presented with perturbed inputs. To this end, we introduce the Audio-Visual Trustworthiness assessment Benchmark (AVTrustBench), comprising 600K samples spanning 9 meticulously crafted tasks that evaluate AVLLMs along three distinct dimensions: adversarial attack, compositional reasoning, and modality-specific dependency. Using our benchmark, we extensively evaluate 13 state-of-the-art AVLLMs. The findings reveal that the majority of existing models fall significantly short of human-like comprehension, offering valuable insights for future research directions. To alleviate these limitations, we further propose CAVPref, a robust, model-agnostic training strategy based on calibrated audio-visual preference optimization, obtaining gains of up to 30.19% across all 9 tasks. We will publicly release our code and benchmark to facilitate future research in this direction.


Key Contributions

  • AVTrustBench: a 600K-sample audio-visual trustworthiness benchmark spanning 9 tasks across adversarial attack, compositional reasoning, and modality-specific dependency dimensions
  • Systematic evaluation of 13 state-of-the-art AVLLMs revealing significant gaps relative to human-level performance
  • CAVPref: a model-agnostic calibrated audio-visual preference optimization training strategy achieving up to 30.19% gain across all 9 tasks

🛡️ Threat Analysis

Input Manipulation Attack

The Adversarial Attack dimension of AVTrustBench explicitly tests model robustness to perturbed audio-visual inputs (input manipulation at inference time), and CAVPref is proposed as a defense to improve robustness against these adversarial perturbations.
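CAVPref's exact objective is not reproduced in this summary, but preference-optimization training strategies of this kind typically build on a DPO-style loss: the model is trained to rank a trustworthy response above a flawed one relative to a frozen reference model. The sketch below shows that generic loss on scalar per-response log-probabilities; the `calib_weight` factor is a hypothetical stand-in for CAVPref's audio-visual calibration term, whose actual form is defined in the paper, not here.

```python
import math

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    beta=0.1, calib_weight=1.0):
    """Generic DPO-style preference loss on per-response log-probabilities.

    NOTE: `calib_weight` is a hypothetical per-sample calibration factor,
    illustrating where a modality-aware weighting (as in CAVPref) could
    enter; it is not the paper's actual formulation.
    """
    # Implicit reward margin: how much more the policy (vs. the reference
    # model) prefers the chosen response over the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): small when the policy already ranks the
    # trustworthy response above the one elicited by perturbed inputs.
    return -calib_weight * math.log(1.0 / (1.0 + math.exp(-margin)))
```

As the policy widens the margin in favor of the trustworthy response, the loss shrinks toward zero, which is the mechanism a robustness-oriented preference objective exploits.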


Details

Domains
audio · multimodal · nlp
Model Types
llm · multimodal
Threat Tags
inference_time
Datasets
AVTrustBench (introduced)
Applications
audio-visual question answering · multimodal reasoning