attack · 2025

See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems

Halima Bouzidi, Haoyu Liu, Mohammad Abdullah Al Faruque



Published on arXiv: 2509.02028

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

VEIL's adversarial perturbations successfully induce track ID switches and terminations in E2E RMOT models, with FIFO memory attacks causing errors that persist over multiple subsequent frames.

VEIL

Novel technique introduced


Language-vision understanding has driven the development of advanced perception systems, most notably the emerging paradigm of Referring Multi-Object Tracking (RMOT). By leveraging natural-language queries, RMOT systems can selectively track objects that satisfy a given semantic description, guided by Transformer-based spatial-temporal reasoning modules. End-to-End (E2E) RMOT models further unify feature extraction, temporal memory, and spatial reasoning within a Transformer backbone, enabling long-range spatial-temporal modeling over fused textual-visual representations. Despite these advances, the reliability and robustness of RMOT remain underexplored. In this paper, we examine the security implications of RMOT systems from a design-logic perspective, identifying adversarial vulnerabilities that compromise both the linguistic-visual referring and track-object matching components. Additionally, we uncover a novel vulnerability in advanced RMOT models employing FIFO-based memory, whereby targeted and consistent attacks on their spatial-temporal reasoning introduce errors that persist within the history buffer over multiple subsequent frames. We present VEIL, a novel adversarial framework designed to disrupt the unified referring-matching mechanisms of RMOT models. We show that carefully crafted digital and physical perturbations can corrupt the reliability of the tracking logic, inducing track ID switches and terminations. We conduct comprehensive evaluations on the Refer-KITTI dataset to validate the effectiveness of VEIL and demonstrate the urgent need for security-aware RMOT designs in critical large-scale applications.


Key Contributions

  • Identifies adversarial vulnerabilities in the linguistic-visual referring and track-object matching components of RMOT systems from a design-logic perspective.
  • Discovers a novel FIFO-based memory persistence vulnerability whereby targeted perturbations introduce temporal errors that propagate across subsequent frames in advanced RMOT models.
  • Proposes VEIL, a unified adversarial framework producing digital and physical perturbations that cause track ID switches and terminations, evaluated on Refer-KITTI.
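The FIFO memory persistence finding can be illustrated with a toy model: once a corrupted per-frame feature enters a fixed-length history buffer, every aggregate computed over that buffer is skewed until the entry is evicted. This is a minimal sketch, not the paper's architecture; the buffer length, features, and averaging step are all illustrative assumptions.

```python
from collections import deque

def run_tracker(frames, memory_len=4):
    """Toy FIFO memory: each frame's (scalar) feature is appended to a
    fixed-length buffer, and the tracker's per-frame output is an
    aggregate (here, the mean) over the whole history buffer."""
    memory = deque(maxlen=memory_len)
    outputs = []
    for feat in frames:
        memory.append(feat)
        outputs.append(sum(memory) / len(memory))
    return outputs

clean = run_tracker([1.0] * 8)
# Perturb a single frame (frame 3) with a large adversarial value.
attacked = run_tracker([1.0, 1.0, 1.0, 100.0, 1.0, 1.0, 1.0, 1.0])

# The one-frame perturbation skews the output for memory_len frames
# (while it remains in the buffer), then is evicted.
poisoned_frames = [i for i, (c, a) in enumerate(zip(clean, attacked)) if a != c]
print(poisoned_frames)  # → [3, 4, 5, 6]
```

Even this toy version shows the key property the paper exploits: a single corrupted frame produces errors that persist across multiple subsequent frames, proportional to the buffer length.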

🛡️ Threat Analysis

Input Manipulation Attack

VEIL crafts gradient-based adversarial perturbations (both digital and physical) applied at inference time to corrupt the tracking logic of multimodal Transformer-based RMOT models, inducing track ID switches and terminations — a classic input manipulation/evasion attack targeting a vision-language system.
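VEIL's implementation is not reproduced here; as an illustration of the general attack class (gradient-based, inference-time input manipulation), below is a minimal PGD-style sketch that minimizes a toy differentiable "matching score" so the referred object is dropped. The linear score, step sizes, and all names are assumptions for illustration only, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a referring-matching score: higher means the detection
# is matched to the referred track. Real RMOT models use Transformer
# reasoning; a linear score just makes the gradient explicit.
w = rng.normal(size=16)

def match_score(x):
    return float(w @ x)

def grad_score(x):
    return w  # gradient of w @ x with respect to x

def pgd_attack(x, eps=0.1, alpha=0.02, steps=10):
    """Projected gradient descent that *minimizes* the matching score,
    pushing the tracker to terminate the referred object's track,
    while keeping the perturbation within an L-infinity eps-ball."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv - alpha * np.sign(grad_score(x_adv))  # descend the score
        x_adv = np.clip(x_adv, x - eps, x + eps)            # project to eps-ball
    return x_adv

x = rng.normal(size=16)
x_adv = pgd_attack(x)
assert match_score(x_adv) < match_score(x)   # score suppressed
assert np.max(np.abs(x_adv - x)) <= 0.1 + 1e-9  # perturbation bounded
```

In the paper's setting the same principle applies to image pixels (digital) or printable patches (physical), with the gradient taken through the full multimodal Transformer rather than a linear score.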


Details

Domains
vision, multimodal
Model Types
transformer, vlm
Threat Tags
white_box, inference_time, targeted, digital, physical
Datasets
Refer-KITTI
Applications
multi-object tracking, referring multi-object tracking, autonomous driving perception