Cross-Modal Phantom: Coordinated Camera-LiDAR Spoofing Against Multi-Sensor Fusion in Autonomous Vehicles

Autonomous vehicles (AVs) increasingly depend on multi-sensor fusion (MSF) to combine complementary modalities such as cameras and LiDAR for robust perception. While this redundancy is intended to safeguard against single-sensor failures, the fusion process itself introduces a subtle and underexplored vulnerability. In this work, we investigate whether an attacker can bypass MSF's redundancy by fabricating cross-sensor consistency, that is, by making multiple sensors agree on the same false object. We design a coordinated, data-level (early-fusion) attack that emulates the outcome of two synchronized physical spoofing sources: an infrared (IR) projection that induces a false camera detection and a LiDAR signal injection that produces a matching 3D point cluster. Rather than implementing the physical attack hardware, we simulate its sensor-level outcomes by inserting perspective-aware image patches and synthetic LiDAR point clusters aligned in 3D space. This approach preserves the perceptual effects that real IR-based and intentional electromagnetic interference (IEMI)-based spoofing would create at the sensor output. In a large-scale evaluation on 400 KITTI scenes, the coordinated spoofing deceives a state-of-the-art perception model with an 85.5% attack success rate. These findings provide the first quantitative evidence that malicious cross-modal consistency can compromise MSF-based perception, revealing a critical vulnerability in the core data-fusion logic of modern autonomous vehicle systems.
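The geometric idea behind "perspective-aware image patches and synthetic LiDAR point clusters aligned in 3D space" can be illustrated with a minimal sketch. It is not the authors' implementation. The pinhole intrinsics, the phantom box dimensions, and every function name below are hypothetical. For simplicity the points are expressed directly in the camera frame, whereas real KITTI data would first need the LiDAR-to-camera calibration transform. Filling the box volume uniformly is also a simplification, since real LiDAR returns lie on visible surfaces.

```python
import random
from itertools import product

# Hypothetical pinhole intrinsics in pixels (assumed values, not from the paper).
FX, FY, CX, CY = 720.0, 720.0, 620.0, 187.0

def project(point):
    """Project a camera-frame point (x right, y down, z forward) to pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (FX * x / z + CX, FY * y / z + CY)

def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box (width, height, length)."""
    cx, cy, cz = center
    w, h, l = size
    return [(cx + sx * w / 2, cy + sy * h / 2, cz + sz * l / 2)
            for sx, sy, sz in product((-1, 1), repeat=3)]

def patch_rect(center, size):
    """Image rectangle (u_min, v_min, u_max, v_max) bounding the projected box.

    This is where a perspective-consistent camera patch would be placed.
    """
    us, vs = zip(*(project(c) for c in box_corners(center, size)))
    return (min(us), min(vs), max(us), max(vs))

def synth_cluster(center, size, n=200, seed=0):
    """Sample n synthetic points inside the same 3D box (simplified volume fill)."""
    rng = random.Random(seed)
    cx, cy, cz = center
    w, h, l = size
    return [(cx + rng.uniform(-w / 2, w / 2),
             cy + rng.uniform(-h / 2, h / 2),
             cz + rng.uniform(-l / 2, l / 2)) for _ in range(n)]

# Hypothetical phantom object: a car-sized box 15 m ahead of the camera.
phantom_center = (0.0, 1.0, 15.0)
phantom_size = (1.8, 1.5, 4.0)

rect = patch_rect(phantom_center, phantom_size)
cluster = synth_cluster(phantom_center, phantom_size)

# Cross-modal alignment check: every synthetic point projects inside the patch.
EPS = 1e-6
aligned = all(rect[0] - EPS <= u <= rect[2] + EPS and
              rect[1] - EPS <= v <= rect[3] + EPS
              for u, v in map(project, cluster))
```

Because both modalities are derived from one 3D box, the projected point cluster necessarily falls within the image patch region, which is the kind of agreement a fusion model would treat as corroboration.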

Key Contributions

First coordinated cross-modal spoofing attack exploiting MSF's reliance on cross-sensor consistency
Data-level attack design simulating synchronized IR projection and LiDAR signal injection outcomes
Large-scale evaluation on 400 KITTI scenes demonstrating 85.5% attack success rate against state-of-the-art MSF perception

🛡️ Threat Analysis

Input Manipulation Attack

The attack manipulates the inputs to AV perception models at inference time by injecting synchronized false sensor data (IR-induced camera patches plus LiDAR point clusters), causing the model to detect phantom objects that do not exist. This is input manipulation that produces incorrect model outputs during real-time operation.
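Why synchronized injection matters can be shown with a toy cross-sensor consistency gate. This is only an illustrative sketch, not the perception model evaluated in the paper: the gate, its threshold, and all names and coordinates below are hypothetical. It confirms a camera box only when enough projected LiDAR returns fall inside it.

```python
def fused_detections(camera_boxes, lidar_pixels, min_points=5):
    """Keep camera boxes that contain at least `min_points` projected LiDAR returns.

    camera_boxes: list of (u_min, v_min, u_max, v_max) rectangles in pixels.
    lidar_pixels: list of (u, v) LiDAR points already projected into the image.
    """
    confirmed = []
    for (u0, v0, u1, v1) in camera_boxes:
        hits = sum(1 for (u, v) in lidar_pixels
                   if u0 <= u <= u1 and v0 <= v <= v1)
        if hits >= min_points:
            confirmed.append((u0, v0, u1, v1))
    return confirmed

# Hypothetical phantom: a spoofed camera box and a matching spoofed point cluster.
phantom_box = (500.0, 150.0, 600.0, 220.0)
phantom_points = [(510.0 + 8 * i, 180.0) for i in range(10)]

camera_only = fused_detections([phantom_box], [])              # no LiDAR support
lidar_only = fused_detections([], phantom_points)              # no camera support
coordinated = fused_detections([phantom_box], phantom_points)  # both agree
```

In this toy setting, spoofing either sensor alone yields no confirmed object, while the coordinated case passes the gate. That mirrors the point of the threat analysis: redundancy helps only as long as the sensors cannot be made to agree on the same false object.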

Details

Domains

vision, multimodal

Model Types

multimodal, cnn

Threat Tags

inference_time, targeted, physical, digital

Datasets

KITTI

Applications

2025, 0 citations

Similar Papers

SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion

Adversarial Patch Generation for Visual-Infrared Dense Prediction Tasks via Joint Position-Color Optimization

Adversarial Patch Attack for Ship Detection via Localized Augmentation

Distillation-Enhanced Physical Adversarial Attacks

IPG: Incremental Patch Generation for Generalized Adversarial Patch Training

Exploring Semantic-constrained Adversarial Example with Instruction Uncertainty Reduction

See No Evil: Adversarial Attacks Against Linguistic-Visual Association in Referring Multi-Object Tracking Systems

A Single Set of Adversarial Clothes Breaks Multiple Defense Methods in the Physical World