When World Models Dream Wrong: Physical-Conditioned Adversarial Attacks against World Models

Zhixiang Guo 1, Siyuan Liang 1, Andras Balogh 2, Noah Lunberry 1, Rong-Cheng Tu 1, Mark Jelasity 2, Dacheng Tao 1

0 citations · 67 references · arXiv (Cornell University)

Published on arXiv

2602.18739

Input Manipulation Attack

OWASP ML Top 10 — ML01

Key Finding

Achieves a 0.55 attack success rate in the targeted setting while increasing FID by only ~9% and FVD by ~3.9%; downstream training on attacked videos worsens open-loop planning performance by ~20%.

PhysCond-WMA

Novel technique introduced


Generative world models (WMs) are increasingly used to synthesize controllable, sensor-conditioned driving videos, yet their reliance on physical priors exposes novel attack surfaces. In this paper, we present the Physical-Conditioned World Model Attack (PhysCond-WMA), the first white-box world model attack that perturbs physical-condition channels, such as HDMap embeddings and 3D-box features, to induce semantic, logic, or decision-level distortion while preserving perceptual fidelity. PhysCond-WMA is optimized in two stages: (1) a quality-preserving guidance stage that constrains the reverse-diffusion loss below a calibrated threshold, and (2) a momentum-guided denoising stage that accumulates target-aligned gradients along the denoising trajectory for stable, temporally coherent semantic shifts. Extensive experimental results demonstrate that our approach remains effective while increasing FID by only about 9% on average and FVD by about 3.9% on average. Under the targeted attack setting, the attack success rate (ASR) reaches 0.55. Downstream studies further show tangible risk: using attacked videos for training decreases 3D detection performance by about 4% and worsens open-loop planning performance by about 20%. These findings reveal and quantify, for the first time, security vulnerabilities in generative world models, motivating more comprehensive security checks.


Key Contributions

  • PhysCond-WMA: the first white-box adversarial attack targeting physical conditioning channels (HDMap, 3D-box) of diffusion-based generative world models for autonomous driving
  • Two-stage optimization pipeline: a quality-preserving guidance stage (reverse-diffusion loss threshold) and a momentum-guided denoising stage for stable, temporally coherent semantic shifts
  • Quantitative demonstration of downstream risks — attacked synthetic videos used for training degrade 3D detection by ~4% and worsen open-loop planning by ~20%
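The two-stage pipeline above can be sketched in miniature. The snippet below is a toy illustration, not the authors' implementation: the quadratic losses, the `tau` quality threshold, and all function names are hypothetical stand-ins for the real reverse-diffusion and target-alignment objectives. It shows the core control flow the paper describes: a quality gate that only follows the attack gradient while a surrogate reverse-diffusion loss stays below a calibrated threshold, plus MI-FGSM-style momentum accumulation over target-aligned gradients.

```python
import numpy as np

def attack_condition(cond, target, steps=50, alpha=0.01, mu=0.9,
                     eps=0.1, tau=0.05):
    """Toy two-stage conditioned attack (all names/losses hypothetical).

    Stage 1 (quality-preserving guidance): while the surrogate
    reverse-diffusion loss is above `tau`, shrink the perturbation.
    Stage 2 (momentum-guided step): otherwise accumulate target-aligned
    gradients with momentum and take a signed step, clipped to an
    L-infinity ball of radius `eps` around the clean condition.
    """
    delta = np.zeros_like(cond)          # perturbation on the condition channel
    momentum = np.zeros_like(cond)       # accumulated gradient direction
    for _ in range(steps):
        x = cond + delta
        # surrogate "reverse-diffusion" quality loss and its gradient
        q_loss = 0.5 * np.mean(delta ** 2)
        q_grad = delta / delta.size
        # target-alignment loss gradient: pull conditioned output toward target
        t_grad = (x - target) / x.size
        # quality gate: follow the attack gradient only while quality holds
        grad = t_grad if q_loss < tau else q_grad
        # momentum accumulation over normalized gradients (MI-FGSM style)
        momentum = mu * momentum + grad / (np.abs(grad).mean() + 1e-12)
        # signed descent step, kept inside the perturbation budget
        delta = np.clip(delta - alpha * np.sign(momentum), -eps, eps)
    return cond + delta
```

In the real attack the gradients would come from backpropagating through the diffusion model's denoising trajectory with respect to the HDMap/3D-box condition embeddings; the quadratics here exist only to make the gating-and-momentum loop runnable.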

🛡️ Threat Analysis

Input Manipulation Attack

PhysCond-WMA crafts adversarial perturbations on physical-condition input channels (HDMap embeddings, 3D-box features) to manipulate the diffusion-based world model's outputs at inference time, inducing semantic and decision-level distortion — a classic input manipulation attack adapted for conditional generative models.


Details

Domains
vision, generative
Model Types
diffusion
Threat Tags
white_box, inference_time, targeted, digital
Datasets
nuScenes
Applications
autonomous driving world models, driving video synthesis, autonomous driving planning