When World Models Dream Wrong: Physical-Conditioned Adversarial Attacks against World Models
Zhixiang Guo 1, Siyuan Liang 1, Andras Balogh 2, Noah Lunberry 1, Rong-Cheng Tu 1, Mark Jelasity 2, Dacheng Tao 1
Published on arXiv
2602.18739
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Achieves a 0.55 attack success rate in the targeted setting while increasing FID by only ~9% and FVD by ~3.9%; training downstream models on the attacked videos worsens open-loop planning performance by ~20%.
PhysCond-WMA
Novel technique introduced
Generative world models (WMs) are increasingly used to synthesize controllable, sensor-conditioned driving videos, yet their reliance on physical priors exposes novel attack surfaces. In this paper, we present Physical-Conditioned World Model Attack (PhysCond-WMA), the first white-box world model attack that perturbs physical-condition channels, such as HDMap embeddings and 3D-box features, to induce semantic, logic, or decision-level distortion while preserving perceptual fidelity. PhysCond-WMA is optimized in two stages: (1) a quality-preserving guidance stage that constrains the reverse-diffusion loss below a calibrated threshold, and (2) a momentum-guided denoising stage that accumulates target-aligned gradients along the denoising trajectory for stable, temporally coherent semantic shifts. Extensive experiments demonstrate that our approach remains effective while increasing FID by only about 9% and FVD by about 3.9% on average. Under the targeted attack setting, the attack success rate (ASR) reaches 0.55. Downstream studies further show tangible risk: using attacked videos for training decreases 3D detection performance by about 4% and worsens open-loop planning performance by about 20%. These findings reveal and quantify, for the first time, security vulnerabilities in generative world models, motivating more comprehensive security checks.
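The two-stage optimization described in the abstract can be sketched with toy stand-ins. Everything below is an illustrative assumption, not the paper's actual pipeline: the "world model" is a single linear map, the reverse-diffusion quality loss and targeted loss are quadratics, and the threshold `tau`, momentum factor `mu`, and step size are arbitrary hyperparameters.

```python
import numpy as np

# Toy stand-ins (assumptions, not the paper's model): the "world model" is a
# linear map W over a condition embedding c; the quality loss measures drift
# from the clean generation, and the targeted loss measures distance to the
# attacker's desired semantics.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.3
c = rng.standard_normal(8)                    # clean physical-condition embedding
x_clean = W @ c                               # nominal generation
x_target = x_clean + rng.standard_normal(8)   # attacker's target semantics

def quality_loss(delta):
    # Stage-1 proxy for the reverse-diffusion / fidelity loss.
    r = W @ (c + delta) - x_clean
    return float(r @ r)

def target_loss_grad(delta):
    # Targeted loss and its analytic gradient w.r.t. the perturbation.
    r = W @ (c + delta) - x_target
    return float(r @ r), 2.0 * W.T @ r

delta = np.zeros(8)
momentum = np.zeros(8)
tau = 0.5            # calibrated quality threshold (stage 1 constraint)
mu, lr = 0.9, 0.05   # momentum factor and step size (arbitrary choices)
for _ in range(200):
    # Stage 2: accumulate normalized, target-aligned gradients with momentum.
    _, g = target_loss_grad(delta)
    momentum = mu * momentum + g / (np.linalg.norm(g) + 1e-12)
    delta = delta - lr * momentum
    # Stage 1: shrink the perturbation whenever fidelity degrades past tau.
    while quality_loss(delta) > tau:
        delta *= 0.9

final_target, _ = target_loss_grad(delta)
print(round(quality_loss(delta), 3), round(final_target, 3))
```

Normalizing the gradient before accumulating momentum (as in MI-FGSM-style attacks) keeps the update direction stable across steps, while the quality check caps perceptual drift, mirroring the paper's trade-off between attack success and fidelity.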
Key Contributions
- PhysCond-WMA: the first white-box adversarial attack targeting physical conditioning channels (HDMap, 3D-box) of diffusion-based generative world models for autonomous driving
- Two-stage optimization pipeline: a quality-preserving guidance stage (reverse-diffusion loss threshold) and a momentum-guided denoising stage for stable, temporally coherent semantic shifts
- Quantitative demonstration of downstream risks — attacked synthetic videos used for training degrade 3D detection by ~4% and worsen open-loop planning by ~20%
🛡️ Threat Analysis
PhysCond-WMA crafts adversarial perturbations on physical-condition input channels (HDMap embeddings, 3D-box features) to manipulate the diffusion-based world model's outputs at inference time, inducing semantic and decision-level distortion — a classic input manipulation attack adapted for conditional generative models.
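As a minimal illustration of the generic input-manipulation pattern (again a hypothetical sketch, not the paper's method), the following perturbs a conditioning vector inside an L-infinity ball to push a toy linear decision head toward flipping its output; the head, the feature vector, and all bounds are made-up stand-ins:

```python
import numpy as np

# Hypothetical sketch: an L-infinity-bounded perturbation on a conditioning
# vector (e.g. a stand-in for a 3D-box feature) that steers a toy linear
# decision head. The head and all parameters are illustrative assumptions.
rng = np.random.default_rng(1)
w, b = rng.standard_normal(16), 0.0   # toy decision head
cond = rng.standard_normal(16)        # clean conditioning vector

def decision(x):
    return 1 if x @ w + b > 0 else 0

eps, alpha, steps = 0.5, 0.1, 40      # perturbation budget and step size
orig = decision(cond)
delta = np.zeros_like(cond)
for _ in range(steps):
    # For a linear head the input gradient is just w; step toward
    # flipping the original decision, then project back into the ball.
    g = w if orig == 0 else -w
    delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
    if decision(cond + delta) != orig:
        break

print(orig, decision(cond + delta))
```

The projection step is what makes this an input manipulation attack in the ML01 sense: the perturbed condition stays within a small, hard-to-notice budget of the legitimate input while steering the model's behavior.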