When World Models Dream Wrong: Physical-Conditioned Adversarial Attacks against World Models
Zhixiang Guo 1, Siyuan Liang 1, Andras Balogh 2, Noah Lunberry 1, Rong-Cheng Tu 1, Mark Jelasity 2, Dacheng Tao 1
Published on arXiv
2602.18739
Input Manipulation Attack
OWASP ML Top 10 — ML01
Key Finding
Achieves a 0.55 attack success rate in the targeted setting while increasing FID by only ~9% and FVD by ~3.9%; training downstream models on the attacked videos worsens open-loop planning performance by ~20%.
PhysCond-WMA
Novel technique introduced
Generative world models (WMs) are increasingly used to synthesize controllable, sensor-conditioned driving videos, yet their reliance on physical priors exposes novel attack surfaces. In this paper, we present Physical-Conditioned World Model Attack (PhysCond-WMA), the first white-box world model attack that perturbs physical-condition channels, such as HDMap embeddings and 3D-box features, to induce semantic, logic, or decision-level distortion while preserving perceptual fidelity. PhysCond-WMA is optimized in two stages: (1) a quality-preserving guidance stage that constrains the reverse-diffusion loss below a calibrated threshold, and (2) a momentum-guided denoising stage that accumulates target-aligned gradients along the denoising trajectory for stable, temporally coherent semantic shifts. Extensive experiments demonstrate that our approach remains effective while increasing FID by only about 9% and FVD by about 3.9% on average. Under the targeted attack setting, the attack success rate (ASR) reaches 0.55. Downstream studies further show tangible risk: using attacked videos for training decreases 3D detection performance by about 4% and worsens open-loop planning performance by about 20%. These findings reveal and quantify, for the first time, security vulnerabilities in generative world models, motivating more comprehensive security checks.
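The two-stage optimization described in the abstract can be sketched with toy stand-ins. Everything below is an illustrative assumption, not the paper's actual pipeline: the "world model" is a single linear map, the reverse-diffusion quality loss and targeted loss are quadratics, and the threshold `tau`, momentum factor `mu`, and step size are arbitrary hyperparameters.

```python
import numpy as np

# Toy stand-ins (assumptions, not the paper's model): the "world model" is a
# linear map W over a condition embedding c; the quality loss measures drift
# from the clean generation, and the targeted loss measures distance to the
# attacker's desired semantics.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.3
c = rng.standard_normal(8)                    # clean physical-condition embedding
x_clean = W @ c                               # nominal generation
x_target = x_clean + rng.standard_normal(8)   # attacker's target semantics

def quality_loss(delta):
    # Stage-1 proxy for the reverse-diffusion / fidelity loss.
    r = W @ (c + delta) - x_clean
    return float(r @ r)

def target_loss_grad(delta):
    # Targeted loss and its analytic gradient w.r.t. the perturbation.
    r = W @ (c + delta) - x_target
    return float(r @ r), 2.0 * W.T @ r

delta = np.zeros(8)
momentum = np.zeros(8)
tau = 0.5            # calibrated quality threshold (stage 1 constraint)
mu, lr = 0.9, 0.05   # momentum factor and step size (arbitrary choices)
for _ in range(200):
    # Stage 2: accumulate normalized, target-aligned gradients with momentum.
    _, g = target_loss_grad(delta)
    momentum = mu * momentum + g / (np.linalg.norm(g) + 1e-12)
    delta = delta - lr * momentum
    # Stage 1: shrink the perturbation whenever fidelity degrades past tau.
    while quality_loss(delta) > tau:
        delta *= 0.9

final_target, _ = target_loss_grad(delta)
print(round(quality_loss(delta), 3), round(final_target, 3))
```

Normalizing the gradient before accumulating momentum (as in MI-FGSM-style attacks) keeps the update direction stable across steps, while the quality check caps perceptual drift, mirroring the paper's trade-off between attack success and fidelity.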
Key Contributions
- PhysCond-WMA: the first white-box adversarial attack targeting physical conditioning channels (HDMap, 3D-box) of diffusion-based generative world models for autonomous driving
- Two-stage optimization pipeline: a quality-preserving guidance stage (reverse-diffusion loss threshold) and a momentum-guided denoising stage for stable, temporally coherent semantic shifts
- Quantitative demonstration of downstream risks — attacked synthetic videos used for training degrade 3D detection by ~4% and worsen open-loop planning by ~20%
🛡️ Threat Analysis
PhysCond-WMA crafts adversarial perturbations on physical-condition input channels (HDMap embeddings, 3D-box features) to manipulate the diffusion-based world model's outputs at inference time, inducing semantic and decision-level distortion — a classic input manipulation attack adapted for conditional generative models.
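As a minimal illustration of the generic input-manipulation pattern (again a hypothetical sketch, not the paper's method), the following perturbs a conditioning vector inside an L-infinity ball to push a toy linear decision head toward flipping its output; the head, the feature vector, and all bounds are made-up stand-ins:

```python
import numpy as np

# Hypothetical sketch: an L-infinity-bounded perturbation on a conditioning
# vector (e.g. a stand-in for a 3D-box feature) that steers a toy linear
# decision head. The head and all parameters are illustrative assumptions.
rng = np.random.default_rng(1)
w, b = rng.standard_normal(16), 0.0   # toy decision head
cond = rng.standard_normal(16)        # clean conditioning vector

def decision(x):
    return 1 if x @ w + b > 0 else 0

eps, alpha, steps = 0.5, 0.1, 40      # perturbation budget and step size
orig = decision(cond)
delta = np.zeros_like(cond)
for _ in range(steps):
    # For a linear head the input gradient is just w; step toward
    # flipping the original decision, then project back into the ball.
    g = w if orig == 0 else -w
    delta = np.clip(delta + alpha * np.sign(g), -eps, eps)
    if decision(cond + delta) != orig:
        break

print(orig, decision(cond + delta))
```

The projection step is what makes this an input manipulation attack in the ML01 sense: the perturbed condition stays within a small, hard-to-notice budget of the legitimate input while steering the model's behavior.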