# DreamZero-DROID: World Action Models are Zero-shot Policies

DreamZero-DROID is a 14B-parameter World Action Model (WAM) checkpoint trained from scratch using only the DROID dataset. Unlike traditional Vision-Language-Action (VLA) models, DreamZero learns physical dynamics by predicting future world states and actions jointly, using video as a dense representation of how the world evolves.

This checkpoint demonstrates the strength of video-model backbones for generalist robot policies, achieving strong zero-shot performance on unseen tasks without requiring pretraining on massive robot datasets.
## Model Details
- Architecture: World Action Model (WAM) built upon a pretrained video diffusion backbone (Wan2.1-I2V-14B-480P).
- Parameters: 14 Billion
- Inputs: Visual context (via VAE), language instructions (via text encoder), and proprioceptive state.
- Outputs: Joint autoregressive prediction of future video frames and robot actions.
- Training Data: Trained exclusively on the DROID Dataset (Distributed Robot Interaction Dataset), utilizing ~75k episodes with language annotations.
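The input/output structure above can be sketched as a single prediction step. This is a minimal illustrative stand-in, not the released API: the function name, shapes, and latent dimensions are all hypothetical, and the body returns placeholder values rather than a real forward pass.

```python
import numpy as np

def wam_step(video_latents, text_embedding, proprio_state,
             horizon=8, action_dim=7, rng=None):
    """Hypothetical sketch of one World Action Model prediction step:
    consume visual context (VAE latents), a language embedding, and a
    proprioceptive state, then jointly emit future video latents and an
    action chunk. Placeholder outputs only; shapes are assumptions."""
    rng = rng if rng is not None else np.random.default_rng(0)
    t, c = video_latents.shape  # context frames x latent channels
    # Joint autoregressive prediction: `horizon` future latent frames
    # plus a matching chunk of robot actions.
    future_latents = rng.standard_normal((horizon, c))
    actions = rng.standard_normal((horizon, action_dim))
    return future_latents, actions

# Dummy inputs: 4 context latent frames of dim 16, a 32-d text
# embedding, and an 8-d proprioceptive state (all shapes assumed).
latents, acts = wam_step(np.zeros((4, 16)), np.zeros(32), np.zeros(8))
```

The key structural point is that video and actions come out of the same prediction step, which is what lets the predicted actions stay aligned with the generated future frames.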
## Capabilities
- Zero-Shot Generalization: Delivers more than a 2x improvement in generalization to new tasks and novel environments compared to state-of-the-art VLAs in real robot experiments.
- Real-Time Execution: Through model and system optimizations (DreamZero-Flash), this 14B model is capable of real-time closed-loop control at ~7Hz.
- Joint Video & Action: Learns diverse skills effectively from heterogeneous robot data without relying on highly repetitive demonstrations. Predicted actions closely align with the generated future video states.
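Closed-loop control at ~7 Hz means the observe-predict-act cycle must fit inside a fixed control period. A minimal sketch of such a fixed-rate loop, with a dummy policy standing in for the model's forward pass (all names here are hypothetical, not part of DreamZero-Flash):

```python
import time

CONTROL_HZ = 7              # approximate rate reported for DreamZero-Flash
PERIOD = 1.0 / CONTROL_HZ   # time budget per control step, ~143 ms

def dummy_policy(obs):
    # Hypothetical stand-in for the WAM forward pass: returns a
    # fixed 7-DoF action instead of a real prediction.
    return [0.0] * 7

def run_control_loop(policy, get_obs, apply_action, steps=3):
    """Observe, predict, act, then sleep out the remainder of the
    control period so the loop holds a fixed rate."""
    for _ in range(steps):
        start = time.monotonic()
        action = policy(get_obs())
        apply_action(action)
        elapsed = time.monotonic() - start
        if elapsed < PERIOD:
            time.sleep(PERIOD - elapsed)

applied = []
run_control_loop(dummy_policy, lambda: None, applied.append, steps=3)
```

In a real deployment the forward pass itself must finish well under the period; if inference overruns the budget, the loop simply runs slower than the target rate rather than dropping actions.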
## How to Use
To use this checkpoint for distributed inference or simulation evaluation, please refer to the main GitHub repository.
1. Download the checkpoint via the Hugging Face CLI:

   ```shell
   huggingface-cli download GEAR-Dreams/DreamZero-DROID --repo-type model --local-dir ./checkpoints/DreamZero-DROID
   ```