Progress in video generation may soon make it possible to evaluate robot policies in a completely learned world model. An end-to-end learned simulator of millions of robot environments would greatly accelerate progress in general-purpose robotics and provide a useful signal for scaling data and compute.
To accelerate progress in learned simulators for robots, we're announcing the 1X World Model Challenge, where the task is to predict future first-person observations of the EVE Android. We provide over 100 hours of vector-quantized image tokens and raw actions collected from operating EVE at 1X offices, a baseline GENIE-style world model, and a frame-level MAGVIT2 autoencoder that compresses images into 16x16 token grids and decodes them back into images.
We hope that this dataset will be helpful to roboticists who want to experiment with a diverse set of general-purpose robotics data in human environments. A sufficiently powerful world model will allow anyone to access a "neurally-simulated EVE". The evaluation challenge is the ultimate goal, and we have cash prizes for intermediate goals like fitting the data well (compression challenge) and sampling plausible videos (sampling challenge).
Each example is a sequence of 16 first-person images from the robot at 2Hz (so 8 seconds total), and your goal is to predict the next image given the previous ones.
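The example layout above can be sketched as follows. This is an illustrative numpy snippet with random stand-in tokens, not the repo's actual data-loading code:

```python
# Minimal sketch of one training example: a window of 16 frames at 2 Hz
# (8 seconds), each frame a 16x16 grid of discrete MAGVIT2 tokens.
import numpy as np

WINDOW, H, W = 16, 16, 16   # frames per example, token grid per frame
VOCAB = 2 ** 18             # token vocabulary (factorized as 2 x 2^9 in practice)

rng = np.random.default_rng(0)
window = rng.integers(0, VOCAB, size=(WINDOW, H, W))  # stand-in for real tokens

# next-frame prediction: condition on the first 15 frames, predict the last
context, target = window[:-1], window[-1]
```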
- Compression Challenge ($10k prize): Predict the discrete distribution of tokens in the next image.
  - Criteria: Be the first to achieve a temporally teacher-forced loss below 8.0 on our private test set.
- Sampling Challenge ($10k prize): Future prediction methods are not necessarily restricted to next-logit prediction. You can, for example, use methods like GANs, diffusion, and MaskGIT to generate future images. Criteria will be released shortly.
- Evaluation Challenge (upcoming): Given a set of N policies $\pi_1, \pi_2, \ldots, \pi_N$, where each policy $\pi_i(a_t|z_t)$ predicts action tokens from image tokens, can you evaluate all of the policies inside a "world model" $p(z_{t+1}|z_t, a_t)$ and tell us the ranked order of which policy is best?
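As a concrete illustration of the evaluation challenge, here is a minimal, hypothetical sketch. `world_model`, `score`, and the toy policies below are stand-ins for a learned $p(z_{t+1}|z_t, a_t)$ and a task metric; none of these names exist in this repo:

```python
# Hypothetical evaluation loop: roll each policy out inside a world model
# and rank policies by accumulated score. Everything here is illustrative.
def rank_policies(policies, world_model, score, z0, horizon=16):
    results = []
    for i, policy in enumerate(policies):
        z, total = z0, 0.0
        for _ in range(horizon):
            a = policy(z)          # pi_i(a_t | z_t)
            z = world_model(z, a)  # p(z_{t+1} | z_t, a_t)
            total += score(z)
        results.append((total, i))
    # indices of policies, best first
    return [i for _, i in sorted(results, reverse=True)]
```

With toy scalar "states" this reduces to ranking policies by how much reward they accumulate inside the simulated rollout.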
These challenges are largely inspired by the commavq compression challenge. Please read the Additional Challenge Details before submitting.
We require Python 3.10 or later. This code was tested with Python 3.10.12.
# Install dependencies and download data
./build.sh
# Source the Python environment
source venv/bin/activate
This repo provides an implementation of the spatio-temporal transformer and MaskGIT sampler as described in Genie: Generative Interactive Environments. Note that this implementation only trains on video sequences, not actions (though it is trivial to add this via an additive embedding).
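The additive action conditioning mentioned in that parenthetical can be sketched as follows. Shapes, names, and the discrete action vocabulary are illustrative assumptions, not the repo's actual code:

```python
# Sketch of additive action conditioning: embed each frame's action and add it
# to that frame's token embeddings before the transformer. Illustrative only.
import numpy as np

T, H, W, D = 16, 16, 16, 256        # frames, token grid, embedding dim
N_ACTIONS = 32                      # hypothetical discrete action vocabulary

rng = np.random.default_rng(0)
token_emb = rng.normal(size=(T, H * W, D))      # per-token embeddings
action_table = rng.normal(size=(N_ACTIONS, D))  # learned action embedding table
actions = rng.integers(0, N_ACTIONS, size=T)    # one action id per frame

# broadcast each frame's action embedding across all 256 tokens of that frame
conditioned = token_emb + action_table[actions][:, None, :]
```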
# Train the GENIE model
python train.py --genie_config genie/configs/magvit_n32_h8_d256.json --output_dir data/genie_model --max_eval_steps 10
# Generate frames from trained model
python genie/generate.py --checkpoint_dir data/genie_model/final_checkpt
# Visualize generated frames
python visualize.py --token_dir data/genie_generated
# Evaluate the trained model
python genie/evaluate.py --checkpoint_dir data/genie_model/final_checkpt
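The generation script above uses the MaskGIT-style sampler mentioned earlier. A rough numpy sketch of that iterative decoding idea is below; `predict_probs` is a stand-in for the transformer's per-token distributions, and the unmasking schedule is a simplification of the real sampler:

```python
# MaskGIT-style decoding sketch: start fully masked, and over a few steps keep
# the most confident predictions while re-masking the rest. Illustrative only.
import numpy as np

def maskgit_decode(predict_probs, n_tokens, steps=2):
    tokens = np.full(n_tokens, -1)                # -1 marks masked positions
    for step in range(steps):
        probs = predict_probs(tokens)             # (n_tokens, vocab)
        conf = probs.max(axis=1)                  # confidence per position
        pred = probs.argmax(axis=1)               # greedy (temperature-0) pick
        masked = tokens == -1
        # unmask a growing fraction of the most confident masked positions
        n_keep = int(np.ceil(masked.sum() * (step + 1) / steps))
        order = np.argsort(-np.where(masked, conf, -np.inf))
        tokens[order[:n_keep]] = pred[order[:n_keep]]
    return tokens
```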
We provide two pre-trained GENIE models, linked in the leaderboard.
# Generate and visualize
output_dir='data/genie_baseline_generated'
for i in {0..240..10}; do
  python genie/generate.py --checkpoint_dir 1x-technologies/GENIE_138M \
    --output_dir $output_dir --example_ind $i --maskgit_steps 2 --temperature 0
  python visualize.py --token_dir $output_dir
  mv $output_dir/generated_offset0.gif $output_dir/example_$i.gif
  mv $output_dir/generated_comic_offset0.png $output_dir/example_$i.png
done
# Evaluate
python genie/evaluate.py --checkpoint_dir 1x-technologies/GENIE_138M --maskgit_steps 2
See the Dataset Card on Huggingface.
The training dataset is stored in the data/train_v1.1 directory.
Please read the Additional Challenge Details first for clarification on rules.
Email source code + build script + some info about your approach to [email protected]. We will evaluate your submission on our held-out dataset and email you back with the results.
Please send us the following:
- your chosen username (can be your real name or a pseudonym, will be tied 1:1 to your email)
- source code as a .zip file
- how many flops you used (approximately) to train the model
- any external data you may have used to train your model
- eval performance you got on the provided validation set (so we know roughly what you expect from your model)
After manually reviewing your code, we run evals in an Ubuntu 22.04 + CUDA 12.3 sandboxed environment like so:
./build.sh # installs any dependencies + model weights you need
./evaluate.py --val_data_dir <PATH-TO-HELD-OUT-DATA> # runs your model on held-out data
- We've provided magvit2.ckpt in the dataset download, which contains the weights for a MAGVIT2 encoder/decoder. The encoder allows you to tokenize external data to try to improve the metric.
- The loss metric is nonstandard compared to LLMs due to the vocabulary size of the image tokens, and was changed as of the v1.0 release (Jul 8, 2024). Instead of computing cross entropy loss on logits with 2^18 classes, we compute cross entropy losses on 2x 2^9-class predictions and sum them. The rationale is that the large vocabulary size (2^18) makes it very memory-intensive to store a logit tensor of size (B, 2^18, T, 16, 16). Therefore, the compression challenge considers families of models with factorized pmfs of the form p(x1, x2) = p(x1)p(x2). For the sampling and evaluation challenges, a factorized pmf is a necessary criterion.
- For the compression challenge, we are making the deliberate choice to evaluate held-out data on the same factorized distribution p(x1, x2) = p(x1)p(x2) that we train on. Although unfactorized models of the form p(x1, x2) = f(x1, x2) ought to achieve lower cross entropy on test data by exploiting the off-block-diagonal terms of Cov(x1, x2), we want to encourage solutions that achieve lower losses while holding the factorization fixed.
- For the compression challenge, submissions may only use actions up to the current prompt frame. Submissions can predict subsequent actions autoregressively to improve performance, but these actions will not be provided with the prompt.
- Naive nearest-neighbor retrieval + seeking ahead to next frames from the training set will achieve reasonably good losses and sampling results on the dev-validation set, because there are similar sequences in the training set. However, we explicitly forbid these kinds of solutions (and the private test set penalizes these kinds of solutions).
- We will not be able to award prizes to individuals in U.S. sanctioned countries. We reserve the right to not award a prize if it violates the spirit of the challenge.
There are different scenarios for evaluation, which vary in the degree of ground truth context the model receives. In decreasing order of context, these scenarios are:
- Fully Autoregressive: the model receives a predetermined number of ground truth frames and autoregressively predicts all remaining frames.
- Temporally Teacher-forced: the model receives all ground truth frames before the current frame and autoregressively predicts all tokens in the current frame.
- Fully Teacher-forced: the model receives all ground truth tokens before the current token, including tokens in the current frame. Only applicable for causal LMs.
As an example, consider predicting the final token of a video, corresponding to the lower right patch of frame 15. The context the model receives in each scenario is:
- Fully Autoregressive: the first $t$x16x16 tokens are ground truth tokens corresponding to the first $t$ prompt frames, and all remaining tokens are autoregressively generated, where $0 < t < 15$ is the predetermined number of prompt frames.
- Temporally Teacher-forced: the first 15x16x16 tokens are ground truth tokens corresponding to the first 15 frames, and all remaining tokens are autoregressively generated.
- Fully Teacher-forced: all previous (16x16x16 - 1) tokens are ground truth tokens.
The compression challenge uses the "temporally teacher-forced" scenario.
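The context sizes listed above can be checked with a few lines of arithmetic; `t=8` below is just one example choice of prompt length:

```python
# Ground-truth context available when predicting the final token of frame 15
# in a 16-frame video tokenized as a 16x16 grid per frame.
TOKENS_PER_FRAME = 16 * 16  # 256 tokens per frame

def context_tokens(scenario, t=8):
    if scenario == "fully_autoregressive":
        return t * TOKENS_PER_FRAME          # t prompt frames, 0 < t < 15
    if scenario == "temporally_teacher_forced":
        return 15 * TOKENS_PER_FRAME         # every frame before frame 15
    if scenario == "fully_teacher_forced":
        return 16 * TOKENS_PER_FRAME - 1     # everything but the target token
    raise ValueError(scenario)
```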
These are evaluation results on data/val_v1.1.
| User | Temporally Teacher-forced CE Loss | Temporally Teacher-forced Token Accuracy | Temporally Teacher-forced LPIPS | Generation Time* (secs/frame) |
|---|---|---|---|---|
| 1x-technologies/GENIE_138M (`--maskgit_steps 2`) | 8.79 | 0.0320 | 0.207 | 0.075 |
| 1x-technologies/GENIE_35M (`--maskgit_steps 2`) | 8.99 | 0.0301 | 0.217 | 0.030 |
Beyond the World Model Challenge, we also want to make the challenges and datasets more useful for your research questions. Want more data interacting with humans? More safety-critical tasks like carrying cups of hot coffee without spilling? More dexterous tool use? Robots working with other robots? Robots dressing themselves in the mirror? Think of 1X as the operations team for getting you high quality humanoid data in extremely diverse scenarios.
Email [email protected] with your requests (and why you think the data is important) and we will try to include it in a future data release. You can also discuss your data questions with the community on Discord.
We also welcome donors to help us increase the bounty.
If you use this software or dataset in your work, please cite it using the "Cite this repository" button on Github.
- v1.1 - Release compression challenge criteria; removed pauses and discontinuous videos from dataset; higher image crop.
- v1.0 - More efficient MAGVIT2 tokenizer with 16x16 (C=2^18) mapping to 256x256 images, providing raw action data.
- v0.0.1 - Initial challenge release with 20x20 (C=1000) image tokenizer mapping to 160x160 images.
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
| property | value |
|---|---|
| name | 1X World Model Challenge |
| url | https://github.com/1x-technologies/1xgpt |
| description | A dataset of over 100 hours of compressed image tokens + raw actions across a fleet of EVE robots. |
| provider | |
| license | |