CLIPort: What and Where Pathways for Robotic Manipulation
Mohit Shridhar, Lucas Manuelli, Dieter Fox
CoRL 2021
CLIPort is an end-to-end imitation-learning agent that can learn a single language-conditioned policy for various tabletop tasks. The framework combines the broad semantic understanding (what) of CLIP with the spatial precision (where) of TransporterNets to learn generalizable skills from limited training demonstrations.
For the latest updates, see: cliport.github.io
- Getting Started: Installation, Quick Tutorial, Checkpoints & Objects, Hardware Requirements, Model Card
- Data Generation: Dataset, Tasks
- Training & Evaluation: Single Task, Multi Task
- Miscellaneous: Notebooks, Docker Guide, Disclaimers, Real-Robot Training FAQ, Recording Videos
- References: Citations, Acknowledgements
Clone Repo:
git clone https://github.com/cliport/cliport.git
Setup virtualenv and install requirements:
# setup virtualenv with whichever package manager you prefer
virtualenv -p $(which python3.8) --system-site-packages cliport_env
source cliport_env/bin/activate
pip install --upgrade pip
cd cliport
pip install -r requirements.txt
export CLIPORT_ROOT=$(pwd)
python setup.py develop
Note: You might need versions of torch==1.7.1 and torchvision==0.8.2 that are compatible with your CUDA and hardware.
A quick tutorial on evaluating a pre-trained multi-task model.
Download a pre-trained checkpoint for multi-language-conditioned trained with 1000 demos:
sh scripts/quickstart_download.sh
Generate a small test set of 10 instances for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:
python cliport/demos.py n=10 \
task=stack-block-pyramid-seq-seen-colors \
mode=test
This will take a few minutes to finish.
Evaluate the best validation checkpoint for stack-block-pyramid-seq-seen-colors on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=10 \
train_demos=1000 \
exp_folder=cliport_quickstart \
checkpoint_type=test_best \
update_results=True \
disp=True
If you are on a headless machine, turn off the visualization with disp=False.
You can evaluate the same multi-language-conditioned model on other tasks. First generate a val set for the task and then specify eval_task=<task_name> with mode=val and checkpoint_type=val_missing (the quickstart doesn't include validation results for all tasks; download all task results from here).
Check out affordance.ipynb to visualize affordance predictions of cliport on various tasks.
Download center-of-mass (COM) corrected Google Scanned Objects:
sh scripts/google_objects_download.sh
Credit: Google.
This Google Drive Folder contains pre-trained multi-language-conditioned checkpoints for n=1,10,100,1000 and validation/test result JSONs for all tasks. The *val-results.json files contain the name of the best checkpoint (from validation) to be evaluated on the test set.
Note: Google Drive might complain about bandwidth restrictions. I recommend using rclone with API access enabled.
Evaluate the best validation checkpoint on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=10 \
train_demos=100 \
exp_folder=cliport_exps \
checkpoint_type=test_best \
update_results=True \
disp=True
A single NVIDIA GPU with 8.5 to 9.5GB memory should be sufficient for training and evaluation.
Tested with:
- GPU - NVIDIA P100
- CPU - Intel Xeon (Quad Core)
- RAM - 32GB
- OS - Ubuntu 16.04, 18.04
The following is a guide for training everything from scratch. All tasks follow a 4-phase workflow:
- Generate train, val, test datasets with demos.py
- Train agents with train.py
- Run validation with eval.py to find the best checkpoint on val tasks and save *val-results.json
- Evaluate the best checkpoint in *val-results.json on test tasks with eval.py
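The four phases above can be sketched as a small Python helper that only assembles the command lines, without running anything. This is a rough outline rather than part of the repository: `build_workflow` is a hypothetical name, the flags are copied from the single-task commands in this README, and the test-set size of 100 is an assumption based on the n_demos=100 used in the test evaluation command.

```python
def build_workflow(task, n_train, exp_folder="exps"):
    """Return the four phases as (phase, argv) pairs. Nothing is executed."""

    def demos(mode, n):
        # Phase 1: dataset generation with demos.py
        return ["python", "cliport/demos.py", f"n={n}", f"task={task}", f"mode={mode}"]

    def evaluate(mode, checkpoint_type):
        # Phases 3 and 4: validation and test both go through eval.py
        return [
            "python", "cliport/eval.py", f"eval_task={task}", "agent=cliport",
            f"mode={mode}", "n_demos=100", f"train_demos={n_train}",
            f"checkpoint_type={checkpoint_type}", f"exp_folder={exp_folder}",
        ]

    # Phase 2: training flags copied from the single-task example
    train = [
        "python", "cliport/train.py", f"train.task={task}", "train.agent=cliport",
        "train.attn_stream_fusion_type=add", "train.trans_stream_fusion_type=conv",
        "train.lang_fusion_type=mult", f"train.n_demos={n_train}",
        "train.n_steps=201000", f"train.exp_folder={exp_folder}",
        "dataset.cache=False",
    ]

    return [
        ("generate", demos("train", n_train)),
        ("generate", demos("val", 100)),
        ("generate", demos("test", 100)),  # assumed size, see lead-in
        ("train", train),
        ("validate", evaluate("val", "val_missing")),
        ("test", evaluate("test", "test_best")),
    ]


if __name__ == "__main__":
    for phase, argv in build_workflow("stack-block-pyramid-seq-seen-colors", 1000):
        print(f"[{phase}]", " ".join(argv))
```

Printing the commands first makes it easy to check the order of the phases before committing to a long run.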
Generate a train set of 1000 demonstrations for stack-block-pyramid-seq-seen-colors inside $CLIPORT_ROOT/data:
python cliport/demos.py n=1000 \
task=stack-block-pyramid-seq-seen-colors \
mode=train
You can also do a sequential sweep with -m and comma-separated params task=towers-of-hanoi-seq-seen-colors,stack-block-pyramid-seq-seen-colors. Use disp=True to visualize the data generation.
Run generate_dataset.sh to generate the full dataset and save it to $CLIPORT_ROOT/data:
sh scripts/generate_dataset.sh data
Note: This script is not parallelized and will take a long time (maybe days) to finish. The full dataset requires ~1.6TB of storage, which includes both language-conditioned and demo-conditioned (original TransporterNets) tasks. It's recommended that you start with single-task training if you don't have enough storage space.
Make sure you have a train (n demos) and val (100 demos) set for the task you want to train on.
Train a cliport agent with 1000 demonstrations on the stack-block-pyramid-seq-seen-colors task for 200K iterations:
python cliport/train.py train.task=stack-block-pyramid-seq-seen-colors \
train.agent=cliport \
train.attn_stream_fusion_type=add \
train.trans_stream_fusion_type=conv \
train.lang_fusion_type=mult \
train.n_demos=1000 \
train.n_steps=201000 \
train.exp_folder=exps \
dataset.cache=False
Iteratively evaluate all the checkpoints on val and save the results in exps/<task>-train/checkpoints/<task>-val-results.json:
python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=val \
n_demos=100 \
train_demos=1000 \
checkpoint_type=val_missing \
exp_folder=exps
Choose the best checkpoint from validation to run on the test set and save the results in exps/<task>-train/checkpoints/<task>-test-results.json:
python cliport/eval.py eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=100 \
train_demos=1000 \
checkpoint_type=test_best \
exp_folder=exps
Train multi-task models by specifying task=multi-language-conditioned, task=multi-attr-packing-box-pairs-unseen-colors etc.
python cliport/train.py train.task=multi-language-conditioned \
train.agent=cliport \
train.attn_stream_fusion_type=add \
train.trans_stream_fusion_type=conv \
train.lang_fusion_type=mult \
train.n_demos=1000 \
train.n_steps=601000 \
dataset.cache=False \
train.exp_folder=exps \
dataset.type=multi
Important: You need to generate the full dataset of tasks specified in dataset.py before multi-task training or modify the list of tasks here.
Run validation with a trained multi-language-conditioned multi-task model on stack-block-pyramid-seq-seen-colors:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=val \
n_demos=100 \
train_demos=1000 \
checkpoint_type=val_missing \
type=single \
exp_folder=exps
Evaluate the best checkpoint on the test set:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=100 \
train_demos=1000 \
checkpoint_type=test_best \
type=single \
exp_folder=exps
To save high-resolution videos of agent executions, set record.save_video=True:
python cliport/eval.py model_task=multi-language-conditioned \
eval_task=stack-block-pyramid-seq-seen-colors \
agent=cliport \
mode=test \
n_demos=10 \
train_demos=100 \
exp_folder=cliport_exps \
checkpoint_type=test_best \
update_results=True \
disp=True \
record.save_video=True
This will save videos inside ${model_dir}/${exp_folder}/${eval_task}-${agent}-n${train_demos}-train/videos/.
Note: Rendering at high resolutions is super slow and will take a long time to finish.
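To see where the videos end up for a given run, the path template above can be filled in by hand. The sketch below only mimics the substitution with Python's string.Template; it is not how the repository resolves the path, `video_dir` is a hypothetical helper, and the model_dir value in the usage line is made up for illustration.

```python
from string import Template

# Path template copied from the README text above
VIDEO_DIR = Template(
    "${model_dir}/${exp_folder}/${eval_task}-${agent}-n${train_demos}-train/videos/"
)


def video_dir(model_dir, exp_folder, eval_task, agent, train_demos):
    """Fill in the video directory template for one evaluation run."""
    return VIDEO_DIR.substitute(
        model_dir=model_dir,
        exp_folder=exp_folder,
        eval_task=eval_task,
        agent=agent,
        train_demos=train_demos,
    )


if __name__ == "__main__":
    # "/home/user/cliport" is a placeholder, not a path the project defines
    print(video_dir("/home/user/cliport", "cliport_exps",
                    "stack-block-pyramid-seq-seen-colors", "cliport", 100))
```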
- Code Quality Level: Tired grad student.
- Scaling: The code only works for batch size 1. See #issue1 for reference. In theory, there is nothing preventing larger batch sizes other than GPU memory constraints.
- Memory and Storage: There are lots of places where memory usage can be reduced. You don't need 3 copies of the same CLIP ResNet50 and you don't need to save its weights in checkpoints since it's frozen anyway. Dataset sizes could be dramatically reduced with better storage formats and compression.
- Frameworks: There are lots of leftover NumPy bits from when I was trying to reproduce the TransporterNets results. I'll try to clean up when I get some time.
- Rotation Augmentation: All tasks use the same distribution for sampling SE(2) rotation perturbations. This obviously leads to issues with tasks that involve spatial relationships like 'left' or 'forward'.
- Evaluation Runs: In an ideal setting, the evaluation metrics should be averaged over 3 or more repetitions with different seeds. This might be feasible if you are working just with multi-task models.
- Duplicate Training Sets: The train sets of some *seen and *unseen tasks are identical, and only the val and test sets differ for purposes of evaluating generalization performance. So you might not need two duplicate train sets or need to train two separate models.
- Image Resolution: The input resolution of 320 x 160 might be too small for some tasks with tiny objects, especially for packing Google objects. Larger resolutions might help improve legibility.
- Disadvantaged Multi-Task Models: To avoid cheating on packing-seen-google-object-* tasks, the multi-task models are never trained on the full seen split of Google Scanned Objects. So a single-task model trained on packing-seen-google-object-* will have seen more objects than the comparable multi-task model.
- Other Limitations: Check out Appendix I in the paper.
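For the Evaluation Runs point above, averaging over repetitions could look like the following sketch. It is not part of the repository: `summarize_runs` is a hypothetical helper and the scores in the usage line are invented placeholders, not reported results.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) over per-seed success scores."""
    if len(scores) < 3:
        raise ValueError("use at least 3 repetitions with different seeds")
    return statistics.mean(scores), statistics.stdev(scores)


if __name__ == "__main__":
    # Placeholder scores for three seeds, for illustration only
    mean, std = summarize_runs([0.9, 0.8, 1.0])
    print(f"success: {mean:.3f} +/- {std:.3f}")
```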
- CLIP Playground in Colab by Kevin Zakka: A zero-shot object detector with just CLIP. Note that CLIPort does not 'detect objects' but instead directly 'detects actions'.
- Dataset Visualizer: Visualizes raw data and expert labels for pre-generated datasets.
- Affordance Heatmaps: Visualizes affordances from a pre-trained agent like cliport.
- Evaluation Results: Prints success scores from evaluation runs.
Install Docker and NVIDIA Docker.
Modify docker_build.py and docker_run.py to your needs.
Build the image:
python scripts/docker_build.py
Start container:
python scripts/docker_run.py --nvidia_docker
cd ~/cliport
Use scripts/docker_run.py --headless if you are on a headless machine like a remote server or cloud instance.
It depends on the complexity of the task. With 5-10 demonstrations the agent should start to do something useful, but it will often make mistakes by picking the wrong object. For robustness you probably need 50-100 demonstrations. A good way to gauge how much data you might need is to set up a simulated version of the problem and evaluate agents trained with 1, 10, 100, and 1000 demonstrations.
This means either there is some sort of bias in the dataset that the agent is exploiting, or you don't have enough training data. Also make sure that the task is doable: if a referred attribute is barely legible in the input, then it's going to be hard for the agent to figure out what you mean.
CLIPort does not predict height values. You can either: (1) come up with a heuristic based on the heightmap to determine the height position, or (2) train a simple MLP like in TransportNets-6DOF to predict z-values.
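As an illustration of option (1), one possible heuristic is to take the highest heightmap value in a small window around the predicted pixel. This is only a sketch under stated assumptions: `pick_height` and its `radius` parameter are hypothetical, the heightmap is assumed to be a 2D grid of heights indexed as [row][col], and nothing here comes from the CLIPort code.

```python
def pick_height(heightmap, row, col, radius=1):
    """Estimate z at (row, col) as the max height in a (2*radius+1)^2 window.

    The window is clipped at the heightmap borders.
    """
    n_rows, n_cols = len(heightmap), len(heightmap[0])
    window = [
        heightmap[r][c]
        for r in range(max(0, row - radius), min(n_rows, row + radius + 1))
        for c in range(max(0, col - radius), min(n_cols, col + radius + 1))
    ]
    return max(window)


if __name__ == "__main__":
    # Toy 3x3 heightmap with a small object in the middle (values are made up)
    hm = [
        [0.0, 0.0, 0.0],
        [0.0, 0.05, 0.04],
        [0.0, 0.0, 0.0],
    ]
    print(pick_height(hm, 1, 1))
```

Taking the maximum is a conservative choice for placing; a median over the window may be a better fit for picking when the heightmap is noisy.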
Note that CLIPort is not doing "object detection". CLIPort fine-tunes CLIP's representations to "detect actions" in SE(2). CLIP by itself has no understanding of actions or affordances; recognizing and localizing objects (e.g. detecting hammer) does not tell you anything about how to manipulate them (e.g. grasping the hammer by the handle).
The default settings should work well. Although recently, I have been playing around with using FiLM (Perez et al., 2017) to fuse language features inspired by BC-0 (Jang et al., 2021). Qualitatively, it seems like FiLM is better for reading text etc. but I haven't conducted a full quantitative analysis. Try it out yourself with train.agent=two_stream_clip_film_lingunet_lat_transporter (non-residual FiLM).
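For readers unfamiliar with FiLM, the idea is to scale and shift each feature channel with a gamma and beta predicted from the conditioning input (here, language). The pure-Python sketch below shows the non-residual form on toy lists. It is not the implementation behind two_stream_clip_film_lingunet_lat_transporter; the function names, shapes, and weights are all assumptions made for illustration.

```python
def linear(vec, weights, bias):
    """Apply a dense layer: one output per (weight row, bias) pair."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def film(features, lang, w_gamma, b_gamma, w_beta, b_beta):
    """Non-residual FiLM: out[c] = gamma[c] * features[c] + beta[c].

    features: list of channels, each a flat list of activations.
    lang:     language embedding used to predict gamma and beta per channel.
    """
    gamma = linear(lang, w_gamma, b_gamma)
    beta = linear(lang, w_beta, b_beta)
    return [
        [g * x + b for x in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


if __name__ == "__main__":
    # Two channels, a 2-d language embedding, hand-picked toy weights
    out = film(
        features=[[1, 2], [3, 4]],
        lang=[1.0, 0.0],
        w_gamma=[[2, 0], [0, 1]], b_gamma=[0, 1],
        w_beta=[[1, 0], [0, 0]], b_beta=[0, 0],
    )
    print(out)
```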
Ideally, you should create a validation set with heldout instances and then choose the checkpoint with the lowest translation and rotation errors. You can also reuse the training instances but swap the language instructions with unseen goals.
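A minimal sketch of that selection step, assuming you have already computed a translation and a rotation error per checkpoint on the held-out set. The helper name, the checkpoint names, the error values, and the weighted-sum criterion are all assumptions: the text above only says to pick the lowest errors and does not say how to trade the two off.

```python
def best_checkpoint(errors, rot_weight=1.0):
    """Pick the checkpoint with the lowest combined validation error.

    errors maps a checkpoint name to (translation_error, rotation_error).
    rot_weight balances the two terms, which usually have different units.
    """
    return min(errors, key=lambda name: errors[name][0] + rot_weight * errors[name][1])


if __name__ == "__main__":
    # Hypothetical checkpoints and made-up errors, for illustration only
    val_errors = {
        "ckpt_a": (0.030, 12.0),
        "ckpt_b": (0.020, 8.0),
        "ckpt_c": (0.025, 6.0),
    }
    print(best_checkpoint(val_errors, rot_weight=0.001))
```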
By default, training samples are augmented with SE(2) rotations sampled from N(0, 60 deg). For tasks with rotational symmetries (like moving pieces on a chessboard) you need to be careful with this rotation augmentation parameter.
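To make the effect of this augmentation concrete, here is a small stand-alone sketch. It assumes the 60 deg in N(0, 60 deg) is a standard deviation, which the text does not state, and the function names are hypothetical rather than taken from the repository. The usage lines show why large rotations are a problem for directional language: a point to the left of a pivot ends up on the right after a 180 degree rotation.

```python
import math
import random


def sample_rotation(sigma_deg=60.0, rng=random):
    """Sample a rotation angle in radians from a zero-mean normal distribution.

    sigma_deg is treated as the standard deviation (an assumption).
    """
    return math.radians(rng.gauss(0.0, sigma_deg))


def rotate_about(point, pivot, theta):
    """Rotate a 2D point counter-clockwise by theta radians around pivot."""
    dx, dy = point[0] - pivot[0], point[1] - pivot[1]
    c, s = math.cos(theta), math.sin(theta)
    return (pivot[0] + c * dx - s * dy, pivot[1] + s * dx + c * dy)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    theta = sample_rotation(rng=rng)
    print("sampled angle (deg):", math.degrees(theta))

    # A point one unit to the left of the pivot flips to the right side
    print(rotate_about((-1.0, 0.0), (0.0, 0.0), math.pi))
```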
This work uses code from the following open-source projects and datasets:
Original: https://github.com/google-research/ravens
License: Apache 2.0
Changes: All PyBullet tasks are directly adapted from the Ravens codebase. The original TransporterNets models were reimplemented in PyTorch.
Original: https://github.com/openai/CLIP
License: MIT
Changes: Minor modifications to CLIP-ResNet50 to save intermediate features for skip connections.
Original: Dataset
License: Creative Commons BY 4.0
Changes: Fixed center-of-mass (COM) to be geometric-center for selected objects.
Original: https://github.com/milesial/Pytorch-UNet/
License: GPL 3.0
Changes: Used as is in unet.py. Note: This part of the code is GPL 3.0.
CLIPort
@inproceedings{shridhar2021cliport,
title = {CLIPort: What and Where Pathways for Robotic Manipulation},
author = {Shridhar, Mohit and Manuelli, Lucas and Fox, Dieter},
booktitle = {Proceedings of the 5th Conference on Robot Learning (CoRL)},
year = {2021},
}
CLIP
@article{radford2021learning,
title={Learning transferable visual models from natural language supervision},
author={Radford, Alec and Kim, Jong Wook and Hallacy, Chris and Ramesh, Aditya and Goh, Gabriel and Agarwal, Sandhini and Sastry, Girish and Askell, Amanda and Mishkin, Pamela and Clark, Jack and others},
journal={arXiv preprint arXiv:2103.00020},
year={2021}
}
TransporterNets
@inproceedings{zeng2020transporter,
title={Transporter networks: Rearranging the visual world for robotic manipulation},
author={Zeng, Andy and Florence, Pete and Tompson, Jonathan and Welker, Stefan and Chien, Jonathan and Attarian, Maria and Armstrong, Travis and Krasin, Ivan and Duong, Dan and Sindhwani, Vikas and others},
booktitle={Proceedings of the 4th Conference on Robot Learning (CoRL)},
year= {2020},
}
Please file an issue with the issue tracker.