This repository contains the training and testing code, pretrained models, and data for MMVID.
# Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Ligong Han, Jian Ren, Hsin-Ying Lee, Francesco Barbieri, Kyle Olszewski, Shervin Minaee, Dimitris Metaxas, Sergey Tulyakov

Snap Inc., Rutgers University

CVPR 2022
Download OpenAI's pretrained CLIP model and place it under `./` (or any other directory consistent with the `--openai_clip_model_path` argument):

```bash
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
```
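If you keep the checkpoint somewhere other than `./`, pass its location through `--openai_clip_model_path`. A minimal sketch, assuming the training scripts forward extra flags to the underlying training command (the checkpoint directory below is hypothetical):

```bash
# Hypothetical layout: store the CLIP weights in a shared checkpoint directory.
mkdir -p /data/checkpoints
mv ViT-B-32.pt /data/checkpoints/

# Point the run at the relocated weights; this assumes the script
# passes unrecognized flags through to the Python entry point.
bash scripts/mmvoxceleb/text_to_video/train.sh \
  --openai_clip_model_path /data/checkpoints/ViT-B-32.pt
```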
Code for finetuning VQGAN models is provided in this repo.

For testing, please download the pretrained models and change the path given to `--dalle_path` in the scripts. For quantitative evaluation, append `--eval_mode eval` to each testing command. The output log directory can be changed by appending `--name_suffix _fvd` to add a suffix (example here).
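Put together, a quantitative evaluation run might look like the sketch below. It assumes the test script forwards extra flags (otherwise, edit the command inside the script), and the checkpoint path is a placeholder:

```bash
# Hypothetical invocation; substitute the path of your downloaded checkpoint.
bash scripts/mmvoxceleb/text_to_video/test.sh \
  --dalle_path ./checkpoints/text_to_video.pt \
  --eval_mode eval \
  --name_suffix _fvd   # logs go to an output directory suffixed with _fvd
```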
## Text-to-Video

```bash
bash scripts/mmvoxceleb/text_to_video/train.sh
bash scripts/mmvoxceleb/text_to_video/test.sh
bash scripts/mmvoxceleb/text_to_video/evaluation.sh
```
## Text Augmentation

Text augmentation can improve training. To use a pretrained RoBERTa model, append `--fixed_language_model roberta-large` to the training/testing command. Note that this feature is experimental and not very robust.

To enable text dropout, append `--drop_sentence` to the training command. Text dropout is also compatible with using RoBERTa. We observed that text dropout generally improves diversity in the generated videos. A combined example is sketched after the commands below.

```bash
bash scripts/mmvoxceleb/text_augement/train.sh
bash scripts/mmvoxceleb/text_augement/test.sh
```
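Both options can be combined in one run. A sketch, under the same assumption that the script forwards extra flags:

```bash
# Hypothetical: RoBERTa text encoding plus text dropout in a single training run.
bash scripts/mmvoxceleb/text_augement/train.sh \
  --fixed_language_model roberta-large \
  --drop_sentence
```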
## Text and Mask

```bash
bash scripts/mmvoxceleb/text_and_mask/train.sh
bash scripts/mmvoxceleb/text_and_mask/test.sh
```

## Text and Drawing

```bash
bash scripts/mmvoxceleb/text_and_drawing/train.sh
bash scripts/mmvoxceleb/text_and_drawing/test.sh
```

## Drawing and Mask

```bash
bash scripts/mmvoxceleb/drawing_and_mask/train.sh
bash scripts/mmvoxceleb/drawing_and_mask/test.sh
```

## Image and Mask

```bash
bash scripts/mmvoxceleb/image_and_mask/train.sh
bash scripts/mmvoxceleb/image_and_mask/test.sh
```

## Text and Partial Image

```bash
bash scripts/mmvoxceleb/image_and_mask/train.sh
bash scripts/mmvoxceleb/image_and_mask/test.sh
```

## Image and Video

```bash
bash scripts/mmvoxceleb/image_and_mask/train.sh
bash scripts/mmvoxceleb/image_and_mask/test.sh
```
## Pretrained Models

Pretrained models are provided here.

| Model | Weight | FVD |
|---|---|---|
| VQGAN (vae) | ckpt | - |
| VQGAN (cvae for image conditioning) | ckpt | - |
| Text-to-Video | pt | 59.46 |
| Text-to-Video (ARTV) | pt | 70.95 |
| Text and Mask | pt | - |
| Text and Drawing | pt | - |
| Drawing and Mask | pt | - |
| Image and Mask | pt | - |
| Text and Partial Image | pt | - |
| Image and Video | pt | - |
| Text Augmentation | pt | - |
## Multimodal VoxCeleb Dataset

The Multimodal VoxCeleb dataset contains a total of 19,522 videos covering 3,437 interview situations from 453 people. Please see mm_vox_celeb/README.md for details on preparing the dataset. Preprocessed data is also available here.
## Acknowledgments

This code is heavily based on DALLE-PyTorch and uses CLIP, Taming Transformers, Precision Recall Distribution, Frechet Video Distance, Facenet-PyTorch, Face Parsing, and Unpaired Portrait Drawing. The authors thank everyone who makes their code and models available.
## Citation

If our code, data, or models help your work, please cite our paper:

```bibtex
@inproceedings{han2022show,
  title={Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning},
  author={Han, Ligong and Ren, Jian and Lee, Hsin-Ying and Barbieri, Francesco and Olszewski, Kyle and Minaee, Shervin and Metaxas, Dimitris and Tulyakov, Sergey},
  booktitle={CVPR},
  year={2022}
}
```