This repository contains the training and testing code, pretrained models, and data for MMVID.
# Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning

Ligong Han, Jian Ren, Hsin-Ying Lee, Francesco Barbieri, Kyle Olszewski, Shervin Minaee, Dimitris Metaxas, Sergey Tulyakov

Snap Inc., Rutgers University

CVPR 2022
Download OpenAI's pretrained CLIP model and place it under `./` (or any other directory consistent with the `--openai_clip_model_path` argument):

```bash
wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
```
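If you keep the checkpoint somewhere other than `./`, pass its location through `--openai_clip_model_path`. A minimal sketch, assuming the training scripts forward extra flags to the underlying training command (the checkpoint directory below is hypothetical):

```bash
# Hypothetical layout: store the CLIP weights in a shared checkpoint directory.
mkdir -p /data/checkpoints
mv ViT-B-32.pt /data/checkpoints/

# Point the run at the relocated weights; this assumes the script
# passes unrecognized flags through to the Python entry point.
bash scripts/mmvoxceleb/text_to_video/train.sh \
  --openai_clip_model_path /data/checkpoints/ViT-B-32.pt
```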
Code for finetuning VQGAN models is provided in this repo.

For testing, please download the pretrained models and change the path given to `--dalle_path` in the scripts. For quantitative evaluation, append `--eval_mode eval` to each testing command. The output log directory can be changed by appending `--name_suffix _fvd` to add a suffix (example here).
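Put together, a quantitative evaluation run might look like the sketch below. It assumes the test script forwards extra flags (otherwise, edit the command inside the script), and the checkpoint path is a placeholder:

```bash
# Hypothetical invocation; substitute the path of your downloaded checkpoint.
bash scripts/mmvoxceleb/text_to_video/test.sh \
  --dalle_path ./checkpoints/text_to_video.pt \
  --eval_mode eval \
  --name_suffix _fvd   # logs go to an output directory suffixed with _fvd
```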
## Text-to-Video

```bash
bash scripts/mmvoxceleb/text_to_video/train.sh
bash scripts/mmvoxceleb/text_to_video/test.sh
bash scripts/mmvoxceleb/text_to_video/evaluation.sh
```
## Text Augmentation

Text augmentation can improve training. To use a pretrained RoBERTa model, append `--fixed_language_model roberta-large` to the training/testing command. Note that this feature is experimental and not very robust.

To enable text dropout, append `--drop_sentence` to the training command. Text dropout is also compatible with using RoBERTa. We observed that text dropout generally improves diversity in the generated videos. A combined example is sketched after the commands below.

```bash
bash scripts/mmvoxceleb/text_augement/train.sh
bash scripts/mmvoxceleb/text_augement/test.sh
```
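Both options can be combined in one run. A sketch, under the same assumption that the script forwards extra flags:

```bash
# Hypothetical: RoBERTa text encoding plus text dropout in a single training run.
bash scripts/mmvoxceleb/text_augement/train.sh \
  --fixed_language_model roberta-large \
  --drop_sentence
```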
## Text and Mask

```bash
bash scripts/mmvoxceleb/text_and_mask/train.sh
bash scripts/mmvoxceleb/text_and_mask/test.sh
```

## Text and Drawing

```bash
bash scripts/mmvoxceleb/text_and_drawing/train.sh
bash scripts/mmvoxceleb/text_and_drawing/test.sh
```

## Drawing and Mask

```bash
bash scripts/mmvoxceleb/drawing_and_mask/train.sh
bash scripts/mmvoxceleb/drawing_and_mask/test.sh
```

## Image and Mask

```bash
bash scripts/mmvoxceleb/image_and_mask/train.sh
bash scripts/mmvoxceleb/image_and_mask/test.sh
```

## Text and Partial Image

```bash
bash scripts/mmvoxceleb/image_and_mask/train.sh
bash scripts/mmvoxceleb/image_and_mask/test.sh
```

## Image and Video

```bash
bash scripts/mmvoxceleb/image_and_mask/train.sh
bash scripts/mmvoxceleb/image_and_mask/test.sh
```
## Pretrained Models

Pretrained models are provided here.

| Model | Weight | FVD |
|---|---|---|
| VQGAN (vae) | ckpt | - |
| VQGAN (cvae for image conditioning) | ckpt | - |
| Text-to-Video | pt | 59.46 |
| Text-to-Video (ARTV) | pt | 70.95 |
| Text and Mask | pt | - |
| Text and Drawing | pt | - |
| Drawing and Mask | pt | - |
| Image and Mask | pt | - |
| Text and Partial Image | pt | - |
| Image and Video | pt | - |
| Text Augmentation | pt | - |
## Multimodal VoxCeleb Dataset

The Multimodal VoxCeleb dataset contains a total of 19,522 videos covering 3,437 interview situations from 453 people. Please see mm_vox_celeb/README.md for details on preparing the dataset. Preprocessed data is also available here.
## Acknowledgments

This code is heavily based on DALLE-PyTorch and uses CLIP, Taming Transformers, Precision Recall Distribution, Frechet Video Distance, Facenet-PyTorch, Face Parsing, and Unpaired Portrait Drawing. The authors thank everyone who makes their code and models available.
## Citation

If our code, data, or models help your work, please cite our paper:

```bibtex
@inproceedings{han2022show,
  title={Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning},
  author={Han, Ligong and Ren, Jian and Lee, Hsin-Ying and Barbieri, Francesco and Olszewski, Kyle and Minaee, Shervin and Metaxas, Dimitris and Tulyakov, Sergey},
  booktitle={CVPR},
  year={2022}
}
```