[Technical Report] TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

This repo is the official implementation of the paper TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale.

*(Figure 2: image omitted)*

Main Results

Zero-shot Text-to-Video Retrieval

*(Table 2: zero-shot text-to-video retrieval results; image omitted)*

Zero-shot Action Recognition

*(Table 3: zero-shot action recognition results; image omitted)*

Linear Probe

*(Table 4: linear probe results; image omitted)*

Instruction

Environment Setup

Before you start, run the following command to set up your Python environment.

pip install -r requirement.txt
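
After installation, a quick sanity check of the environment can save time later. The snippet below assumes the PyTorch-based stack implied by the CLIP and OpenCLIP dependencies:

```python
# Quick environment check; assumes a PyTorch-based stack,
# as implied by the CLIP / OpenCLIP dependencies.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```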

Dataset Preparation

Dataset Splits

We have uploaded the dataset splits to Google Drive. Download them from this link and unzip them in the root directory.

Pre-training Datasets

  1. Download YT-Temporal from here, and put the dataset under the folder data/YTTemporal.
  2. Download WebVid-2M from here, and put the dataset under the folder data/WebVid.

Downstream Datasets

Text-to-Video Retrieval

  1. Download MSR-VTT from here, and put the dataset under the folder data/msrvtt.
  2. Download DiDeMo from here, and put the dataset under the folder data/didemo.
  3. Download LSMDC from here, and put the dataset under the folder data/lsmdc.

Action Recognition

  1. Download HMDB-51 from here, and put the dataset under the folder data/hmdb51.
  2. Download UCF-101 from here, and put the dataset under the folder data/ucf101.
  3. Download Kinetics-400 from here, and put the dataset under the folder data/k400.
  4. Download SSV2 from here, and put the dataset under the folder data/SSV2.
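
Once all datasets are downloaded, a quick check like the one below can confirm the expected directory layout (a minimal sketch; the folder names simply follow the paths listed above):

```python
from pathlib import Path

# Expected dataset folders, following the paths listed above.
EXPECTED_DIRS = [
    "data/YTTemporal", "data/WebVid",                        # pre-training
    "data/msrvtt", "data/didemo", "data/lsmdc",              # retrieval
    "data/hmdb51", "data/ucf101", "data/k400", "data/SSV2",  # recognition
]

missing = [d for d in EXPECTED_DIRS if not Path(d).is_dir()]
print("Missing folders:", ", ".join(missing) if missing else "none")
```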

Training and Evaluation

We use up to 80 NVIDIA V100 GPUs for pre-training. Detailed hyper-parameters can be found in the Appendix of our paper.

Pre-training

  1. Download CLIP-B/32 and CLIP-B/16 weights from OpenAI’s official repo, and put them into CLIP/models.

  2. Download OpenCLIP-H/14 weights from the official repo, and put them into OpenCLIP/models.

  3. Run one of the following scripts to pre-train the corresponding model jointly on the YT-Temporal and WebVid datasets.

    bash scripts/train_dist_TVTSv2_ViT_B_32.sh # for ViT-B/32, no mask
    bash scripts/train_dist_TVTSv2_ViT_B_16.sh # for ViT-B/16, mask 50%
    bash scripts/train_dist_TVTSv2_ViT_H_14.sh # for ViT-H/14, mask 70%
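
The mask percentages in the comments above denote the fraction of video patch tokens randomly dropped during pre-training to reduce computation. The sketch below only illustrates ratio-based random token masking; the function name `random_mask` and the tensor shapes are hypothetical, not the repo's API:

```python
import torch

def random_mask(tokens: torch.Tensor, mask_ratio: float) -> torch.Tensor:
    """Keep a random (1 - mask_ratio) subset of patch tokens.

    tokens: [batch, num_tokens, dim] patch embeddings (hypothetical shape).
    Returns the kept tokens with shape [batch, num_kept, dim].
    """
    b, n, d = tokens.shape
    num_keep = int(n * (1 - mask_ratio))
    # Random permutation per sample; keep the first `num_keep` indices.
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :num_keep]
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

# Example: ViT-B/16 masks 50% of tokens, ViT-H/14 masks 70%.
x = torch.randn(2, 196, 768)
print(random_mask(x, 0.5).shape)  # torch.Size([2, 98, 768])
```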

Downstream Evaluation

We have released our pre-trained models on Google Drive at the links below so that the results reported in our paper can be quickly reproduced.

  1. TVTSv2_B_32: https://drive.google.com/file/d/1zNHgqioo-aRUwZXPyTDiRT2uaRrnk386/view?usp=sharing
  2. TVTSv2_B_16: https://drive.google.com/file/d/1HKc7aGwMd5jhVaYztuY-jbmYqiz_wvWF/view?usp=sharing
  3. TVTSv2_H_14: https://drive.google.com/file/d/1nxNSaQKm2jt9NSZ3eLnKx7ATTumV-6D5/view?usp=sharing

Download the pre-trained models and put them in the root directory. All zero-shot evaluation scripts run on a single GPU. Try our powerful models now 😎!

# MSR-VTT Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_msrvtt_TVTSv2_ViT_H_14.sh # for ViT-H/14
# DiDeMo Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_didemo_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_didemo_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_didemo_TVTSv2_ViT_H_14.sh # for ViT-H/14
# LSMDC Zero-shot Text-to-Video Retrieval
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ret_lsmdc_TVTSv2_ViT_H_14.sh # for ViT-H/14
# HMDB-51 Zero-shot Action Recognition
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_hmdb51_TVTSv2_ViT_H_14.sh # for ViT-H/14
# UCF-101 Zero-shot Action Recognition
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_ucf101_TVTSv2_ViT_H_14.sh # for ViT-H/14
# Kinetics-400 Zero-shot Action Recognition
bash scripts/zero_recognition_k400_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_recognition_k400_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_recognition_k400_TVTSv2_ViT_H_14.sh # for ViT-H/14
# SSV2-MC Zero-shot Action Recognition
bash scripts/zero_ssv2_mc_TVTSv2_ViT_B_32.sh # for ViT-B/32
bash scripts/zero_ssv2_mc_TVTSv2_ViT_B_16.sh # for ViT-B/16
bash scripts/zero_ssv2_mc_TVTSv2_ViT_H_14.sh # for ViT-H/14

Tip: Performance may differ slightly (either higher or lower) from the numbers in our paper due to differences in hardware environments.
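
If an evaluation script fails to start, a quick way to check that a downloaded checkpoint is intact is to load it directly. A minimal sketch (the file name and the key layout inside the checkpoint are assumptions):

```python
import torch

# Hypothetical file name; use the checkpoint you actually downloaded.
ckpt = torch.load("TVTSv2_B_32.pth", map_location="cpu")

# Checkpoints are commonly either a raw state_dict or a dict wrapping one.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print(len(state_dict), "tensors; first keys:", list(state_dict)[:3])
```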

Video Feature Extraction

Our model can act as a standalone video feature extractor, and we provide simple scripts for out-of-the-box usage. Try it on your own video 😜!

cd downstream
python feature_extraction_TVTSv2_B_32.py --video_path /path/to/video.mp4 # for ViT-B/32, feature shape: [1 x 512]
python feature_extraction_TVTSv2_B_16.py --video_path /path/to/video.mp4 # for ViT-B/16, feature shape: [1 x 512]
python feature_extraction_TVTSv2_H_14.py --video_path /path/to/video.mp4 # for ViT-H/14, feature shape: [1 x 1024]
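
Each script produces a single video-level feature (512-d for ViT-B/32 and ViT-B/16, 1024-d for ViT-H/14), which can be used directly, for example to measure video-to-video similarity. A minimal sketch with placeholder tensors (replace them with features produced by the scripts above):

```python
import torch
import torch.nn.functional as F

# Placeholders; replace with features produced by the scripts above
# (shape [1, 512] for ViT-B/32 and ViT-B/16, [1, 1024] for ViT-H/14).
feat_a = torch.randn(1, 512)
feat_b = torch.randn(1, 512)

# Video-to-video similarity via cosine similarity of the two features.
sim = F.cosine_similarity(feat_a, feat_b, dim=-1).item()
print(f"cosine similarity: {sim:.4f}")
```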

Acknowledgement

Citation

If you find our work helpful, please cite our paper.

@misc{zeng2023tvtsv2,
      title={TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale}, 
      author={Ziyun Zeng and Yixiao Ge and Zhan Tong and Xihui Liu and Shu-Tao Xia and Ying Shan},
      year={2023},
      eprint={2305.14173},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

License

This work references several open-source projects, to which credit is given. See License.txt for details.