by Qianyu Zhou, Xiangtai Li, Lu He, Yibo Yang, Guangliang Cheng, Yunhai Tong, Lizhuang Ma, Dacheng Tao
(TPAMI 2023) TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers.
🔔 We are happy to announce that TransVOD was accepted by IEEE TPAMI.
🔔 We are happy to announce that our method is the first to achieve 90% mAP on the ImageNet VID dataset.
- (December 2022) Checkpoints of pretrained models are released.
- (December 2022) Code of TransVOD Lite is released.
If you find TransVOD useful in your research, please consider citing:
@article{zhou2022transvod,
author={Zhou, Qianyu and Li, Xiangtai and He, Lu and Yang, Yibo and Cheng, Guangliang and Tong, Yunhai and Ma, Lizhuang and Tao, Dacheng},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
title={TransVOD: End-to-End Video Object Detection With Spatial-Temporal Transformers},
year={2022},
pages={1-16},
doi={10.1109/TPAMI.2022.3223955}}
@inproceedings{he2021end,
title={End-to-End Video Object Detection with Spatial-Temporal Transformers},
author={He, Lu and Zhou, Qianyu and Li, Xiangtai and Niu, Li and Cheng, Guangliang and Li, Xiao and Liu, Wenxuan and Tong, Yunhai and Ma, Lizhuang and Zhang, Liqing},
booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
pages={1507--1516},
year={2021}
}
Our proposed method, TransVOD Lite, achieves the best trade-off between speed and accuracy across different backbones. SwinB, SwinS, and SwinT denote Swin Base, Small, and Tiny.
Note:
- All TransVOD models are trained starting from weights pre-trained on the COCO dataset.
The codebase is built on top of Deformable DETR and TransVOD.
- Linux, CUDA>=9.2, GCC>=5.4
- Python>=3.7
We recommend using Anaconda to create a conda environment:
conda create -n TransVOD python=3.7 pip
Then, activate the environment:
conda activate TransVOD
- PyTorch>=1.5.1, torchvision>=0.6.1 (following the instructions here)
For example, if your CUDA version is 9.2, you can install PyTorch and torchvision as follows:
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
- Other requirements
pip install -r requirements.txt
- Build MultiScaleDeformableAttention
cd ./models/ops
sh ./make.sh
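Once the steps above are done, the installed versions can be compared against the minimums listed in the requirements. The following is a minimal sketch that uses only the standard library for the comparison; `parse_version` and `meets_minimum` are hypothetical helpers written for this illustration and are not part of the TransVOD codebase.

```python
import re
import sys


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    # Drop local build suffixes such as "+cu92" before splitting on dots.
    core = text.split("+")[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """True when `installed` is at least `required` (numeric comparison)."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    print("Python >= 3.7:", sys.version_info >= (3, 7))
    try:
        import torch
        import torchvision
    except ImportError:
        print("PyTorch or torchvision is not installed in this environment.")
    else:
        print("PyTorch >= 1.5.1:", meets_minimum(torch.__version__, "1.5.1"))
        print("torchvision >= 0.6.1:", meets_minimum(torchvision.__version__, "0.6.1"))
        print("CUDA available:", torch.cuda.is_available())
```

Running the script inside the activated conda environment prints one line per requirement. It does not check the GCC or CUDA toolkit versions, which still need to be verified separately.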
Below, we provide checkpoints, training logs and inference logs of TransVOD Lite for different backbones.
Download link: Baidu Netdisk (password: 26xc)
- Please download the ILSVRC2015 DET and ILSVRC2015 VID datasets from here. We then convert the JSON annotations of the two datasets using the code. Alternatively, you can directly download the joint JSON file of the two datasets that we have already converted. After that, we recommend symlinking the dataset path to datasets/. The path structure should be as follows:
code_root/
└── data/
    └── vid/
        ├── Data
        │   ├── VID/
        │   └── DET/
        └── annotations/
            ├── imagenet_vid_train.json
            ├── imagenet_vid_train_joint_30.json
            └── imagenet_vid_val.json
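Before launching training, it can help to confirm that the layout above is in place. The sketch below is a minimal check, assuming it is run from code_root and that the dataset root is data/vid as shown in the tree; `missing_dataset_paths` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED_PATHS = [
    "Data/VID",
    "Data/DET",
    "annotations/imagenet_vid_train.json",
    "annotations/imagenet_vid_train_joint_30.json",
    "annotations/imagenet_vid_val.json",
]


def missing_dataset_paths(vid_root):
    """Return the expected entries that do not exist under `vid_root`."""
    root = Path(vid_root)
    # Path.exists() follows symlinks, so symlinked datasets are accepted.
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths("data/vid")
    if missing:
        print("Missing entries under data/vid:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Dataset layout looks complete.")
```

The check only tests for existence; it does not validate the contents of the annotation files or the image folders.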
We use Swin Transformer as the network backbone. We train TransVOD with Swin-Base as the backbone as follows:
- Train SingleFrameBaseline. You can download COCO pretrained weights from the aforementioned link.
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 swinb $2 configs/swinb_train_single.sh
- Train TransVOD Lite. Use the model weights of SingleFrameBaseline as the resume model.
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 swinb $2 configs/swinb_train_multi.sh
If you are using a Slurm cluster, you can run the following command to train on 1 node with 8 GPUs:
GPUS_PER_NODE=8 ./tools/run_dist_slurm.sh <partition> swinb 8 configs/swinb_train_multi.sh
You can get the config file and pretrained model of TransVOD (the link is in the "Checkpoint" section), then put the pretrained model into the corresponding folder.
code_root/
└── exps/
    └── our_models/
        ├── COCO_pretrained_model
        ├── exps_single
        └── exps_multi
Then run the following command to evaluate it on the ImageNet VID validation set:
GPUS_PER_NODE=8 ./tools/run_dist_launch.sh $1 eval_swinb $2 configs/swinb_eval_multi.sh
This project is based on the following open-source projects. We thank their authors for making the source code publicly available.
This project is released under the Apache License 2.0, while some specific features in this repository are covered by other licenses. If you are using our code for commercial purposes, please check LICENSES.md carefully.