PySlowFast Model Zoo and Baselines

Kinetics 400 and 600

| architecture | size | crops x clips | frame length x sample rate | top1 | top5 | model | config | dataset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C2D | R50 | 3 x 10 | 8 x 8 | 67.2 | 87.8 | link | Kinetics/c2/C2D_NOPOOL_8x8_R50 | K400 |
| I3D | R50 | 3 x 10 | 8 x 8 | 73.5 | 90.8 | link | Kinetics/c2/I3D_8x8_R50 | K400 |
| I3D NLN | R50 | 3 x 10 | 8 x 8 | 74.0 | 91.1 | link | Kinetics/c2/I3D_NLN_8x8_R50 | K400 |
| Slow | R50 | 3 x 10 | 4 x 16 | 72.7 | 90.3 | link | Kinetics/c2/SLOW_4x16_R50 | K400 |
| Slow | R50 | 3 x 10 | 8 x 8 | 74.8 | 91.6 | link | Kinetics/c2/SLOW_8x8_R50 | K400 |
| SlowFast | R50 | 3 x 10 | 4 x 16 | 75.6 | 92.0 | link | Kinetics/c2/SLOWFAST_4x16_R50 | K400 |
| SlowFast | R50 | 3 x 10 | 8 x 8 | 77.0 | 92.6 | link | Kinetics/c2/SLOWFAST_8x8_R50 | K400 |
| MViTv1 | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.5 | link | Kinetics/MVIT_B_16x4_CONV | K400 |
| rev-MViT | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.4 | link | Kinetics/REV_MVIT_B_16x4_CONV | K400 |
| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 80.4 | 94.8 | link | Kinetics/MVIT_B_32x3_CONV | K400 |
| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 83.9 | 96.5 | link | Kinetics/MVIT_B_32x3_CONV_K600 | K600 |
| MViTv2 | S | 1 x 5 | 16 x 4 | 81.0 | 94.6 | link | Kinetics/MVITv2_S_16x4 | K400 |
| MViTv2 | B | 1 x 5 | 32 x 3 | 82.9 | 95.7 | link | Kinetics/MVITv2_B_32x3 | K400 |
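
The "crops x clips" and "frame length x sample rate" columns can be unpacked with a little arithmetic. The sketch below assumes that "crops x clips" multiplies out to the number of test views per video, and that a clip of `frame_length` frames sampled every `sample_rate` raw frames covers roughly `frame_length * sample_rate` raw frames. The helper names `parse_pair` and `describe_row` are hypothetical and are not part of PySlowFast.

```python
def parse_pair(text):
    """Parse a table cell such as '3 x 10' into a pair of ints."""
    left, right = text.lower().split("x")
    return int(left), int(right)


def describe_row(crops_x_clips, length_x_rate):
    """Summarize the testing and sampling setup implied by one table row.

    Assumption: views = spatial crops * temporal clips, and the raw frame
    span of a clip is approximately frame_length * sample_rate.
    """
    crops, clips = parse_pair(crops_x_clips)
    frame_length, sample_rate = parse_pair(length_x_rate)
    return {
        "views": crops * clips,
        "frames_per_clip": frame_length,
        "raw_frame_span": frame_length * sample_rate,
    }


# SlowFast R50 8 x 8 row: 3 crops x 10 clips, 8 frames at stride 8.
print(describe_row("3 x 10", "8 x 8"))
# MViTv1 B-Conv 16 x 4 row: 1 crop x 5 clips, 16 frames at stride 4.
print(describe_row("1 x 5", "16 x 4"))
```

Under these assumptions, both rows cover about 64 raw frames per clip, but the first is evaluated on 30 views and the second on 5.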

X3D models (details in projects/x3d)

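| architecture | size | pretrain | frame length x sample rate | top1 10-view | top1 30-view | parameters (M) | FLOPs (G) | model | config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| X3D | XS | - | 4 x 12 | 68.7 | 69.5 | 3.8 | 0.60 | link | Kinetics/X3D_XS |
| X3D | S | - | 13 x 6 | 73.1 | 73.5 | 3.8 | 1.96 | link | Kinetics/X3D_S |
| X3D | M | - | 16 x 5 | 75.1 | 76.2 | 3.8 | 4.73 | link | Kinetics/X3D_M |
| X3D | L | - | 16 x 5 | 76.9 | 77.5 | 6.2 | 18.37 | link | Kinetics/X3D_L |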
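
The 10-view and 30-view columns report video-level accuracy obtained from several views of the same video. A common way to combine views, assumed here and not confirmed by this page, is to average the per-view softmax scores and take the highest-scoring class. The function names and the logits below are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def video_prediction(view_logits):
    """Average per-view softmax scores; return (class_index, mean_scores).

    Assumption: every view contributes equally to the video-level score.
    """
    num_views = len(view_logits)
    num_classes = len(view_logits[0])
    mean_scores = [0.0] * num_classes
    for logits in view_logits:
        for i, prob in enumerate(softmax(logits)):
            mean_scores[i] += prob / num_views
    best = max(range(num_classes), key=mean_scores.__getitem__)
    return best, mean_scores


# Three hypothetical views over three classes (made-up logits).
views = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.2], [0.3, 2.0, 0.1]]
label, scores = video_prediction(views)
print(label, [round(s, 3) for s in scores])
```

Here the first view alone would favor class 0, while the averaged scores favor class 1, which illustrates why accuracy can change with the number of views.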

AVA

| architecture | size | Pretrain Model | frame length x sample rate | mAP | AVA version | model |
| --- | --- | --- | --- | --- | --- | --- |
| Slow | R50 | Kinetics 400 | 4 x 16 | 19.5 | 2.2 | link |
| SlowFast | R101 | Kinetics 600 | 8 x 8 | 28.2 | 2.1 | link |
| SlowFast | R101 | Kinetics 600 | 8 x 8 | 29.1 | 2.2 | link |
| SlowFast | R101 | Kinetics 600 | 16 x 8 | 29.4 | 2.2 | link |

Multigrid Training

Update (June 2020): Below we provide reimplemented models from the paper "A Multigrid Method for Efficiently Training Video Models". The multigrid method trains about 3-6x faster than the original training procedure on multiple datasets. See projects/multigrid for more information. The tables below list models, results, and example config files.

Kinetics:

| architecture | size | pretrain | frame length x sample rate | training | top1 | top5 | model | config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SlowFast | R50 | - | 8 x 8 | Standard | 76.8 | 92.7 | link | Kinetics/SLOWFAST_8x8_R50_stepwise |
| SlowFast | R50 | - | 8 x 8 | Multigrid | 76.6 | 92.7 | link | Kinetics/SLOWFAST_8x8_R50_stepwise_multigrid |

(These models use a stepwise learning rate schedule.)

Something-Something V2:

| architecture | size | pretrain | frame length x sample rate | training | top1 | top5 | model | config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SlowFast | R50 | Kinetics 400 | 16 x 8 | Standard | 63.0 | 88.5 | link | SSv2/SLOWFAST_16x8_R50 |
| SlowFast | R50 | Kinetics 400 | 16 x 8 | Multigrid | 63.5 | 88.7 | link | SSv2/SLOWFAST_16x8_R50_multigrid |

Charades:

| architecture | size | pretrain | frame length x sample rate | training | mAP | model | config |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SlowFast | R50 | Kinetics 400 | 16 x 8 | Standard | 38.9 | link | SSv2/SLOWFAST_16x8_R50 |
| SlowFast | R50 | Kinetics 400 | 16 x 8 | Multigrid | 38.6 | link | SSv2/SLOWFAST_16x8_R50_multigrid |
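
To evaluate a released checkpoint, the config names in the tables above need to be turned into a command line. The sketch below assumes that a config entry such as `Kinetics/c2/SLOWFAST_8x8_R50` corresponds to a YAML file under a `configs/` directory, that the entry point is `tools/run_net.py`, and that options are overridden as `KEY VALUE` pairs after `--cfg`. None of this is stated on this page, so check it against the repository before relying on it. The helper `build_test_command` and all paths are hypothetical.

```python
import shlex


def build_test_command(config_name, checkpoint_path, data_dir, caffe2=False):
    """Assemble a test-only command for one model zoo entry.

    Assumptions (not taken from this page): configs live under
    configs/<config_name>.yaml, and config keys are overridden as
    trailing KEY VALUE pairs.
    """
    # Some table entries already carry a .yaml suffix; do not add it twice.
    cfg = config_name if config_name.endswith(".yaml") else config_name + ".yaml"
    cmd = [
        "python", "tools/run_net.py",
        "--cfg", "configs/" + cfg,
        "DATA.PATH_TO_DATA_DIR", data_dir,
        "TEST.CHECKPOINT_FILE_PATH", checkpoint_path,
        "TRAIN.ENABLE", "False",
    ]
    if caffe2:
        # Assumed to be needed for checkpoints converted from Caffe2
        # (the entries whose config path contains "c2").
        cmd += ["TEST.CHECKPOINT_TYPE", "caffe2"]
    return cmd


cmd = build_test_command(
    "Kinetics/c2/SLOWFAST_8x8_R50",
    "/path/to/checkpoint.pkl",  # placeholder path
    "/path/to/kinetics",        # placeholder path
    caffe2=True,
)
print(shlex.join(cmd))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues; `shlex.join` is used only to print a copy-pasteable form.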

ImageNet

We also release ImageNet-pretrained models for cases where finetuning from ImageNet is preferred. The reported accuracy is obtained by center-crop testing on the validation set.

| architecture | size | Top1 | Top5 | model | Config |
| --- | --- | --- | --- | --- | --- |
| ResNet | R50 | 76.4 | 93.2 | link | ImageNet/RES_R50 |
| MVIT | B-16-Conv | 82.9 | 96.3 | link | ImageNet/MVIT_B_16_CONV |
| rev-VIT | Small | 79.9 | 94.9 | link | ImageNet/REV_VIT_S.yaml |
| rev-VIT | Base | 81.8 | 95.6 | link | ImageNet/REV_VIT_B.yaml |
| rev-MVIT | Base | 82.9* | 96.3 | link | ImageNet/REV_MVIT_B_16_CONV.yaml |

*Please refer to the Reversible Model Zoo.
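
Center-crop testing, mentioned above for the ImageNet numbers, evaluates a single crop taken from the middle of each image. As a minimal sketch, the crop box can be computed as follows. The image and crop sizes in the example are illustrative assumptions and are not the settings used for the reported results, and `center_crop_box` is a hypothetical helper.

```python
def center_crop_box(height, width, crop_size):
    """Return (top, left, bottom, right) of a centered square crop."""
    if crop_size > min(height, width):
        raise ValueError("crop_size must fit inside the image")
    # Integer division places the crop as close to the center as possible.
    top = (height - crop_size) // 2
    left = (width - crop_size) // 2
    return top, left, top + crop_size, left + crop_size


# Hypothetical 256 x 341 image with a 224 x 224 crop.
print(center_crop_box(256, 341, 224))  # (16, 58, 240, 282)
```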

PyTorchVideo

We support and benchmark PyTorchVideo models and datasets in PySlowFast. See projects/pytorchvideo for more information about PyTorchVideo Model Zoo.