PySlowFast Model Zoo and Baselines
| architecture | size | crops x clips | frame length x sample rate | top1 | top5 | model | config | dataset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| C2D | R50 | 3 x 10 | 8 x 8 | 67.2 | 87.8 | link | Kinetics/c2/C2D_NOPOOL_8x8_R50 | K400 |
| I3D | R50 | 3 x 10 | 8 x 8 | 73.5 | 90.8 | link | Kinetics/c2/I3D_8x8_R50 | K400 |
| I3D NLN | R50 | 3 x 10 | 8 x 8 | 74.0 | 91.1 | link | Kinetics/c2/I3D_NLN_8x8_R50 | K400 |
| Slow | R50 | 3 x 10 | 4 x 16 | 72.7 | 90.3 | link | Kinetics/c2/SLOW_4x16_R50 | K400 |
| Slow | R50 | 3 x 10 | 8 x 8 | 74.8 | 91.6 | link | Kinetics/c2/SLOW_8x8_R50 | K400 |
| SlowFast | R50 | 3 x 10 | 4 x 16 | 75.6 | 92.0 | link | Kinetics/c2/SLOWFAST_4x16_R50 | K400 |
| SlowFast | R50 | 3 x 10 | 8 x 8 | 77.0 | 92.6 | link | Kinetics/c2/SLOWFAST_8x8_R50 | K400 |
| MViTv1 | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.5 | link | Kinetics/MVIT_B_16x4_CONV | K400 |
| rev-MViT | B-Conv | 1 x 5 | 16 x 4 | 78.4 | 93.4 | link | Kinetics/REV_MVIT_B_16x4_CONV | K400 |
| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 80.4 | 94.8 | link | Kinetics/MVIT_B_32x3_CONV | K400 |
| MViTv1 | B-Conv | 1 x 5 | 32 x 3 | 83.9 | 96.5 | link | Kinetics/MVIT_B_32x3_CONV_K600 | K600 |
| MViTv2 | S | 1 x 5 | 16 x 4 | 81.0 | 94.6 | link | Kinetics/MVITv2_S_16x4 | K400 |
| MViTv2 | B | 1 x 5 | 32 x 3 | 82.9 | 95.7 | link | Kinetics/MVITv2_B_32x3 | K400 |
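
The "crops x clips" and "frame length x sample rate" columns can be turned into two derived numbers: the total number of test views per video, and the approximate number of raw frames a clip spans. The sketch below assumes the usual reading of these columns (spatial crops times temporal clips, and sampled frames times temporal stride); the helper names are hypothetical and not part of PySlowFast.

```python
from typing import Tuple


def parse_pair(cell: str) -> Tuple[int, int]:
    """Parse a table cell such as '3 x 10' into the integer pair (3, 10)."""
    left, right = cell.lower().split("x")
    return int(left), int(right)


def num_views(crops_x_clips: str) -> int:
    """Total test views per video: spatial crops times temporal clips."""
    crops, clips = parse_pair(crops_x_clips)
    return crops * clips


def temporal_span(frames_x_rate: str) -> int:
    """Approximate raw-frame span of one clip: sampled frames times stride."""
    frames, rate = parse_pair(frames_x_rate)
    return frames * rate


if __name__ == "__main__":
    print(num_views("3 x 10"))      # 30 views for the R50 rows
    print(num_views("1 x 5"))       # 5 views for the MViT rows
    print(temporal_span("8 x 8"))   # 64 raw frames
    print(temporal_span("32 x 3"))  # 96 raw frames
```

Under this reading, the R50 rows are evaluated with 30 views per video and the MViT rows with 5, which is worth keeping in mind when comparing accuracy across the two groups.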
X3D models (details in projects/x3d)
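| architecture | size | pretrain | frame length x sample rate | top1 10-view | top1 30-view | parameters (M) | FLOPs (G) | model | config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| X3D | XS | - | 4 x 12 | 68.7 | 69.5 | 3.8 | 0.60 | link | Kinetics/X3D_XS |
| X3D | S | - | 13 x 6 | 73.1 | 73.5 | 3.8 | 1.96 | link | Kinetics/X3D_S |
| X3D | M | - | 16 x 5 | 75.1 | 76.2 | 3.8 | 4.73 | link | Kinetics/X3D_M |
| X3D | L | - | 16 x 5 | 76.9 | 77.5 | 6.2 | 18.37 | link | Kinetics/X3D_L |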
Update, June 2020: Below we provide reimplemented models from the paper "A Multigrid Method for Efficiently Training Video Models". The multigrid method trains about 3-6x faster than the original training on multiple datasets. See projects/multigrid for more information. Models, results, and example config files follow.
| architecture | size | pretrain | frame length x sample rate | training | top1 | top5 | model | config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SlowFast | R50 | - | 8 x 8 | Standard | 76.8 | 92.7 | link | Kinetics/SLOWFAST_8x8_R50_stepwise |
| SlowFast | R50 | - | 8 x 8 | Multigrid | 76.6 | 92.7 | link | Kinetics/SLOWFAST_8x8_R50_stepwise_multigrid |
(Here we use a stepwise learning rate schedule.)
| architecture | size | pretrain | frame length x sample rate | training | top1 | top5 | model | config |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SlowFast | R50 | Kinetics 400 | 16 x 8 | Standard | 63.0 | 88.5 | link | SSv2/SLOWFAST_16x8_R50 |
| SlowFast | R50 | Kinetics 400 | 16 x 8 | Multigrid | 63.5 | 88.7 | link | SSv2/SLOWFAST_16x8_R50_multigrid |
| architecture | size | pretrain | frame length x sample rate | training | mAP | model | config |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SlowFast | R50 | Kinetics 400 | 16 x 8 | Standard | 38.9 | link | SSv2/SLOWFAST_16x8_R50 |
| SlowFast | R50 | Kinetics 400 | 16 x 8 | Multigrid | 38.6 | link | SSv2/SLOWFAST_16x8_R50_multigrid |
We also release ImageNet-pretrained models for cases where finetuning from ImageNet is preferred. The reported accuracy is obtained by center-crop testing on the validation set.
| architecture | size | top1 | top5 | model | config |
| --- | --- | --- | --- | --- | --- |
| ResNet | R50 | 76.4 | 93.2 | link | ImageNet/RES_R50 |
| MVIT | B-16-Conv | 82.9 | 96.3 | link | ImageNet/MVIT_B_16_CONV |
| rev-VIT | Small | 79.9 | 94.9 | link | ImageNet/REV_VIT_S.yaml |
| rev-VIT | Base | 81.8 | 95.6 | link | ImageNet/REV_VIT_B.yaml |
| rev-MVIT | Base | 82.9* | 96.3 | link | ImageNet/REV_MVIT_B_16_CONV.yaml |
*Please refer to the Reversible Model Zoo.
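
The ImageNet accuracies above are reported with center-crop testing. As a rough illustration of what that means, the sketch below computes the pixel box of a single centered square crop. The 224-pixel crop size and the 256 x 341 example image are assumptions chosen for illustration; this page does not state the sizes used, and `center_crop_box` is a hypothetical helper, not a PySlowFast function.

```python
from typing import Tuple


def center_crop_box(height: int, width: int, crop_size: int) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a centered square crop.

    Raises ValueError if the crop does not fit inside the image.
    """
    if crop_size > min(height, width):
        raise ValueError("crop_size must not exceed the shorter image side")
    top = (height - crop_size) // 2
    left = (width - crop_size) // 2
    return top, left, top + crop_size, left + crop_size


if __name__ == "__main__":
    # Assumed example: a 256 x 341 image and a 224 x 224 crop.
    print(center_crop_box(256, 341, 224))  # (16, 58, 240, 282)
```

Only one view per image is evaluated this way, in contrast to the multi-view (crops x clips) testing used for the video models.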
We support and benchmark PyTorchVideo models and datasets in PySlowFast. See projects/pytorchvideo for more information about the PyTorchVideo Model Zoo.
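
The config column entries on this page are written inconsistently: most omit the file extension, while a few ImageNet entries end in `.yaml`. A small helper can normalize them into file paths. This is a minimal sketch that assumes the config files live under a `configs` directory and all use the `.yaml` extension; both are assumptions, and `config_path` is a hypothetical name.

```python
from pathlib import PurePosixPath


def config_path(entry: str, root: str = "configs") -> str:
    """Map a config-column entry to a path under an assumed config root.

    Appends '.yaml' when the entry does not already end with it.
    """
    name = entry.strip()
    if not name.endswith(".yaml"):
        name += ".yaml"
    return str(PurePosixPath(root) / name)


if __name__ == "__main__":
    print(config_path("Kinetics/c2/SLOWFAST_8x8_R50"))
    # configs/Kinetics/c2/SLOWFAST_8x8_R50.yaml
    print(config_path("ImageNet/REV_VIT_S.yaml"))
    # configs/ImageNet/REV_VIT_S.yaml
```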