You Only Look at One Sequence (YOLOS) (paper, code) is a series of object detection models based on the vanilla Vision Transformer (ViT) with the fewest possible modifications, region priors, and target-task inductive biases. Because YOLOS is pre-trained on the ImageNet-1k dataset and fine-tuned on the COCO dataset, its COCO object detection performance can serve as a challenging transfer learning benchmark for evaluating different (label-supervised or self-supervised) pre-training strategies for ViT.
EVA-02 (paper, code) is a series of visual pre-training models based on the ViT architecture and the Masked Image Modeling (MIM) pre-training strategy. After fine-tuning, EVA-02 models outperform previous models on various downstream tasks such as image classification and object detection.
This project applies the YOLOS method to EVA-02 models and evaluates them on the VOC2007 dataset to assess how well EVA-02 pre-trained models transfer.
Model | Params | Pre-train Epochs | Init Weight | Fine-tune Epochs | Eval Size | YOLOS Checkpoint / Log | AP @ VOC2007 test |
---|---|---|---|---|---|---|---|
VOC-YOLOS-Ti | 6M | 300 | DeiT-tiny | 300 | 512 | checkpoint / Log | 23.9 |
VOC-YOLOS-S | 23M | 300 | DeiT-small | 150 | 512 | checkpoint / Log | 31.1 |
VOC-YOLOS-EVA-Ti | 6M | 240+100 | EVA02-tiny | 300 | 512 | checkpoint / Log | 31.9 |
VOC-YOLOS-EVA-S | 23M | 240+100 | EVA02-small | 150 | 512 | checkpoint / Log | 42.0 |
Notes:
- The Pre-train Epochs of VOC-YOLOS-EVA is 240+100, meaning 240 MIM pre-training epochs plus 100 IN-1K fine-tuning epochs. In other words, we use the IN-1K fine-tuned EVA-02 weights as the initial checkpoint. We do not initialize directly from the MIM pre-training weights because VOC2007 is small, which makes it difficult to start training from MIM models that have never seen real images before. Subsequent experiments support this: for the Tiny model, the performance difference between the two initializations is not significant, but for the Small model, the model trained from MIM weights performs poorly.
- For EVA models, we interpolate the kernel size of patch_embed from 14x14 to 16x16. This is useful for object detection, instance segmentation, and semantic segmentation tasks.
- The comparison of these results may not be fair, as the EVA models use more data (IN-21K) during pre-training.
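The patch_embed resizing mentioned in the notes can be illustrated on a single 2D kernel slice. The following is a minimal sketch in plain Python using bilinear interpolation; the helper name `resize_kernel` is hypothetical, and the interpolation mode the repository actually uses is not stated here, so bilinear is an assumption. In PyTorch, the same idea would typically be applied to the whole weight tensor with `torch.nn.functional.interpolate`.

```python
def resize_kernel(kernel, new_size):
    """Bilinearly resize a square 2D kernel (list of lists) to new_size x new_size.

    Hypothetical helper for illustration only. Output pixel centers are
    mapped into input coordinates (half-pixel convention) and clamped.
    """
    old_size = len(kernel)
    scale = old_size / new_size

    def source_coords(i):
        # Map the center of output index i into input coordinates.
        pos = (i + 0.5) * scale - 0.5
        pos = min(max(pos, 0.0), old_size - 1.0)
        lo = int(pos)
        hi = min(lo + 1, old_size - 1)
        return lo, hi, pos - lo

    resized = []
    for r in range(new_size):
        r0, r1, fr = source_coords(r)
        row = []
        for c in range(new_size):
            c0, c1, fc = source_coords(c)
            top = kernel[r0][c0] * (1 - fc) + kernel[r0][c1] * fc
            bottom = kernel[r1][c0] * (1 - fc) + kernel[r1][c1] * fc
            row.append(top * (1 - fr) + bottom * fr)
        resized.append(row)
    return resized


# One 14x14 kernel slice holding a horizontal ramp, resized to 16x16.
kernel_14 = [[float(c) for c in range(14)] for _ in range(14)]
kernel_16 = resize_kernel(kernel_14, 16)
print(len(kernel_16), len(kernel_16[0]))
```

A real patch embedding has one such slice per (output channel, input channel) pair, so the resize is applied to every slice independently.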
- Tiny models
Model | Params (adjustable) | Init Weight | finetune type | Fully adjustable layers | Log | AP @ VOC2007 test |
---|---|---|---|---|---|---|
VOC-YOLOS-Ti | 6M | DeiT-tiny | full | 12 | Log | 23.9 |
VOC-YOLOS-EVA-Ti | 6M | eva02_Ti_pt_in21k_ft_in1k_p14 | full | 12 | Log | 31.9 |
VOC-YOLOS-Ti-0 | 3.7M | DeiT-tiny | ffn | 0 | Log | 9.4 |
VOC-YOLOS-EVA-Ti-0 | 3.7M | eva02_Ti_pt_in21k_ft_in1k_p14 | ffn | 0 | Log | 10.9 |
VOC-YOLOS-EVA-Ti-1 | 4.2M | eva02_Ti_pt_in21k_ft_in1k_p14 | ffn | 1 | Log | 11.6 |
VOC-YOLOS-EVA-Ti-2 | 4.6M | eva02_Ti_pt_in21k_ft_in1k_p14 | ffn | 2 | Log | 12.4 |
VOC-YOLOS-EVA-Ti-3 | 5.1M | eva02_Ti_pt_in21k_ft_in1k_p14 | ffn | 3 | Log | 16.2 |
- Small models
Model | Params (adjustable) | Init Weight | finetune type | Fully adjustable layers | Log | AP @ VOC2007 test |
---|---|---|---|---|---|---|
VOC-YOLOS-S | 23M | DeiT-small | full | 12 | Log | 31.1 |
VOC-YOLOS-EVA-S | 23M | eva02_S_pt_in21k_ft_in1k_p14 | full | 12 | Log | 42.0 |
VOC-YOLOS-S-0 | 15M | DeiT-small | ffn | 0 | Log | 12.2 |
VOC-YOLOS-EVA-S-0 | 15M | eva02_S_pt_in21k_ft_in1k_p14 | ffn | 0 | Log | 21.0 |
VOC-YOLOS-EVA-S-MIM-0 | 15M | eva02_S_pt_in21k_p14 | ffn | 0 | Log | 23.0 |
VOC-YOLOS-EVA-S-1 | 17M | eva02_S_pt_in21k_ft_in1k_p14 | ffn | 1 | Log | 23.4 |
VOC-YOLOS-EVA-S-MIM-1 | 17M | eva02_S_pt_in21k_p14 | ffn | 1 | Log | 24.0 |
VOC-YOLOS-EVA-S-2 | 18M | eva02_S_pt_in21k_ft_in1k_p14 | ffn | 2 | Log | 25.8 |
VOC-YOLOS-EVA-S-MIM-2 | 18M | eva02_S_pt_in21k_p14 | ffn | 2 | Log | 30.7 |
VOC-YOLOS-EVA-S-3 | 20M | eva02_S_pt_in21k_ft_in1k_p14 | ffn | 3 | Log | 32.6 |
VOC-YOLOS-EVA-S-MIM-3 | 20M | eva02_S_pt_in21k_p14 | ffn | 3 | Log | 35.4 |
VOC-YOLOS-EVA-S-attn-0 | 8M | eva02_S_pt_in21k_p14 | attn | 0 | checkpoint / Log | 42.4 |
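The tables distinguish a fine-tune type (full, ffn, attn) and a number of fully adjustable layers. One plausible reading, which is an assumption and not confirmed by this README, is that only the named sublayer is trained in every block while the last N blocks are trained in full. The sketch below expresses that rule over parameter names. The helper `is_trainable` is hypothetical, the `blocks.<i>.attn` / `blocks.<i>.mlp` naming follows the usual timm ViT layout and is also assumed, and keeping non-block parameters trainable is a further assumption.

```python
import re

# Matches names such as "blocks.3.mlp.fc1.weight" (assumed naming scheme).
BLOCK_PATTERN = re.compile(r"^blocks\.(\d+)\.(\w+)\.")


def is_trainable(param_name, finetune_type, num_full_layers, depth=12):
    """Decide whether a parameter is updated under the assumed partial fine-tuning rule."""
    if finetune_type == "full":
        return True
    match = BLOCK_PATTERN.match(param_name)
    if match is None:
        # Assumption: parameters outside the transformer blocks stay trainable.
        return True
    block_index, sublayer = int(match.group(1)), match.group(2)
    if block_index >= depth - num_full_layers:
        # One of the last N blocks: fully adjustable.
        return True
    wanted = {"ffn": "mlp", "attn": "attn"}[finetune_type]
    return sublayer == wanted


names = [
    "patch_embed.proj.weight",
    "blocks.0.attn.qkv.weight",
    "blocks.0.mlp.fc1.weight",
    "blocks.11.attn.qkv.weight",
    "blocks.11.mlp.fc1.weight",
]
# ffn fine-tuning with the last block fully adjustable.
selected = [name for name in names if is_trainable(name, "ffn", 1)]
print(selected)
```

With a PyTorch model, the same predicate could be applied by setting `param.requires_grad = is_trainable(name, ...)` while iterating over `model.named_parameters()`.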
Please refer to the Requirement section of YOLOS here to build the environment.
In addition, you need to install timm and einops for the EVA models:
pip install timm einops
We train on VOC2007 trainval and evaluate on VOC2007 test.
Download and extract Pascal VOC 2007 images and annotations:
# Download the data.
cd $HOME/data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
# Extract the data.
tar -xvf VOCtrainval_06-Nov-2007.tar
tar -xvf VOCtest_06-Nov-2007.tar
You should now see a VOCdevkit folder.
Then run voc2coco.py to convert VOC annotations to COCO format.
python voc2coco.py /path/to/VOCdevkit
Now you should see voc_train.json and voc_val.json.
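To make the conversion step concrete, here is a minimal sketch of how the objects in one VOC XML annotation map to COCO-style annotation entries. It is not the repository's voc2coco.py: the function name, the category mapping, and the sample XML values are all illustrative assumptions. The key change is the box format, from VOC corners (xmin, ymin, xmax, ymax) to COCO [x, y, width, height].

```python
import xml.etree.ElementTree as ET


def voc_objects_to_coco(xml_text, image_id, category_ids, first_annotation_id=1):
    """Convert the <object> entries of one VOC annotation to COCO-style dicts."""
    root = ET.fromstring(xml_text)
    annotations = []
    for offset, obj in enumerate(root.findall("object")):
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        # Assumption: width/height are plain corner differences.
        width, height = xmax - xmin, ymax - ymin
        annotations.append({
            "id": first_annotation_id + offset,
            "image_id": image_id,
            "category_id": category_ids[obj.findtext("name")],
            "bbox": [xmin, ymin, width, height],  # COCO: [x, y, width, height]
            "area": width * height,
            "iscrowd": 0,
        })
    return annotations


# Illustrative annotation (values made up for this example).
sample = """<annotation>
  <filename>example.jpg</filename>
  <object>
    <name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>"""

# Hypothetical category mapping; a real converter defines one id per VOC class.
converted = voc_objects_to_coco(sample, image_id=1, category_ids={"dog": 1})
print(converted[0]["bbox"])
```

A full converter would also emit the `images` and `categories` lists that the COCO JSON format requires.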
We expect the dataset directory structure to be the following:
path/to/dataset/
  annotations/
    voc_train.json
    voc_val.json
  images/
    train/    # VOC 2007 trainval images
    val/      # VOC 2007 test images
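Before launching a long training run, it can help to confirm that the dataset directory matches this layout. The following is a small sketch using only the standard library; the helper name `missing_entries` is hypothetical and the check covers only the four entries listed above.

```python
from pathlib import Path

# Entries expected under the dataset root, taken from the layout above.
EXPECTED_ENTRIES = (
    "annotations/voc_train.json",
    "annotations/voc_val.json",
    "images/train",
    "images/val",
)


def missing_entries(dataset_root):
    """Return the expected files or directories that are absent under dataset_root."""
    root = Path(dataset_root)
    return [entry for entry in EXPECTED_ENTRIES if not (root / entry).exists()]


# Example: report what is missing under a placeholder path.
print(missing_entries("path/to/dataset"))
```

An empty list means all four entries exist; it does not verify that the image folders actually contain the VOC images.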
Before fine-tuning on VOC2007, you need to download the pre-trained model.
To train the original VOC-YOLOS-Ti model on VOC2007, run this command:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 2 --lr 2.5e-5 --epochs 300 --backbone_name tiny --pre_trained path/to/deit-tiny.pth --eval_size 512 --init_pe_size 608 800 --output_dir /output/path/box_model
To train the original VOC-YOLOS-S model on VOC2007, run this command:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --lr 2.5e-5 --epochs 150 --backbone_name small --pre_trained path/to/deit-small-300epoch.pth --eval_size 512 --init_pe_size 512 864 --mid_pe_size 512 864 --output_dir /output/path/box_model
To train the VOC-YOLOS-EVA-Ti model on VOC2007, run this command:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 2 --lr 2.5e-5 --epochs 300 --model_name eva --backbone_name tiny --pre_trained path/to/eva02_Ti_pt_in21k_ft_in1k_p14.pt --eval_size 512 --init_pe_size 608 800 --output_dir /output/path/box_model
To train the VOC-YOLOS-EVA-S model on VOC2007, run this command:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --lr 2.5e-5 --epochs 150 --model_name eva --backbone_name small --pre_trained path/to/eva02_S_pt_in21k_ft_in1k_p14.pt --eval_size 512 --init_pe_size 608 800 --output_dir /output/path/box_model
To apply attention partial fine-tuning to the VOC-YOLOS-EVA-S-MIM model on VOC2007, run this command:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --lr 2.5e-5 --epochs 150 --model_name eva --backbone_name small_mim --use_partial_finetune --partial_finetune_type attn --pre_trained path/to/eva02_S_pt_in21k_p14.pt --eval_size 512 --init_pe_size 608 800 --output_dir /output/path/box_model
To evaluate the VOC-YOLOS-Ti model on VOC2007 test, run:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 2 --backbone_name tiny --eval --eval_size 512 --init_pe_size 608 800 --resume path/to/voc_yolos_ti.pth
To evaluate the VOC-YOLOS-S model on VOC2007 test, run:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --backbone_name small --eval --eval_size 512 --init_pe_size 512 864 --mid_pe_size 512 864 --resume path/to/voc_yolos_s.pth
To evaluate the VOC-YOLOS-EVA-Ti model on VOC2007 test, run:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 2 --model_name eva --backbone_name tiny --eval --eval_size 512 --init_pe_size 608 800 --resume path/to/voc_yolos_eva_ti.pth
To evaluate the VOC-YOLOS-EVA-S model on VOC2007 test, run:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --model_name eva --backbone_name small --eval --eval_size 512 --init_pe_size 608 800 --resume path/to/voc_yolos_eva_s.pth
To evaluate the VOC-YOLOS-EVA-S-attn-0 model on VOC2007 test, run:
python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --model_name eva --backbone_name small_mim --eval --eval_size 512 --init_pe_size 608 800 --resume path/to/voc_eva_s_mim_frozen_attn_0.pth
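The evaluation commands above differ only in a few per-model flags, so they can be generated from a small table. The sketch below is a convenience under stated assumptions, not part of the repository: `MODEL_FLAGS` and `build_eval_command` are hypothetical names, the flag values are copied from the commands above, and the flag order differs slightly from the originals, which assumes main.py parses its flags in an order-independent way (as argparse does).

```python
import shlex

# Flags that differ per model, copied from the evaluation commands above.
MODEL_FLAGS = {
    "VOC-YOLOS-Ti": "--batch_size 2 --backbone_name tiny --init_pe_size 608 800",
    "VOC-YOLOS-S": "--batch_size 1 --backbone_name small"
                   " --init_pe_size 512 864 --mid_pe_size 512 864",
    "VOC-YOLOS-EVA-Ti": "--batch_size 2 --model_name eva --backbone_name tiny"
                        " --init_pe_size 608 800",
    "VOC-YOLOS-EVA-S": "--batch_size 1 --model_name eva --backbone_name small"
                       " --init_pe_size 608 800",
    "VOC-YOLOS-EVA-S-attn-0": "--batch_size 1 --model_name eva"
                              " --backbone_name small_mim --init_pe_size 608 800",
}


def build_eval_command(model, dataset_path, checkpoint, nproc=3):
    """Assemble the evaluation command for one model as an argument list."""
    command = [
        "python", "-m", "torch.distributed.launch",
        f"--nproc_per_node={nproc}", "--use_env", "main.py",
        "--coco_path", dataset_path, "--dataset_file", "voc",
    ]
    command += shlex.split(MODEL_FLAGS[model])
    command += ["--eval", "--eval_size", "512", "--resume", checkpoint]
    return command


command = build_eval_command("VOC-YOLOS-S", "/path/to/dataset", "path/to/voc_yolos_s.pth")
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.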