YOLOS with EVA-02

Introduction

You Only Look at One Sequence (YOLOS) (paper, code) is a series of object detection models based on the vanilla Vision Transformer with the fewest possible modifications, region priors, and inductive biases of the target task. With pre-training on the ImageNet-1k dataset and fine-tuning on the COCO dataset, the performance of YOLOS on COCO object detection can serve as a challenging transfer learning benchmark for evaluating different (label-supervised or self-supervised) pre-training strategies for ViT.

EVA-02 (paper, code) is a series of visual pre-training models based on the ViT architecture and the Masked Image Modeling (MIM) pre-training strategy. After fine-tuning, EVA-02 models outperform previous models on various downstream tasks such as image classification and object detection.

This project applies the YOLOS method to EVA-02 models and evaluates them on the VOC2007 dataset to assess how well EVA-02 pre-trained models transfer to object detection.

Results

Model	Params	Pre-train Epochs	Init Weight	Fine-tune Epochs	Eval Size	YOLOS Checkpoint / Log	AP @ VOC2007 test
`VOC-YOLOS-Ti`	6M	300	DeiT-tiny	300	512	checkpoint / Log	23.9
`VOC-YOLOS-S`	23M	300	DeiT-small	150	512	checkpoint / Log	31.1
`VOC-YOLOS-EVA-Ti`	6M	240+100	EVA02-tiny	300	512	checkpoint / Log	31.9
`VOC-YOLOS-EVA-S`	23M	240+100	EVA02-small	150	512	checkpoint / Log	42.0

Notes:

The Pre-train Epochs of VOC-YOLOS-EVA are 240+100, meaning 240 MIM pre-training epochs followed by 100 IN-1K fine-tuning epochs. In other words, we use the IN-1K fine-tuned EVA-02 weights as the initial checkpoint. We do not use the MIM pre-trained weights directly as the initial weights because VOC2007 is small, which makes it difficult to start training from MIM models that have never seen real images before. Subsequent experiments support this: for the Tiny model, the performance difference between the two initializations is not significant, but for the Small model, the model trained from MIM weights performs poorly.
For EVA models, we interpolate the patch_embed kernel from 14x14 to 16x16. This is useful for object detection, instance segmentation, and semantic segmentation tasks.
The comparison of these results may not be entirely fair, as the EVA models use more data during pre-training (IN-21K).
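
The patch_embed kernel interpolation mentioned in the notes above can be illustrated with a small dependency-free sketch. It resamples one 14x14 kernel slice to 16x16 with bilinear interpolation. The interpolation mode this project actually uses is not stated here, so bilinear is an assumption chosen for simplicity, and `resize_kernel` is a hypothetical helper rather than code from the repository. A real patch-embedding weight has one such 2D slice per (output channel, input channel) pair, and a tensor library's interpolation routine would normally be used instead.

```python
def resize_kernel(kernel, new_size):
    """Bilinearly resample a square 2D kernel (list of lists) to new_size x new_size."""
    old_size = len(kernel)
    scale = old_size / new_size
    out = []
    for i in range(new_size):
        # Map the centre of output row i back into input coordinates, clamped to the grid.
        y = min(max((i + 0.5) * scale - 0.5, 0.0), old_size - 1)
        y0 = int(y)
        y1 = min(y0 + 1, old_size - 1)
        wy = y - y0
        row = []
        for j in range(new_size):
            x = min(max((j + 0.5) * scale - 0.5, 0.0), old_size - 1)
            x0 = int(x)
            x1 = min(x0 + 1, old_size - 1)
            wx = x - x0
            # Blend the four neighbouring input values.
            top = kernel[y0][x0] * (1 - wx) + kernel[y0][x1] * wx
            bottom = kernel[y1][x0] * (1 - wx) + kernel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Toy 14x14 slice standing in for one channel of one patch-embedding filter.
kernel_14 = [[float(r * 14 + c) for c in range(14)] for r in range(14)]
kernel_16 = resize_kernel(kernel_14, 16)
print(len(kernel_16), len(kernel_16[0]))  # 16 16
```

With a 16x16 kernel, a 512-pixel side divides into 32 patches exactly, which is one practical reason a patch size of 16 is convenient for dense prediction.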

Partial finetune results

Tiny models

Model	Params (adjustable)	Init Weight	finetune type	Fully adjustable layers	Log	AP @ VOC2007 test
`VOC-YOLOS-Ti`	6M	DeiT-tiny	full	12	Log	23.9
`VOC-YOLOS-EVA-Ti`	6M	eva02_Ti_pt_in21k_ft_in1k_p14	full	12	Log	31.9
`VOC-YOLOS-Ti-0`	3.7M	DeiT-tiny	ffn	0	Log	9.4
`VOC-YOLOS-EVA-Ti-0`	3.7M	eva02_Ti_pt_in21k_ft_in1k_p14	ffn	0	Log	10.9
`VOC-YOLOS-EVA-Ti-1`	4.2M	eva02_Ti_pt_in21k_ft_in1k_p14	ffn	1	Log	11.6
`VOC-YOLOS-EVA-Ti-2`	4.6M	eva02_Ti_pt_in21k_ft_in1k_p14	ffn	2	Log	12.4
`VOC-YOLOS-EVA-Ti-3`	5.1M	eva02_Ti_pt_in21k_ft_in1k_p14	ffn	3	Log	16.2

Small models

Model	Params (adjustable)	Init Weight	finetune type	Fully adjustable layers	Log	AP @ VOC2007 test
`VOC-YOLOS-S`	23M	DeiT-small	full	12	Log	31.1
`VOC-YOLOS-EVA-S`	23M	eva02_S_pt_in21k_ft_in1k_p14	full	12	Log	42.0
`VOC-YOLOS-S-0`	15M	DeiT-small	ffn	0	Log	12.2
`VOC-YOLOS-EVA-S-0`	15M	eva02_S_pt_in21k_ft_in1k_p14	ffn	0	Log	21.0
`VOC-YOLOS-EVA-S-MIM-0`	15M	eva02_S_pt_in21k_p14	ffn	0	Log	23.0
`VOC-YOLOS-EVA-S-1`	17M	eva02_S_pt_in21k_ft_in1k_p14	ffn	1	Log	23.4
`VOC-YOLOS-EVA-S-MIM-1`	17M	eva02_S_pt_in21k_p14	ffn	1	Log	24.0
`VOC-YOLOS-EVA-S-2`	18M	eva02_S_pt_in21k_ft_in1k_p14	ffn	2	Log	25.8
`VOC-YOLOS-EVA-S-MIM-2`	18M	eva02_S_pt_in21k_p14	ffn	2	Log	30.7
`VOC-YOLOS-EVA-S-3`	20M	eva02_S_pt_in21k_ft_in1k_p14	ffn	3	Log	32.6
`VOC-YOLOS-EVA-S-MIM-3`	20M	eva02_S_pt_in21k_p14	ffn	3	Log	35.4
`VOC-YOLOS-EVA-S-attn-0`	8M	eva02_S_pt_in21k_p14	attn	0	checkpoint / Log	42.4
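
This section does not spell out how the `ffn` and `attn` finetune types are implemented, so the following is only a sketch of one plausible reading: with `ffn` (or `attn`), only the feed-forward (or attention) parameters inside each Transformer block are updated, and everything outside the blocks stays trainable. The `blocks.<i>.attn.` / `blocks.<i>.mlp.` naming follows the timm ViT convention; the function name and the decision to freeze the in-block norm layers are assumptions. The "Fully adjustable layers" column is not modelled because its exact definition is not given here.

```python
def is_trainable(param_name: str, finetune_type: str) -> bool:
    """Decide whether a parameter is updated under partial fine-tuning.

    Assumption: parameters outside the Transformer blocks (patch embedding,
    detection tokens, prediction heads, ...) always stay trainable.
    """
    if finetune_type == "full":
        return True
    parts = param_name.split(".")
    if "blocks" not in parts:
        return True
    idx = parts.index("blocks")
    if len(parts) <= idx + 2:
        return True
    # Sub-module holding this parameter, e.g. "attn" in "blocks.3.attn.proj.weight".
    submodule = parts[idx + 2]
    wanted = {"ffn": "mlp", "attn": "attn"}[finetune_type]
    return submodule == wanted


names = [
    "backbone.blocks.0.attn.proj.weight",
    "backbone.blocks.0.mlp.fc1.weight",
    "backbone.blocks.0.norm1.weight",
    "class_embed.layers.0.weight",
]
for name in names:
    print(name, is_trainable(name, "attn"))
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, "attn")`.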


Requirement

Please refer to the Requirement section of YOLOS here to set up the environment.

You also need to install timm and einops for the EVA models:

pip install timm einops
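
As a quick sanity check after installation, a short script can report which of the required packages are still missing. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper, not part of this repository.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages(["torch", "timm", "einops"])
print("missing:", missing or "none")
```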

Data preparation

We train on VOC2007 trainval and evaluate on VOC2007 test.

Download and extract Pascal VOC 2007 images and annotations:

# Download the data.
cd $HOME/data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtest_06-Nov-2007.tar
# Extract the data.
tar -xvf VOCtrainval_06-Nov-2007.tar
tar -xvf VOCtest_06-Nov-2007.tar

You should now see the VOCdevkit folder.

Then run voc2coco.py to convert VOC annotations to COCO format.

python voc2coco.py /path/to/VOCdevkit

Now you should see voc_train.json and voc_val.json.
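
To give a sense of what the conversion involves, here is a minimal sketch of how the objects in one VOC XML annotation map to COCO-style annotation entries. It is not the repository's voc2coco.py: the function name, the category-id mapping, and the handling of details such as the `difficult` flag or VOC's 1-based pixel coordinates are assumptions or simply left out.

```python
import xml.etree.ElementTree as ET


def voc_objects_to_coco(xml_text, image_id, category_ids, first_ann_id=1):
    """Convert the <object> entries of one VOC annotation into COCO-style dicts."""
    root = ET.fromstring(xml_text)
    annotations = []
    for offset, obj in enumerate(root.iter("object")):
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (
            float(box.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        width, height = xmax - xmin, ymax - ymin
        annotations.append({
            "id": first_ann_id + offset,
            "image_id": image_id,
            "category_id": category_ids[obj.find("name").text],
            # COCO boxes are [x, y, width, height]; VOC stores corner coordinates.
            "bbox": [xmin, ymin, width, height],
            "area": width * height,
            "iscrowd": 0,
        })
    return annotations


# Hand-written example annotation; the category id is illustrative.
example = """<annotation>
  <filename>000001.jpg</filename>
  <object><name>dog</name>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
</annotation>"""
anns = voc_objects_to_coco(example, image_id=1, category_ids={"dog": 12})
print(anns[0]["bbox"])  # [48.0, 240.0, 147.0, 131.0]
```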

We expect the dataset directory structure to be the following:

path/to/dataset/
  annotations/
  	voc_train.json
  	voc_val.json
  images/
  	train/	# VOC 2007 trainval images
  	val/	# VOC 2007 test images
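
Before launching a run, the layout above can be verified with a few lines of Python. This is a minimal sketch: `missing_entries` is a hypothetical helper, and it only checks that the listed files and folders exist, not that their contents are valid.

```python
from pathlib import Path

# Entries taken from the directory structure listed above.
EXPECTED = [
    "annotations/voc_train.json",
    "annotations/voc_val.json",
    "images/train",
    "images/val",
]


def missing_entries(dataset_root):
    """Return the expected files/folders that are absent under dataset_root."""
    root = Path(dataset_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Replace the placeholder with the directory passed to --coco_path.
print(missing_entries("/path/to/dataset"))
```

An empty list means every expected entry was found.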

Training

Before fine-tuning on VOC2007, you need to download the pre-trained model.

To train the original VOC-YOLOS-Ti model on VOC2007, run this command:


python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 2 --lr 2.5e-5 --epochs 300 --backbone_name tiny --pre_trained path/to/deit-tiny.pth --eval_size 512 --init_pe_size 608 800 --output_dir /output/path/box_model

To train the original VOC-YOLOS-S model on VOC2007, run this command:


python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --lr 2.5e-5 --epochs 150 --backbone_name small --pre_trained path/to/deit-small-300epoch.pth --eval_size 512 --init_pe_size 512 864 --mid_pe_size 512 864 --output_dir /output/path/box_model

To train the VOC-YOLOS-EVA-Ti model on VOC2007, run this command:


python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 2 --lr 2.5e-5 --epochs 300 --model_name eva --backbone_name tiny --pre_trained path/to/eva02_Ti_pt_in21k_ft_in1k_p14.pt --eval_size 512 --init_pe_size 608 800 --output_dir /output/path/box_model

To train the VOC-YOLOS-EVA-S model on VOC2007, run this command:


python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --lr 2.5e-5 --epochs 150 --model_name eva --backbone_name small --pre_trained path/to/eva02_S_pt_in21k_ft_in1k_p14.pt --eval_size 512 --init_pe_size 608 800 --output_dir /output/path/box_model

To apply attention-only partial fine-tuning to the VOC-YOLOS-EVA-S-MIM model on VOC2007, run this command:


python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --lr 2.5e-5 --epochs 150 --model_name eva --backbone_name small_mim --use_partial_finetune --partial_finetune_type attn --pre_trained path/to/eva02_S_pt_in21k_p14.pt --eval_size 512 --init_pe_size 608 800 --output_dir /output/path/box_model
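
The training commands above share most of their flags and differ only in a handful of settings (batch size, epochs, model and backbone name, initial weights, position-embedding sizes, and the partial fine-tuning switches). The sketch below assembles such a command from Python to make those differences explicit. `train_command` and `COMMON` are hypothetical helpers, not part of the repository; every flag name and value is copied from the commands in this section.

```python
import shlex

# Flags shared by every training command in this section.
COMMON = [
    "python", "-m", "torch.distributed.launch", "--nproc_per_node=3", "--use_env",
    "main.py", "--dataset_file", "voc", "--lr", "2.5e-5", "--eval_size", "512",
]


def train_command(coco_path, output_dir, **flags):
    """Assemble the argument list for one training run.

    A flag set to True becomes a bare switch; lists become space-separated values.
    """
    args = COMMON + ["--coco_path", coco_path, "--output_dir", output_dir]
    for name, value in flags.items():
        args.append(f"--{name}")
        if value is True:
            continue
        values = value if isinstance(value, (list, tuple)) else [value]
        args.extend(str(v) for v in values)
    return args


# Attention-only partial fine-tuning from the MIM weights, as in the last command above.
cmd = train_command(
    "/path/to/dataset", "/output/path/box_model",
    batch_size=1, epochs=150, model_name="eva", backbone_name="small_mim",
    use_partial_finetune=True, partial_finetune_type="attn",
    pre_trained="path/to/eva02_S_pt_in21k_p14.pt", init_pe_size=[608, 800],
)
print(shlex.join(cmd))
```

The flag order differs from the hand-written commands, which should not matter for a standard command-line parser.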

Evaluation

To evaluate the VOC-YOLOS-Ti model on VOC2007 test, run:

python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 2 --backbone_name tiny --eval --eval_size 512 --init_pe_size 608 800 --resume path/to/voc_yolos_ti.pth

To evaluate the VOC-YOLOS-S model on VOC2007 test, run:

python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --backbone_name small --eval --eval_size 512 --init_pe_size 512 864 --mid_pe_size 512 864 --resume path/to/voc_yolos_s.pth

To evaluate the VOC-YOLOS-EVA-Ti model on VOC2007 test, run:

python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 2 --model_name eva --backbone_name tiny --eval --eval_size 512 --init_pe_size 608 800 --resume path/to/voc_yolos_eva_ti.pth

To evaluate the VOC-YOLOS-EVA-S model on VOC2007 test, run:

python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --model_name eva --backbone_name small --eval --eval_size 512 --init_pe_size 608 800 --resume path/to/voc_yolos_eva_s.pth

To evaluate the VOC-YOLOS-EVA-S-attn-0 model on VOC2007 test, run:

python -m torch.distributed.launch --nproc_per_node=3 --use_env main.py --coco_path /path/to/dataset --dataset_file voc --batch_size 1 --model_name eva --backbone_name small_mim --eval --eval_size 512 --init_pe_size 608 800 --resume path/to/voc_eva_s_mim_frozen_attn_0.pth
