Searching the Search Space of Vision Transformer

This is an official implementation of S3.

In this work, instead of searching for architectures within a predefined search space, we propose, with the help of AutoFormer, to first search the search space itself and automatically find a good one. We then search for architectures within the searched space. In addition, we provide insightful observations and guidelines for general vision transformer design.

(Figure: S3 overview)

Environment Setup

To set up the environment, run the following commands:

conda create -n SSS python=3.6
conda activate SSS
pip install -r requirements.txt
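
As an optional sanity check (assuming the requirements install PyTorch, which the evaluation command below relies on), you can confirm that PyTorch is importable and sees your GPUs before proceeding:

# prints the installed PyTorch version and whether CUDA is usable
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"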

Data Preparation

You need to first download the ImageNet-2012 dataset to the folder ./data/imagenet and place the validation images in the subfolder ./data/imagenet/val. To sort the validation images into class subfolders, you could use the following script: https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
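
For example, assuming the raw validation JPEGs have already been extracted flat into ./data/imagenet/val, a minimal sketch for sorting them with that script is:

cd ./data/imagenet/val
# download the helper script, then run it; it creates one folder per class and moves each image into place
wget https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
bash valprep.sh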

The directory structure should follow the standard layout shown below.

/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg

Model Zoo

For evaluation, we provide the checkpoints and configs of our models.

After downloading the models, you can evaluate them following the description in the Evaluation section below.

Model download links:

| Model | Params. | Top-1 Acc. (%) | Top-5 Acc. (%) | Download |
| --- | --- | --- | --- | --- |
| AutoFormerV2-T | 28M | 82.1 | 95.8 | link/config |
| AutoFormerV2-S | 50M | 83.7 | 96.4 | link/config |
| AutoFormerV2-B | 71M | 84.0 | 96.6 | link/config |
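
After downloading a checkpoint, you can optionally confirm that the file loads as a regular PyTorch checkpoint before running evaluation (a generic check, not specific to this repo; the key names inside the checkpoint may differ):

# loads the checkpoint on CPU and prints its top-level keys
python -c "import torch; ckpt = torch.load('/PATH/TO/CHECKPOINT', map_location='cpu'); print(list(ckpt.keys()) if isinstance(ckpt, dict) else type(ckpt))"

For checkpoints saved by DeiT-style training code, this typically prints a top-level 'model' key holding the state dict.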

Evaluation

To evaluate our trained models, put the downloaded checkpoint at /PATH/TO/CHECKPOINT. You can then test a model with the following command (please change the config file and model checkpoint according to the model you are evaluating; here we use AutoFormerV2-B as an example).

python -m torch.distributed.launch --nproc_per_node=8 --use_env evaluation.py --data-path /PATH/TO/IMAGENET \
--dist-eval --cfg ./config/S3-B.yaml --resume /PATH/TO/CHECKPOINT --eval 
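
If you only have a single GPU, a single-process variant along the following lines should also work (an assumption on our part: evaluation.py follows the usual DeiT-style pattern in which distributed mode is optional, so the launcher and --dist-eval are simply dropped):

# hypothetical single-GPU variant: no distributed launcher, no --dist-eval
python evaluation.py --data-path /PATH/TO/IMAGENET --cfg ./config/S3-B.yaml --resume /PATH/TO/CHECKPOINT --eval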

Performance

We compare S3 with other state-of-the-art methods under different resource constraints in terms of top-1 accuracy on ImageNet. Our method achieves very competitive performance, surpassing the recent DeiT, ViT, and Swin Transformer.

(Figure: Performance comparison on ImageNet)

Bibtex

If this repo is helpful for you, please consider citing it. Thank you! :)

@inproceedings{S3,
  title={Searching the Search Space of Vision Transformer},
  author={Chen, Minghao and Wu, Kan and Ni, Bolin and Peng, Houwen and Liu, Bei and Fu, Jianlong and Chao, Hongyang and Ling, Haibin},
  booktitle={Conference and Workshop on Neural Information Processing Systems (NeurIPS)},
  year={2021}
}

Acknowledgements

The code is inspired by HAT, timm, DeiT, SPOS, AutoFormer, and Swin Transformer.