Add FastViT model #26172

JorgeAV-ai · 2023-09-14T22:53:12Z

What does this PR do?

Fixes #25526
I have seen that the PR is still open and still no PR has been submitted during these weeks so I have decided to open mine once I finish the model structure + testing

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
cc: @amyeroberts

This reverts commit 118dad1.

ArthurZucker · 2023-09-15T14:33:07Z

cc @rafaelpadilla 😉 to keep your eyes on this!

This reverts commit 118dad1.

rafaelpadilla

Hi @JorgeAV-ai,
Nice work! :)
Tests should be all green before we pass it to a core maintainers review. Also noted few conversations that should be resolved.
Please, let me know if you need help with them.

docs/source/en/model_doc/fastvit.md

tests/models/fastvit/test_modeling_fastvit.py

JorgeAV-ai · 2023-10-11T18:22:44Z

I added some questions above. I also noticed that some of your comments might be related to an outdated code. Would you mind taking a look again? Thanks 😊

mjamroz · 2023-10-12T12:28:55Z

src/transformers/models/fastvit/convert_fastvit_timm_to_pytorch.py

+    if "stem" in name:
+        name = name.replace("stem", "embeddings.patch_embeddings.projection")
+    if "conv_kxk" in name:
+        name = name.replace("conv_kxk", "rbr_conv")
+    if "conv_scale" in name:
+        name = name.replace("conv_scale", "rbr_scale")
+    if "identity" in name:
+        name = name.replace("identity", "rbr_skip")
+    if "0.conv" in name:
+        name = name.replace("0.conv", "conv")
+    if "0.bn" in name:
+        name = name.replace("0.bn", "bn")
+    if "stages" in name:
+        name = name.replace("stages", "encoder.layer")
+    if "blocks" in name:
+        name = name.replace("blocks", "stage_conv")
+    if "layer_scale.gamma" in name:
+        name = name.replace("layer_scale.gamma", "layer_scale")
+        name = name.replace("token_mixer", "token_mixer_block.token_mixer")
+    if "layer_scale_1.gamma" in name:
+        name = name.replace("layer_scale_1.gamma", "layer_scale_1")
+    if "layer_scale_2.gamma" in name:
+        name = name.replace("layer_scale_2.gamma", "layer_scale_2")
+    if "token_mixer.norm" in name:
+        name = name.replace("token_mixer.norm", "token_mixer_block.token_mixer.norm")
+    if "token_mixer.mixer" in name:
+        name = name.replace("token_mixer.mixer", "token_mixer_block.token_mixer.mixer")
+    if "mlp" in name:
+        name = name.replace("mlp", "convffn")
+    if ".conv.conv" in name:
+        name = name.replace("conv.conv", "conv")
+    if ".conv.bn" in name:
+        name = name.replace("conv.bn", "bn")
+    if "proj." in name:
+        if "token_mixer" not in name:
+            name_split = name.split(".")
+            pos = int(name_split[2])
+            name_split[2] = str(pos - 1)
+            if int(name_split[5]) == 0:
+                name_split[4] = "reparam_large_conv"
+            else:
+                name_split[4] = "conv"
+            name_split.pop(5)  # drop the 0 or 1....
+            name = ".".join(name_split)
+        else:
+            name = name.replace("token_mixer.proj", "token_mixer_block.attention.proj")
+    if "se.fc1" in name:
+        name = name.replace("se.fc1", "se.reduce")
+    if "se.fc2" in name:
+        name = name.replace("se.fc2", "se.expand")
+    if "q_bias" in name:
+        name = name.replace("q_bias", "query.bias")
+    if "k_bias" in name:
+        name = name.replace("k_bias", "key.bias")
+    if "v_bias" in name:
+        name = name.replace("v_bias", "value.bias")


maybe (but double check if gives the same data):

Suggested change

if "stem" in name:

name = name.replace("stem", "embeddings.patch_embeddings.projection")

if "conv_kxk" in name:

name = name.replace("conv_kxk", "rbr_conv")

if "conv_scale" in name:

name = name.replace("conv_scale", "rbr_scale")

if "identity" in name:

name = name.replace("identity", "rbr_skip")

if "0.conv" in name:

name = name.replace("0.conv", "conv")

if "0.bn" in name:

name = name.replace("0.bn", "bn")

if "stages" in name:

name = name.replace("stages", "encoder.layer")

if "blocks" in name:

name = name.replace("blocks", "stage_conv")

if "layer_scale.gamma" in name:

name = name.replace("layer_scale.gamma", "layer_scale")

name = name.replace("token_mixer", "token_mixer_block.token_mixer")

if "layer_scale_1.gamma" in name:

name = name.replace("layer_scale_1.gamma", "layer_scale_1")

if "layer_scale_2.gamma" in name:

name = name.replace("layer_scale_2.gamma", "layer_scale_2")

if "token_mixer.norm" in name:

name = name.replace("token_mixer.norm", "token_mixer_block.token_mixer.norm")

if "token_mixer.mixer" in name:

name = name.replace("token_mixer.mixer", "token_mixer_block.token_mixer.mixer")

if "mlp" in name:

name = name.replace("mlp", "convffn")

if ".conv.conv" in name:

name = name.replace("conv.conv", "conv")

if ".conv.bn" in name:

name = name.replace("conv.bn", "bn")

if "proj." in name:

if "token_mixer" not in name:

name_split = name.split(".")

pos = int(name_split[2])

name_split[2] = str(pos - 1)

if int(name_split[5]) == 0:

name_split[4] = "reparam_large_conv"

else:

name_split[4] = "conv"

name_split.pop(5) # drop the 0 or 1....

name = ".".join(name_split)

else:

name = name.replace("token_mixer.proj", "token_mixer_block.attention.proj")

if "se.fc1" in name:

name = name.replace("se.fc1", "se.reduce")

if "se.fc2" in name:

name = name.replace("se.fc2", "se.expand")

if "q_bias" in name:

name = name.replace("q_bias", "query.bias")

if "k_bias" in name:

name = name.replace("k_bias", "key.bias")

if "v_bias" in name:

name = name.replace("v_bias", "value.bias")

for name_from, name_to in (

("stem", "embeddings.patch_embeddings.projection"),

("conv_kxk", "rbr_conv"),

("conv_scale", "rbr_scale"),

("identity", "rbr_skip"),

("0.conv", "conv"),

("0.bn", "bn"),

("stages", "encoder.layer"),

("blocks", "stage_conv"),

("layer_scale.gamma", "layer_scale"),

("token_mixer", "token_mixer_block.token_mixer"),

("layer_scale_1.gamma", "layer_scale_1"),

("layer_scale_2.gamma", "layer_scale_2"),

("token_mixer.norm", "token_mixer_block.token_mixer.norm"),

("se.fc1", "se.reduce"),

("se.fc2", "se.expand"),

("q_bias", "query.bias"),

("k_bias", "key.bias"),

("v_bias", "value.bias"),

("token_mixer.mixer", "token_mixer_block.token_mixer.mixer"),

("mlp", "convffn"),

):

name = name.replace(name_from, name_to)

if ".conv.conv" in name:

name = name.replace("conv.conv", "conv")

if ".conv.bn" in name:

name = name.replace("conv.bn", "bn")

if "proj." in name:

if "token_mixer" not in name:

name_split = name.split(".")

pos = int(name_split[2])

name_split[2] = str(pos - 1)

if int(name_split[5]) == 0:

name_split[4] = "reparam_large_conv"

else:

name_split[4] = "conv"

name_split.pop(5) # drop the 0 or 1....

name = ".".join(name_split)

else:

name = name.replace("token_mixer.proj", "token_mixer_block.attention.proj")

Please take into account that I followed the structure that was suggested in the template!! I guess they want it in this way...

We don't necessarily enforce having what is in the template. I would suggest having a single dictionnary that maps old keys to new keys for a more readable code

mjamroz · 2023-10-12T12:30:35Z

src/transformers/models/fastvit/convert_fastvit_timm_to_pytorch.py

+    im = Image.open(requests.get(url, stream=True).raw)
+    return im


Suggested change

im = Image.open(requests.get(url, stream=True).raw)

return im

return Image.open(requests.get(url, stream=True).raw)

mjamroz · 2023-10-12T12:34:16Z

src/transformers/models/fastvit/convert_fastvit_timm_to_pytorch.py

+
+
+def convert_state_dict(orig_state_dict, model):
+    for key in orig_state_dict.copy().keys():


Suggested change

for key in orig_state_dict.copy().keys():

for key in orig_state_dict:

mjamroz · 2023-10-12T12:36:40Z

src/transformers/models/fastvit/convert_fastvit_timm_to_pytorch.py

+        val = orig_state_dict.pop(key)
+        if "mask" in key:
+            continue


convert_state_dict() strips all *mask* elements of orig_state_dict. Is it intended?

rafaelpadilla

Thank you for your contribution! :)
Just a small nit.

Hey @ArthurZucker , I did a few iterations and finished the first pass.
Could you please, take it from now? :)

docs/source/en/model_doc/fastvit.md

ArthurZucker

My first general comment is to use full camel for the class name so FastViT -> FastVit.
My second question is why the inference mode has to be passed. Guessing it's for memory efficiency? But then do you intend to push 2 different checkpoints, 1 for inference the other for trianing?
Also the docstring of the classes referencing to the model don't really help as they don't describe what is happening inside. Let's add a scheme if we can to the FastVit.md with what makes it so fast.
Also the SefAttention layer seems pretty standard, could have some copied from here!

ArthurZucker · 2023-10-19T14:44:36Z

README.md

@@ -356,6 +356,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h
 1. **[ErnieM](https://huggingface.co/docs/transformers/model_doc/ernie_m)** (from Baidu) released with the paper [ERNIE-M: Enhanced Multilingual Representation by Aligning Cross-lingual Semantics with Monolingual Corpora](https://arxiv.org/abs/2012.15674) by Xuan Ouyang, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang.
 1. **[ESM](https://huggingface.co/docs/transformers/model_doc/esm)** (from Meta AI) are transformer protein language models.  **ESM-1b** was released with the paper [Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences](https://www.pnas.org/content/118/15/e2016239118) by Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C. Lawrence Zitnick, Jerry Ma, and Rob Fergus. **ESM-1v** was released with the paper [Language models enable zero-shot prediction of the effects of mutations on protein function](https://doi.org/10.1101/2021.07.09.450648) by Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu and Alexander Rives. **ESM-2 and ESMFold** were released with the paper [Language models of protein sequences at the scale of evolution enable accurate structure prediction](https://doi.org/10.1101/2022.07.20.500902) by Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Allan dos Santos Costa, Maryam Fazel-Zarandi, Tom Sercu, Sal Candido, Alexander Rives.
 1. **[Falcon](https://huggingface.co/docs/transformers/model_doc/falcon)** (from Technology Innovation Institute) by Almazrouei, Ebtesam and Alobeidli, Hamza and Alshamsi, Abdulaziz and Cappelli, Alessandro and Cojocaru, Ruxandra and Debbah, Merouane and Goffinet, Etienne and Heslow, Daniel and Launay, Julien and Malartic, Quentin and Noune, Badreddine and Pannier, Baptiste and Penedo, Guilherme.
+1. **[FastViT](https://huggingface.co/docs/transformers/model_doc/fastvit)** (from Apple) released with the paper [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://arxiv.org/abs/2303.14189) by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.


Suggested change

1. **[FastViT](https://huggingface.co/docs/transformers/model_doc/fastvit)** (from Apple) released with the paper [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://arxiv.org/abs/2303.14189) by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.

1. **[FastViT](https://huggingface.co/docs/transformers/main/model_doc/fastvit)** (from Apple) released with the paper [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://arxiv.org/abs/2303.14189) by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.

before a release the doc should make sur to point to main

ArthurZucker · 2023-10-19T14:45:25Z

docs/source/en/index.md

This seems wrong, we should not have these changes. Make sure remove this change (the table needs to be updated but not the list of supported model)

ArthurZucker · 2023-10-19T14:47:55Z

docs/source/en/model_doc/fastvit.md

+FastViT is a hybrid Transformer with some several modifications, such as replacing denses with a factored version, 
+replace self-attention to large kernel convolutions, with the objective of reducing latency.


Needs to be rephrased and gramarly checked

ArthurZucker · 2023-10-19T14:48:09Z

docs/source/en/model_doc/fastvit.md

+The FastViT model was proposed in [FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization](https://arxiv.org/abs/2303.14189) by Pavan Kumar Anasosalu Vasu, James Gabriel, Jeff Zhu, Oncel Tuzel and Anurag Ranjan.
+FastViT is a hybrid Transformer with some several modifications, such as replacing denses with a factored version, 
+replace self-attention to large kernel convolutions, with the objective of reducing latency.
+The authors claims that FastViT is 3.5× faster than CMT, a recent state-of-the-art hybrid transformer architecture, 


missing a link to CMT

ArthurZucker · 2023-10-19T14:48:27Z

docs/source/en/model_doc/fastvit.md

+FastViT is a hybrid Transformer with some several modifications, such as replacing denses with a factored version, 
+replace self-attention to large kernel convolutions, with the objective of reducing latency.
+The authors claims that FastViT is 3.5× faster than CMT, a recent state-of-the-art hybrid transformer architecture, 
+4.9× faster than EfficientNet, and 1.9× faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset.   


I think efficient net is part of transformers we can also link it!

ArthurZucker · 2023-10-19T15:13:04Z