Influence of the "conv_kernel_size" within the proposed Nystrom Attention #5
Congrats on your great work!

I am verifying your method on vision tasks and have a small concern about the influence of the "conv_kernel_size" of the 2D group convolution in your method, for which you choose relatively large values such as 35.

In vision tasks, a convolution with such a large kernel size is typically used to ensure a large receptive field. However, the proposed Nystrom attention should already be able to model long-range context, just like the original multi-head attention, so I am a little confused about the motivation for this design.

Another important question: should we set num_landmarks equal to the feature-map width, given that image feature maps have a grid structure?

It would be great if you could share your advice on the influence of these parameters!

Nystromformer/code/attention_nystrom.py
Lines 23 to 28 in effde25

Nystromformer/LRA/code/lra_config.py
Lines 46 to 52 in 2bcc280
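For reference while reading the discussion below, here is a minimal, self-contained sketch of the landmark-based approximation proposed in the paper. It omits the conv branch being asked about, and it uses torch.linalg.pinv for brevity where the paper uses an iterative pseudo-inverse approximation; the segment-mean landmark construction assumes the sequence length is divisible by num_landmarks.

```python
import torch
import torch.nn.functional as F

def nystrom_attention(q, k, v, num_landmarks=64):
    # q, k, v: (batch, heads, seq_len, head_dim). The segment-mean landmark
    # construction below assumes seq_len is divisible by num_landmarks.
    b, h, n, d = q.shape
    m = num_landmarks
    scale = d ** -0.5

    # Landmarks: segment means along the sequence dimension.
    q_tilde = q.reshape(b, h, m, n // m, d).mean(dim=-2)
    k_tilde = k.reshape(b, h, m, n // m, d).mean(dim=-2)

    # The three softmax kernels of the Nystrom factorization.
    kernel_1 = F.softmax(q @ k_tilde.transpose(-1, -2) * scale, dim=-1)        # (n, m)
    kernel_2 = F.softmax(q_tilde @ k_tilde.transpose(-1, -2) * scale, dim=-1)  # (m, m)
    kernel_3 = F.softmax(q_tilde @ k.transpose(-1, -2) * scale, dim=-1)        # (m, n)

    # The paper approximates the Moore-Penrose pseudo-inverse iteratively;
    # torch.linalg.pinv stands in for that here.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ v)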
Comments

@PkuRainBow, Thanks for your interest, and sorry for the late reply; there have been several deadlines recently. The basic idea behind this design is that (a) it helps speed up training, and (b) it helps capture local details without requiring many landmarks on language-modeling tasks. For vision tasks, we found that local details are not as important as they are in language modeling, so I recommend reducing the kernel size to a small number or not using the convolution at all. For example, I directly ran a trained T2T-ViT-t-14 model on ImageNet for inference, without retraining, replacing the self-attention part of the T2T module with Nystromformer using conv_kernel_size = 0 and num_landmarks = 64, and it works quite well: 78% top-1 accuracy, versus 73.7% for Performer and 65.3% for Linformer. As for setting num_landmarks, it may depend on your task: the more landmarks you use, the more accurately you approximate standard self-attention. In my experience, num_landmarks = 64 works well for image classification; if you want higher accuracy, you can try increasing it, e.g. to 128.
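A minimal sketch of the depthwise-convolution skip connection on the values as described above, assuming values of shape (batch, heads, seq_len, head_dim). The class name ValueConvBranch is illustrative, and the exact code in attention_nystrom.py may differ; setting conv_kernel_size to 0 disables the branch, as recommended here.

```python
import torch.nn as nn

class ValueConvBranch(nn.Module):
    def __init__(self, num_heads, conv_kernel_size):
        super().__init__()
        self.conv = None
        if conv_kernel_size > 0:
            # One grouped (per-head) conv sliding over the sequence
            # dimension only; odd kernel sizes preserve the sequence
            # length with this padding. This is the branch that adds
            # local detail to the Nystrom approximation.
            self.conv = nn.Conv2d(
                num_heads, num_heads,
                kernel_size=(conv_kernel_size, 1),
                padding=(conv_kernel_size // 2, 0),
                groups=num_heads, bias=False)

    def forward(self, attn_out, v):
        # attn_out, v: (batch, heads, seq_len, head_dim)
        if self.conv is None:
            return attn_out
        return attn_out + self.conv(v)
```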
@yyxiongzju Thanks for your detailed explanation! Based on your comments, we expect your method to be promising for vision-transformer tasks, and we wonder whether you have tried retraining it with DeiT or T2T-ViT after replacing all of the MHSA blocks with the Nystrom scheme. In fact, I tried such a change myself, replacing all the MHSA in DeiT for ImageNet classification, but found that the loss becomes NaN at an early stage of training. It would be great if you could share your thoughts!
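For concreteness, a hypothetical sketch of the kind of swap described above: recursively walking a ViT-style model and replacing each attention module with a Nystrom variant. The isinstance check below matches torch.nn.MultiheadAttention, but DeiT and T2T-ViT define their own attention classes, so it would need to be adjusted; make_nystrom_attn is an assumed user-supplied factory.

```python
import torch.nn as nn

def replace_attention(model: nn.Module, make_nystrom_attn) -> nn.Module:
    """Recursively swap attention modules for a Nystrom-based replacement.

    make_nystrom_attn builds the new module from the old one
    (e.g., copying the embedding dimension and head count).
    """
    for name, child in model.named_children():
        # Adjust this check to the attention class of the model in hand.
        if isinstance(child, nn.MultiheadAttention):
            setattr(model, name, make_nystrom_attn(child))
        else:
            replace_attention(child, make_nystrom_attn)
    return model
```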
@PkuRainBow, I did not retrain DeiT or T2T-ViT with the Nystrom scheme, but I did try it on object detection and it works well. I saw the NaN issue in the original T2T-ViT GitHub repo. Can you run the code I shared to see if it works for you?
@yyxiongzju Thanks for your reply! I will try the shared code soon.