Influence of the "conv_kernel_size" within the proposed Nystrom Attention #5
Congrats on your great work!

I am verifying your method on vision tasks and have a small concern about the influence of the "conv_kernel_size" of the 2D group convolution in your method, for which you choose relatively large values such as 35.

In vision tasks, a convolution with such a large kernel size is typically used to ensure a large receptive field. However, the proposed Nystrom attention should already be able to model long-range context, just like the original multi-head attention, so I am a little confused about the motivation for this design.

Another important question: should we set num_landmarks equal to the feature-map width, given that image feature maps have a grid structure?

It would be great if you could share your advice on the influence of these parameters!

Nystromformer/code/attention_nystrom.py
Lines 23 to 28 in effde25

Nystromformer/LRA/code/lra_config.py
Lines 46 to 52 in 2bcc280
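For reference while reading the discussion below, here is a minimal, self-contained sketch of the landmark-based approximation proposed in the paper. It omits the conv branch being asked about, and it uses torch.linalg.pinv for brevity where the paper uses an iterative pseudo-inverse approximation; the segment-mean landmark construction assumes the sequence length is divisible by num_landmarks.

```python
import torch
import torch.nn.functional as F

def nystrom_attention(q, k, v, num_landmarks=64):
    # q, k, v: (batch, heads, seq_len, head_dim). The segment-mean landmark
    # construction below assumes seq_len is divisible by num_landmarks.
    b, h, n, d = q.shape
    m = num_landmarks
    scale = d ** -0.5

    # Landmarks: segment means along the sequence dimension.
    q_tilde = q.reshape(b, h, m, n // m, d).mean(dim=-2)
    k_tilde = k.reshape(b, h, m, n // m, d).mean(dim=-2)

    # The three softmax kernels of the Nystrom factorization.
    kernel_1 = F.softmax(q @ k_tilde.transpose(-1, -2) * scale, dim=-1)        # (n, m)
    kernel_2 = F.softmax(q_tilde @ k_tilde.transpose(-1, -2) * scale, dim=-1)  # (m, m)
    kernel_3 = F.softmax(q_tilde @ k.transpose(-1, -2) * scale, dim=-1)        # (m, n)

    # The paper approximates the Moore-Penrose pseudo-inverse iteratively;
    # torch.linalg.pinv stands in for that here.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ v)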
Comments

@PkuRainBow, Thanks for your interest, and sorry for the late reply; there have been several deadlines recently. The basic idea behind this design is that (a) it helps speed up training, and (b) it helps capture local details without requiring many landmarks on language-modeling tasks. For vision tasks, we found that local details are not as important as they are in language modeling, so I recommend reducing the kernel size to a small number or not using the convolution at all. For example, I directly ran a trained T2T-ViT-t-14 model on ImageNet for inference, without retraining, replacing the self-attention part of the T2T module with Nystromformer using conv_kernel_size = 0 and num_landmarks = 64, and it works quite well: 78% top-1 accuracy, versus 73.7% for Performer and 65.3% for Linformer. As for setting num_landmarks, it may depend on your task: the more landmarks you use, the more accurately you approximate standard self-attention. In my experience, num_landmarks = 64 works well for image classification; if you want higher accuracy, you can try increasing it, e.g. to 128.
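A minimal sketch of the depthwise-convolution skip connection on the values as described above, assuming values of shape (batch, heads, seq_len, head_dim). The class name ValueConvBranch is illustrative, and the exact code in attention_nystrom.py may differ; setting conv_kernel_size to 0 disables the branch, as recommended here.

```python
import torch.nn as nn

class ValueConvBranch(nn.Module):
    def __init__(self, num_heads, conv_kernel_size):
        super().__init__()
        self.conv = None
        if conv_kernel_size > 0:
            # One grouped (per-head) conv sliding over the sequence
            # dimension only; odd kernel sizes preserve the sequence
            # length with this padding. This is the branch that adds
            # local detail to the Nystrom approximation.
            self.conv = nn.Conv2d(
                num_heads, num_heads,
                kernel_size=(conv_kernel_size, 1),
                padding=(conv_kernel_size // 2, 0),
                groups=num_heads, bias=False)

    def forward(self, attn_out, v):
        # attn_out, v: (batch, heads, seq_len, head_dim)
        if self.conv is None:
            return attn_out
        return attn_out + self.conv(v)
```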
@yyxiongzju Thanks for your detailed explanation! Based on your comments, we expect your method to be promising for vision-transformer tasks, and we wonder whether you have tried retraining it with DeiT or T2T-ViT after replacing all of the MHSA blocks with the Nystrom scheme. In fact, I tried such a change myself, replacing all the MHSA in DeiT for ImageNet classification, but found that the loss becomes NaN at an early stage of training. It would be great if you could share your thoughts!
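For concreteness, a hypothetical sketch of the kind of swap described above: recursively walking a ViT-style model and replacing each attention module with a Nystrom variant. The isinstance check below matches torch.nn.MultiheadAttention, but DeiT and T2T-ViT define their own attention classes, so it would need to be adjusted; make_nystrom_attn is an assumed user-supplied factory.

```python
import torch.nn as nn

def replace_attention(model: nn.Module, make_nystrom_attn) -> nn.Module:
    """Recursively swap attention modules for a Nystrom-based replacement.

    make_nystrom_attn builds the new module from the old one
    (e.g., copying the embedding dimension and head count).
    """
    for name, child in model.named_children():
        # Adjust this check to the attention class of the model in hand.
        if isinstance(child, nn.MultiheadAttention):
            setattr(model, name, make_nystrom_attn(child))
        else:
            replace_attention(child, make_nystrom_attn)
    return model
```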
@PkuRainBow, I did not retrain DeiT or T2T-ViT with the Nystrom scheme, but I did try it on object detection and it works well. I saw the NaN issue in the original T2T-ViT GitHub repo. Can you run the code I shared to see if it works for you?
@yyxiongzju Thanks for your reply! I will try the shared code soon.