Hi, thanks for the significant work.
The current version runs a sliding-window CLIP ViT over each (3, 1120, 1120) input to generate a (1408, 80, 80) feature map. I want to extend it by using per-pixel CLIP features (as in OSM), LSeg-style, and then sampling (pooling) them down to the same feature map shape, with the rest of the model unchanged. I wonder whether this is possible without retraining the whole model, since I see there is some positional encoding involved. Basically, I think sampling from dense CLIP features might yield better results.
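Roughly, the change I have in mind looks like this (a minimal sketch only; the dense encoder, names, and shapes are my assumptions, not the repo's actual API):

```python
# Sketch of the proposed change: take dense per-pixel CLIP features
# (LSeg-style) and pool them down to the same (1408, 80, 80) map that
# the sliding-window path currently produces.
import torch
import torch.nn.functional as F

def pool_dense_clip_features(dense_feat: torch.Tensor,
                             target_hw: tuple = (80, 80)) -> torch.Tensor:
    """dense_feat: (B, 1408, H, W) per-pixel CLIP features, e.g. H = W = 1120.
    Returns (B, 1408, 80, 80), matching the sliding-window feature map."""
    # Adaptive average pooling collapses each (H/80, W/80) cell (14 x 14 pixels
    # for a 1120 x 1120 input) into a single feature vector.
    return F.adaptive_avg_pool2d(dense_feat, target_hw)

# Hypothetical usage; `dense_clip` stands in for an LSeg-style dense encoder:
# dense_feat = dense_clip(pixels)                  # (1, 1408, 1120, 1120)
# feat_map = pool_dense_clip_features(dense_feat)  # (1, 1408, 80, 80)
```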
Looking forward to your response.
PhucNDA.
Not sure if I fully understand your question. Do you mean that instead of sliding-window CLIP, you would like to extract CLIP features at resolution 224 x 224 (or 336 x 336) and then interpolate the feature map to the target resolution?
Based on my experience, that gives inferior performance compared to the sliding-window approach. Besides, if you change any module, I would expect significant performance degradation without fine-tuning, as the feature distribution would change.
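For concreteness, the single-pass-then-interpolate path described above would look roughly like this (a sketch under my own assumptions; `clip_visual`, the patch size, and the token layout are hypothetical, not the repo's code):

```python
# Sketch of the single-pass alternative: encode the image once at the CLIP
# training resolution, then bilinearly upsample the patch grid to 80 x 80.
import torch
import torch.nn.functional as F

def single_pass_clip_features(pixels: torch.Tensor,
                              clip_visual,          # hypothetical ViT encoder
                              clip_res: int = 336,
                              patch: int = 14,
                              target_hw: tuple = (80, 80)) -> torch.Tensor:
    """pixels: (B, 3, 1120, 1120). Returns (B, C, 80, 80)."""
    x = F.interpolate(pixels, size=(clip_res, clip_res),
                      mode="bilinear", align_corners=False)
    tokens = clip_visual(x)               # assumed (B, 1 + N, C), [CLS] first
    grid = clip_res // patch              # e.g. 336 // 14 = 24
    feat = tokens[:, 1:].transpose(1, 2)  # drop [CLS]; (B, C, N)
    feat = feat.reshape(feat.shape[0], -1, grid, grid)
    # Upsampling a coarse 24 x 24 grid to 80 x 80 cannot recover the
    # high-frequency detail the sliding-window pass preserves, which is
    # one reason this path tends to perform worse.
    return F.interpolate(feat, size=target_hw, mode="bilinear",
                         align_corners=False)
```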
Hi @cornettoyu
Have you tried discarding the [CLS] token from the beginning of the embeddings during training? Would it tremendously affect the final performance?
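(What I mean by discarding it, as a sketch; `tokens` is assumed to be the (B, 1 + N, C) ViT output with the class token at index 0:)

```python
patch_tokens = tokens[:, 1:]  # keep only the N spatial patch embeddings
```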
Sorry for the late reply. No, we have not tried discarding the [CLS] token, but I suppose it should not affect the performance.