Hi, thanks for the significant work.
The current version runs a sliding-window CLIP ViT over each (3, 1120, 1120) input to generate a (1408, 80, 80) feature map. I want to extend it by using per-pixel CLIP features (as in OSM), LSeg-style, and then sampling (pooling) them down to the same feature map shape, with the rest of the model unchanged. I wonder whether this is possible without retraining the whole model, since I see there is some positional encoding involved. Basically, I think sampling from dense CLIP features might yield better results.
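Roughly, the change I have in mind looks like this (a minimal sketch only; the dense encoder, names, and shapes are my assumptions, not the repo's actual API):

```python
# Sketch of the proposed change: take dense per-pixel CLIP features
# (LSeg-style) and pool them down to the same (1408, 80, 80) map that
# the sliding-window path currently produces.
import torch
import torch.nn.functional as F

def pool_dense_clip_features(dense_feat: torch.Tensor,
                             target_hw: tuple = (80, 80)) -> torch.Tensor:
    """dense_feat: (B, 1408, H, W) per-pixel CLIP features, e.g. H = W = 1120.
    Returns (B, 1408, 80, 80), matching the sliding-window feature map."""
    # Adaptive average pooling collapses each (H/80, W/80) cell (14 x 14 pixels
    # for a 1120 x 1120 input) into a single feature vector.
    return F.adaptive_avg_pool2d(dense_feat, target_hw)

# Hypothetical usage; `dense_clip` stands in for an LSeg-style dense encoder:
# dense_feat = dense_clip(pixels)                  # (1, 1408, 1120, 1120)
# feat_map = pool_dense_clip_features(dense_feat)  # (1, 1408, 80, 80)
```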
Looking forward to your response.
PhucNDA.
Not sure if I fully understand your question. Do you mean that instead of sliding-window CLIP, you would like to extract CLIP features at resolution 224 x 224 (or 336 x 336) and then interpolate the feature map to the target resolution?
Based on my experience, that gives inferior performance compared to the sliding-window approach. Besides, if you change any module, I would expect significant performance degradation without fine-tuning, as the feature distribution would change.
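For concreteness, the single-pass-then-interpolate path described above would look roughly like this (a sketch under my own assumptions; `clip_visual`, the patch size, and the token layout are hypothetical, not the repo's code):

```python
# Sketch of the single-pass alternative: encode the image once at the CLIP
# training resolution, then bilinearly upsample the patch grid to 80 x 80.
import torch
import torch.nn.functional as F

def single_pass_clip_features(pixels: torch.Tensor,
                              clip_visual,          # hypothetical ViT encoder
                              clip_res: int = 336,
                              patch: int = 14,
                              target_hw: tuple = (80, 80)) -> torch.Tensor:
    """pixels: (B, 3, 1120, 1120). Returns (B, C, 80, 80)."""
    x = F.interpolate(pixels, size=(clip_res, clip_res),
                      mode="bilinear", align_corners=False)
    tokens = clip_visual(x)               # assumed (B, 1 + N, C), [CLS] first
    grid = clip_res // patch              # e.g. 336 // 14 = 24
    feat = tokens[:, 1:].transpose(1, 2)  # drop [CLS]; (B, C, N)
    feat = feat.reshape(feat.shape[0], -1, grid, grid)
    # Upsampling a coarse 24 x 24 grid to 80 x 80 cannot recover the
    # high-frequency detail the sliding-window pass preserves, which is
    # one reason this path tends to perform worse.
    return F.interpolate(feat, size=target_hw, mode="bilinear",
                         align_corners=False)
```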
Hi @cornettoyu
Have you tried discarding the [CLS] token from the beginning of the embeddings during training? Would it tremendously affect the final performance?
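(What I mean by discarding it, as a sketch; `tokens` is assumed to be the (B, 1 + N, C) ViT output with the class token at index 0:)

```python
patch_tokens = tokens[:, 1:]  # keep only the N spatial patch embeddings
```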
Sorry for the late reply. No, we have not tried discarding the [CLS] token, but I suppose it should not affect the performance.