Thanks for the wonderful paper and repo.

I was able to reproduce MaskCLIP and MaskCLIP+ with ViT-B/16 + R101 on the Pascal Context dataset; the resulting mAP is 25.45 and 29.48 respectively.
However, when I switch the model to ViT-B/32 or ViT-L/14, the results are poor (less than half of the ViT-B/16 score), and the qualitative results show that the predicted dense labels are generally a mess.
What I did was (rough sketches of both steps are included after this list):

- convert the backbone weights and extract the text embeddings for ViT-B/32 and ViT-L/14
- create a config based on the ViT-B/16 one, with these modifications:
  - change the patch size to 32 for ViT-B/32
  - change the patch size to 14, embed_dims to 1024, and num_layers to 24 for ViT-L/14
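For reference, this is roughly how I extracted the text embeddings for the other CLIP variants. It is a minimal sketch using the official OpenAI `clip` package rather than the repo's own script; the class list and output path are placeholders.

```python
# Minimal sketch of the text-embedding extraction step, assuming the official
# OpenAI `clip` package; class names and the output path are placeholders.
import torch
import clip

model, _ = clip.load("ViT-L/14", device="cpu")  # or "ViT-B/32"

# Placeholder subset of the 59 Pascal Context class names.
class_names = ["aeroplane", "bicycle", "bird"]
prompts = [f"a photo of a {name}." for name in class_names]

with torch.no_grad():
    tokens = clip.tokenize(prompts)
    text_embeddings = model.encode_text(tokens).float()
    # L2-normalize, as is standard for CLIP text embeddings.
    text_embeddings /= text_embeddings.norm(dim=-1, keepdim=True)

# Note: ViT-L/14 produces 768-d text embeddings, while ViT-B/32 and ViT-B/16
# produce 512-d embeddings.
torch.save(text_embeddings, "text_embeddings_vit_l14.pth")  # placeholder path
```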
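And this is roughly what my ViT-L/14 config modification looks like, written against an mmsegmentation-style config. The base-config filename and the decode_head fields are placeholders from my local setup, not necessarily the repo's exact names; only patch_size, embed_dims, and num_layers are the changes listed above.

```python
# Sketch of the config derived from the ViT-B/16 one for ViT-L/14.
_base_ = './maskclip_vit16_520x520_pascal_context_59.py'  # placeholder name

model = dict(
    backbone=dict(
        patch_size=14,    # 32 for the ViT-B/32 config, 14 for ViT-L/14
        embed_dims=1024,  # 768 for the ViT-B variants, 1024 for ViT-L/14
        num_layers=24,    # 12 for the ViT-B variants, 24 for ViT-L/14
    ),
    decode_head=dict(
        in_channels=1024,  # assumption from my setup: kept consistent with embed_dims
    ),
)
```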
Is there anything I've done wrong or misunderstood? Do you have any suggestions as to why the results are so bad?
Thanks in advance.