🚨 fix(SigLip): remove spurious exclusion of first vision output token #30952
Conversation
There's another PR which suggested using the pooled output rather than averaging the patch tokens: #30373. The reason I went with the latter was this paper (a follow-up paper by the ViT authors showing that it works better than the CLS token). However, averaging patch tokens might work better for ViTs pre-trained with (self-)supervised image classification. For contrastive models like CLIP and SigLIP, perhaps it makes more sense to use the pooled output. I'm happy to change it, although we would need to make it backwards compatible.
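For illustration, here is a minimal sketch of the two pooling options being discussed; the tensor names and shapes below are assumptions for the example, not the library's exact code:

```python
import torch

# Hypothetical outputs from a SigLIP-style vision tower (no [CLS] token).
batch, num_patches, hidden = 2, 196, 768
last_hidden_state = torch.randn(batch, num_patches, hidden)  # all patch tokens
pooler_output = torch.randn(batch, hidden)                   # output of the MAP (attention-pooling) head

# Option A: average every patch token (what this PR keeps, minus the off-by-one).
pooled_by_mean = last_hidden_state.mean(dim=1)               # (batch, hidden)

# Option B: reuse the model's own pooled output instead of re-pooling.
pooled_by_head = pooler_output                                # (batch, hidden)

# Either tensor could then feed a classification head, e.g. nn.Linear(hidden, num_labels).
```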
Thanks for fixing this!
Only thing we need to do is run the model slow tests. Pushing the following should trigger the necessary CI run:
git commit --allow-empty -m "[run-slow] siglip"
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@transmissions11 Any update on this?
As this is a breaking change, I'm wondering whether we should just go for the pooled output anyway? It would, however, break any existing fine-tunes, so we might need to add a flag for backwards compatibility.
@NielsRogge It might be breaking, but the previous logic was wrong -- there's no reason to exclude the first token. We don't keep backwards compatibility for bugs, so I'm not sure it makes sense to have backwards compatibility here for the token. If we did add the pooled logic, then yes, we should have a flag. I'd do this in a separate PR.
…huggingface#30952) fix(SigLip): remove spurious exclusion of first vision output token in classifier
Looks like the first token is spuriously excluded from the average pooling because this code was copy-pasted from modeling_clip, which does prepend a [CLS] token. However, SigLIP has no such special token, so we are needlessly excluding a real patch token from the average. P.S. Why not use the pooled_output from the model's MAP head instead of averaging the last hidden states?
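A rough sketch of the fix described above (simplified and hypothetical, not the exact transformers source):

```python
import torch

# `last_hidden_state` from a SigLIP vision tower, which, unlike CLIP,
# prepends no [CLS] token: every position is a real patch token.
last_hidden_state = torch.randn(2, 196, 768)  # (batch, num_patches, hidden)

# Before: the first token is dropped, a habit inherited from the CLIP
# classifier, where index 0 really is a [CLS] token.
pooled_buggy = last_hidden_state[:, 1:, :].mean(dim=1)

# After: average over all tokens, since none of them is special.
pooled_fixed = last_hidden_state.mean(dim=1)
```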