Make special image tokens attribute of tokenizer #31967
Conversation
Thanks for opening this PR @zucchini-nlp! I'm not sure this is something we want to add as a property for all tokenizers: they predominantly work with text. In this regard, it makes sense to talk about "end of sentence", as pretty much all tokenizers work with text in sentence structures. However, image tokens aren't a general property of text data (what does it mean for a BERT tokenizer to have an image token?). I would instead try to restructure such that these are only added to certain tokenizers, e.g. bundled as a mixin which we can selectively add.
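A minimal sketch of the mixin idea floated above: the VLM-specific special tokens live in a mixin that is only composed into tokenizers that need it, so text-only tokenizers stay untouched. All names here (`VLMSpecialTokensMixin`, `set_vlm_special_tokens`, the stand-in `SimpleTokenizer`) are illustrative assumptions, not the actual transformers API.

```python
class SimpleTokenizer:
    """Stand-in for a plain text tokenizer with a minimal vocab."""

    def __init__(self, vocab):
        self.vocab = vocab

    def convert_tokens_to_ids(self, token):
        # Return -1 for unknown tokens, mimicking an UNK fallback.
        return self.vocab.get(token, -1)


class VLMSpecialTokensMixin:
    """Adds VLM special tokens (image/video/etc.) as proper attributes."""

    VLM_SPECIAL_TOKENS = ("image_token", "video_token", "boi_token")

    def set_vlm_special_tokens(self, **tokens):
        for name, value in tokens.items():
            if name not in self.VLM_SPECIAL_TOKENS:
                raise ValueError(f"Unknown VLM special token: {name}")
            setattr(self, name, value)

    @property
    def image_token_id(self):
        # Resolved lazily through the host tokenizer's vocab lookup.
        return self.convert_tokens_to_ids(self.image_token)


class VLMTokenizer(VLMSpecialTokensMixin, SimpleTokenizer):
    """Only VLM tokenizers opt in to the mixin."""


tok = VLMTokenizer({"<image>": 32000, "hello": 1})
tok.set_vlm_special_tokens(image_token="<image>")
print(tok.image_token, tok.image_token_id)  # <image> 32000
```

A plain `SimpleTokenizer` never gains an `image_token` attribute, which is the point of keeping the abstraction un-leaky: only classes composed with the mixin expose the VLM vocabulary.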
I also suggest adding an
@amyeroberts I see what you mean, but going that way will require some indicator that the tokenizer is loaded from a VLM, because we load the same tokenizer classes as LMs. I can give it a try and add an indicator through the config file, and depending on that add the new special-token mixin. Will ping you when it's ready. @leloykun in case it goes into its own VLMSpecialTokensMixin, I can add
@zucchini-nlp OK - I'm open to other solutions - a mixin might not be a good idea! I just want to avoid leakiness in our abstractions.
Thanks for this @zucchini-nlp btw! |
Hey! I think we should be careful, as @amyeroberts says here.
IMO the best would be to have a VlmTokenizer class, which we wrap around the tokenizer class used. We have a list of SPECIAL_TOKENS_ATTRIBUTES that can be passed to it, taken from the config as allowed, accessible, or just special tokens.
We could have getters and setters automatically inferred / have a way to correctly set the id of the token so that it does not conflict.
One current limitation is that setting the id of a token does not really check its existence in the vocab, so you get edge cases. Good time to fix them!
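The wrapper idea above could look roughly like this: a `VlmTokenizer` that wraps an existing tokenizer, only accepts a whitelisted set of special-token attributes, and validates on assignment that the token actually exists in the vocab (closing the edge case just mentioned). Everything here is a hypothetical sketch with a stand-in tokenizer; the class and attribute names are assumptions, not the transformers API.

```python
SPECIAL_TOKENS_ATTRIBUTES = ("image_token", "video_token")


class TinyTokenizer:
    """Stand-in for a real tokenizer."""

    def __init__(self, vocab):
        self.vocab = vocab

    def tokenize(self, text):
        return text.split()


class VlmTokenizer:
    """Wraps a tokenizer; exposes validated VLM special-token attributes."""

    def __init__(self, tokenizer, **special_tokens):
        self._tokenizer = tokenizer
        for name, token in special_tokens.items():
            if name not in SPECIAL_TOKENS_ATTRIBUTES:
                raise ValueError(f"{name!r} is not an allowed special-token attribute")
            # Reject tokens missing from the vocab instead of silently
            # mapping them to UNK later.
            if token not in tokenizer.vocab:
                raise ValueError(f"{token!r} is not in the tokenizer vocab")
            setattr(self, name, token)

    def __getattr__(self, name):
        # Delegate anything we don't define to the wrapped tokenizer,
        # so the wrapper stays a drop-in replacement.
        return getattr(self._tokenizer, name)


tok = VlmTokenizer(TinyTokenizer({"<image>": 32000, "hi": 1}), image_token="<image>")
print(tok.image_token)          # <image>
print(tok.tokenize("hi there")) # delegated to the wrapped tokenizer
```

Constructing the wrapper with a token that is absent from the vocab raises immediately, which is one way to fix the "setting the id does not check existence" edge case.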
What does this PR do?
This PR adds the three most commonly used special tokens in VLMs as attributes on the tokenizer, following up on #31534 (comment). Now we don't have to pass the special tokens as init kwargs or hardcode them in processing code; we can simply call `self.tokenizer.image_token`.

The PR only adds the attribute in the tokenizer code; all the changes for specific VLMs will go in another PR. I believe only the Idefics models use special tokens within processing, and they will require BC handling. Other models will benefit from this feature after the VLM processor refactor is merged.
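To illustrate what the PR enables in processing code, here is a before/after sketch: instead of hardcoding `"<image>"` (or threading it through init kwargs), the processor reads it off the tokenizer attribute. The `VlmProcessor` and `FakeTokenizer` classes are stand-ins invented for this example.

```python
class FakeTokenizer:
    """Stand-in tokenizer that already carries the new attribute."""

    image_token = "<image>"

    def __call__(self, text):
        return text.split()


class VlmProcessor:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    def __call__(self, text, num_image_tokens=1):
        # Before: "<image>" had to be hardcoded here or passed as an init kwarg.
        # After: just read it from the tokenizer attribute.
        prompt = self.tokenizer.image_token * num_image_tokens + " " + text
        return self.tokenizer(prompt)


proc = VlmProcessor(FakeTokenizer())
print(proc("describe this image", num_image_tokens=2))
```

The processor no longer needs to know any model-specific token strings, which is what makes the later per-model refactor mostly mechanical.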