
Make special image tokens attribute of tokenizer #31967

Closed

Conversation

@zucchini-nlp (Member) commented on Jul 15, 2024

What does this PR do?

This PR adds the three most commonly used special tokens in VLMs as attributes of the tokenizer, following on #31534 (comment). Now we don't have to pass the special tokens as init kwargs or hardcode them in processing code; we can simply call self.tokenizer.image_token.

The PR only adds the attribute in the tokenizer code; all the changes for specific VLMs will go in another PR. I believe it's only the Idefics models that use special tokens within processing, and that will require BC handling.

Other models will benefit from this feature after VLM processor refactor is merged.
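The intended usage can be sketched as follows. This is an illustrative sketch only: the mixin and attribute names (image_token, boi_token, eoi_token) are assumptions for this example, not necessarily the exact names added by the PR.

```python
# Illustrative sketch: attribute names below are assumptions, not
# necessarily what the PR adds to the tokenizer.

class SpecialImageTokensMixin:
    """Expose commonly used VLM special tokens as tokenizer attributes."""

    def __init__(self, image_token=None, boi_token=None, eoi_token=None, **kwargs):
        self.image_token = image_token  # placeholder token for image patches
        self.boi_token = boi_token      # beginning-of-image marker
        self.eoi_token = eoi_token      # end-of-image marker
        super().__init__(**kwargs)


class ToyTokenizer(SpecialImageTokensMixin):
    """Minimal stand-in for a real tokenizer class."""
    pass


tok = ToyTokenizer(image_token="<image>")
# Processing code can read the token from the tokenizer instead of
# hardcoding it or threading it through init kwargs:
prompt = f"{tok.image_token}\nWhat is in this picture?"
```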

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@amyeroberts (Collaborator)

Thanks for opening this PR @zucchini-nlp!

I'm not sure this is something we want to add as a property for all tokenizers: they predominantly work with text. In this regard, it makes sense to talk about "end of sentence", as pretty much all tokenizers work with text in sentence structures. However, image tokens aren't a general property of text data (what does it mean for a BERT tokenizer to have an image token?).

I would instead try to restructure this so that these attributes are only added to certain tokenizers, e.g. bundled as a mixin which we can selectively add.

@leloykun (Contributor) commented on Jul 15, 2024

I also suggest adding an image_token_ids attribute that contains the ids of all the VQVAE image tokens (excluding the start & end marker tokens)
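A rough sketch of what such an attribute could contain, assuming the VQVAE codebook tokens occupy their own entries in the vocab and the boi/eoi markers are separate tokens. All names, token strings, and the toy vocabulary below are hypothetical:

```python
# Hypothetical sketch: assumes codebook tokens are named "<img_0>" ...
# "<img_N>" and that the start/end markers are distinct tokens.

def build_image_token_ids(vocab, n_codes, boi="<boi>", eoi="<eoi>"):
    """Collect ids of the VQVAE codebook tokens, excluding boi/eoi markers."""
    ids = [vocab[f"<img_{i}>"] for i in range(n_codes)]
    assert vocab[boi] not in ids and vocab[eoi] not in ids
    return ids

# Toy vocabulary standing in for a real tokenizer's get_vocab():
vocab = {"<boi>": 0, "<eoi>": 1, **{f"<img_{i}>": 2 + i for i in range(8)}}
image_token_ids = build_image_token_ids(vocab, n_codes=8)
```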

@zucchini-nlp (Member, Author)

@amyeroberts I see what you mean, but going your way will require some indicator that the tokenizer is loaded from a VLM, because we load the same tokenizer classes as for LMs. I can give it a try: add an indicator through the config file and, depending on that, add the new special-token mixin. Will ping you when it's ready

@leloykun in case it goes into its own VLMSpecialTokensMixin, I can add image_token_ids, sure

@amyeroberts (Collaborator)

@zucchini-nlp OK - I'm open to other solutions - a mixin might not be a good idea! I just want to avoid leakiness in our abstractions.

@leloykun (Contributor)

Thanks for this @zucchini-nlp btw!

Currently I have to do something like this:
[screenshot of a manual workaround]
which is a PITA

@ArthurZucker (Collaborator) left a comment


Hey! I think we should be careful, as @amyeroberts says here.
IMO the best would be to have a VlmTokenizer class, which we wrap around the tokenizer class used. We would have a list of SPECIAL_TOKENS_ATTRIBUTES that can be passed to it, taken from the config, as allowed or accessible or just special tokens.

We could have getters and setters inferred automatically, and have a way to correctly set the id of the token so that it does not conflict.

One current limitation is that setting the id of the token does not actually check its existence in the vocab, so you get edge cases. Good time to fix them!
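The vocab check Arthur describes could look roughly like this. The class, method names, and the toy tokenizer below are hypothetical, not the transformers API:

```python
# Hypothetical sketch of a VlmTokenizer wrapper whose token-id getter
# validates that the special token actually exists in the vocabulary,
# instead of silently returning a stale or conflicting id.

class VlmTokenizer:
    def __init__(self, tokenizer, image_token="<image>"):
        self._tokenizer = tokenizer
        self.image_token = image_token

    @property
    def image_token_id(self):
        vocab = self._tokenizer.get_vocab()
        if self.image_token not in vocab:
            raise ValueError(
                f"{self.image_token!r} is not in the vocabulary; add it "
                "as a special token before using it."
            )
        return vocab[self.image_token]


class ToyTokenizer:
    """Minimal stand-in for a real tokenizer."""
    def get_vocab(self):
        return {"<image>": 32000, "hello": 1}


wrapped = VlmTokenizer(ToyTokenizer())
```

With the toy vocab above, `wrapped.image_token_id` returns 32000, while wrapping with an unknown token (e.g. `image_token="<video>"`) raises a ValueError instead of yielding an invalid id.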


This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Aug 24, 2024
5 participants