About choice of Tokenizers #14
Comments
It's on the roadmap, but it won't be covered by the GSoC project. I'm considering wrapping the Rust implementation in huggingface/tokenizers with BinaryBuilder.jl.
This project needs more love from the community. Looks like progress has stalled? @chengchingwen
Yes. I'd love to see more people contributing to this project. Currently I'm quite busy (working and studying) and therefore
Some updates/thoughts about tokenizers:
I implemented a word-piece tokenizer in native Julia, named BertWordPieceTokenizer, and registered it on JuliaHub. Currently it can load word-piece tokenizers, e.g. BERT, RoBERTa, but it fails for SentencePiece tokenizers such as ALBERT.
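For anyone following along who hasn't looked at word-piece tokenization before, the core of it is a greedy longest-match-first lookup against a vocabulary. Below is a rough Python sketch of that idea only. It is not the code of BertWordPieceTokenizer or Transformers.jl; the function name `wordpiece_tokenize`, the toy vocabulary, and the defaults (`##` continuation prefix and `[UNK]` token, following the BERT convention) are assumptions made for illustration.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]", prefix="##", max_chars=100):
    """Split one whitespace-delimited word into word pieces.

    Hypothetical helper for illustration: greedy longest-match-first,
    assuming `vocab` is a set of strings and that non-initial pieces
    are stored with the `prefix` marker.
    """
    if len(word) > max_chars:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (made up for this example).
toy_vocab = {"un", "##aff", "##able", "[UNK]"}
print(wordpiece_tokenize("unaffable", toy_vocab))
print(wordpiece_tokenize("xyz", toy_vocab))
```

A SentencePiece model such as ALBERT's works from a different model file and segmentation scheme, so a vocabulary lookup like this one would not be enough to load it.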
@SeanLee97 Actually, we already have a word-piece tokenizer in Transformers.jl. See here and here.
@chengchingwen What should we do about the Hugging Face BERT tokenizer? Is that in the plan?