issues with accents on convert_ids_to_tokens() #35
Comments
This is expected behaviour and is how the multilingual and the uncased models were trained. From the original repo:
"Yes this is expected."
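To illustrate why the accent disappears, here is a minimal sketch of the kind of accent stripping applied during uncased preprocessing. It assumes the approach of the original BERT tokenization code (Unicode NFD decomposition, then dropping combining marks); the function name `strip_accents` is only illustrative and is not the library's API.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Illustrative helper (hypothetical name): remove accents from text."""
    # NFD splits an accented character into its base letter followed by
    # one or more combining marks, e.g. "é" -> "e" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have Unicode category "Mn" (Mark, nonspacing);
    # dropping them leaves only the base letters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("café"))  # prints "cafe"
```

Because this normalization happens before the text is mapped to vocabulary ids, the accent is already gone by the time convert_ids_to_tokens() is called, so it cannot be recovered from the ids. Using a cased checkpoint with lowercasing disabled is the usual way to keep accents, though that depends on the model you need.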
Hello, the BertTokenizer seems to lose accents when convert_ids_to_tokens() is used:
Example:
Here the problem is "cafe", which has lost its accent. I'm using BertTokenizer.from_pretrained('Bert-base-multilingual') as the tokenizer. I also tried "Bert-base-uncased" and saw the same issue.
Thanks for this great work!