
issues with accents on convert_ids_to_tokens() #35

Closed
perezjln opened this issue Nov 18, 2018 · 2 comments

Comments

@perezjln

Hello, the BertTokenizer seems to lose accents when convert_ids_to_tokens() is used:

Example:

  • original sentence: "great breakfasts in a nice furnished cafè, slightly bohemian."
  • corresponding list of tokens produced: ['great', 'breakfast', '##s', 'in', 'a', 'nice', 'fur', '##nis', '##hed', 'cafe', ',', 'slightly', 'bohemia', '##n', '.']

The problem is that "cafè" loses its accent. I'm using BertTokenizer.from_pretrained('Bert-base-multilingual') as the tokenizer; I also tried "Bert-base-uncased" and experienced the same issue.

Thanks for this great work!

@elyase
Contributor

elyase commented Nov 18, 2018

This is expected behaviour and is how the multilingual and the uncased models were trained. From the original repo:

We are releasing the BERT-Base and BERT-Large models from the paper. Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers.
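For reference, the accent stripping in BERT's uncased BasicTokenizer boils down to Unicode NFD normalization followed by dropping combining marks. A minimal stdlib-only sketch of that step (the function name here is illustrative, not the library's API):

```python
import unicodedata

def strip_accents(text):
    # Decompose characters into base character + combining marks (NFD),
    # then drop the combining marks (Unicode category "Mn").
    # This mirrors the approach used by BERT's uncased tokenizer.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("cafè"))  # -> cafe
```

So by the time ids are produced, the accented form is already gone from the vocabulary lookup, and convert_ids_to_tokens() can only return the unaccented token.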

@thomwolf
Member

Yes this is expected.
