
issues with accents on convert_ids_to_tokens() #35

Closed
perezjln opened this issue Nov 18, 2018 · 2 comments

Comments

@perezjln

Hello, the BertTokenizer seems to lose accents when convert_ids_to_tokens() is used:

Example:

  • original sentence: "great breakfasts in a nice furnished cafè, slightly bohemian."
  • corresponding list of tokens produced: ['great', 'breakfast', '##s', 'in', 'a', 'nice', 'fur', '##nis', '##hed', 'cafe', ',', 'slightly', 'bohemia', '##n', '.']

The problem is that "cafè" loses its accent. I'm using BertTokenizer.from_pretrained('Bert-base-multilingual') as the tokenizer; I also tried "Bert-base-uncased" and experienced the same issue.

Thanks for this great work!

@elyase
Contributor

elyase commented Nov 18, 2018

This is expected behaviour and is how the multilingual and the uncased models were trained. From the original repo:

We are releasing the BERT-Base and BERT-Large models from the paper. Uncased means that the text has been lowercased before WordPiece tokenization, e.g., John Smith becomes john smith. The Uncased model also strips out any accent markers.
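For reference, the accent stripping in BERT's uncased BasicTokenizer boils down to Unicode NFD normalization followed by dropping combining marks. A minimal stdlib-only sketch of that step (the function name here is illustrative, not the library's API):

```python
import unicodedata

def strip_accents(text):
    # Decompose characters into base character + combining marks (NFD),
    # then drop the combining marks (Unicode category "Mn").
    # This mirrors the approach used by BERT's uncased tokenizer.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("cafè"))  # -> cafe
```

So by the time ids are produced, the accented form is already gone from the vocabulary lookup, and convert_ids_to_tokens() can only return the unaccented token.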

@thomwolf
Member

Yes this is expected.
