transformer_qa reader raises ValueError for multilingual datasets #4052

MaksymDel · 2020-04-07T11:46:23Z

Hi, I am trying to run the transformer_qa model with xlm_roberta_base, but I am hitting this block:
https://github.com/allenai/allennlp-models/blob/336b757fd3687e6a08676f9dd3e9f1625836c73a/allennlp_models/rc/common/reader_utils.py#L84-L85

which is called from this block:
https://github.com/allenai/allennlp-models/blob/336b757fd3687e6a08676f9dd3e9f1625836c73a/allennlp_models/rc/transformer_qa/transformer_squad.py#L171-L178

and I get a ValueError while reading instance no. 6007 (so many instances before it are read successfully).

It would be very helpful if you could give any pointers on what could be wrong. The only change I've made is setting transformer_model to xlm-roberta-base.

Here is an example of how XLM-R tokenizes a sentence compared to RoBERTa:

    >>> from allennlp.data.tokenizers import PretrainedTransformerTokenizer
    >>> xr = PretrainedTransformerTokenizer("xlm-roberta-base", add_special_tokens=False, calculate_character_offsets=True)
    >>> r = PretrainedTransformerTokenizer("roberta-base", add_special_tokens=False, calculate_character_offsets=True)
    >>> xr.tokenize("a am writingcode evety day")
    [▁a, ▁am, ▁writing, code, ▁ev, ety, ▁day]
    >>> r.tokenize("a am writingcode evety day")
    [a, Ġam, Ġwriting, code, Ġeve, ty, Ġday]

Plain RoBERTa runs without problems.


MaksymDel · 2020-04-09T07:37:20Z

Update: both XLM-R and mBERT also raise this ValueError on the MLQA dataset. Approximately 100 instances are affected.

MaksymDel · 2020-04-18T18:06:01Z

I just skipped these instances and trained the models. Skipping them did not affect the scores, so the issue is not critical to the multilingual models' performance. I assume the transformers repo also skips examples like this.

I don't think much can be done here apart from skipping corrupted instances instead of raising an exception in the code.
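A minimal sketch of that skip-instead-of-raise approach, for illustration only. The names `skip_bad_examples`, `make_instance`, and `toy_make_instance` are hypothetical and are not part of allennlp or allennlp-models; the sketch assumes only that converting a bad example raises a ValueError, as described above.

```python
import logging
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

logger = logging.getLogger(__name__)

Raw = TypeVar("Raw")
Instance = TypeVar("Instance")


def skip_bad_examples(
    raw_examples: Iterable[Raw],
    make_instance: Callable[[Raw], Instance],
) -> Iterator[Instance]:
    """Yield converted instances, logging and skipping examples that raise ValueError."""
    skipped = 0
    for index, raw in enumerate(raw_examples):
        try:
            instance = make_instance(raw)
        except ValueError as err:
            skipped += 1
            logger.warning("Skipping example %d: %s", index, err)
            continue
        yield instance
    if skipped:
        logger.warning("Skipped %d examples in total.", skipped)


def toy_make_instance(example: Tuple[str, int, str]) -> Tuple[str, str]:
    """Hypothetical converter: reject examples whose answer offset does not match the context."""
    context, answer_start, answer = example
    if context[answer_start:answer_start + len(answer)] != answer:
        raise ValueError("answer span does not match the context")
    return context, answer


examples = [("abc def", 4, "def"), ("abc def", 5, "def"), ("abc def", 0, "abc")]
kept = list(skip_bad_examples(examples, toy_make_instance))
```

Logging the number of skipped examples keeps the data loss visible, which matters if the count turns out to be larger than the roughly 100 instances mentioned earlier.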

dirkgr · 2020-04-21T18:26:44Z

If it's a problem with the input data, it should also fail with BERT though.

dirkgr · 2020-04-21T18:35:25Z

Haha, that is some great annotation. The question is "How many times was the album delayed?". The context is "West's perfectionism led The College Dropout to have its release postponed three times from its initial date in August 2003.". The annotation says the answer is the "3" in "2003" 🤣.
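To illustrate why an annotation like this can trip a strict reader, here is a minimal sketch. It uses plain whitespace tokenization and a hypothetical `strict_char_span_to_token_span` helper, not the actual allennlp-models code or any subword tokenizer; the only point is that the "3" in "2003" starts in the middle of a token, so no token boundary matches the annotated character span.

```python
import re
from typing import List, Tuple


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets, end exclusive, for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def strict_char_span_to_token_span(
    token_offsets: List[Tuple[int, int]], char_span: Tuple[int, int]
) -> Tuple[int, int]:
    """Map a character span (end exclusive) to an inclusive token span.

    Hypothetical helper: raises ValueError unless the span starts and ends
    exactly on token boundaries.
    """
    starts = {start: i for i, (start, _) in enumerate(token_offsets)}
    ends = {end: i for i, (_, end) in enumerate(token_offsets)}
    if char_span[0] not in starts or char_span[1] not in ends:
        raise ValueError(f"Character span {char_span!r} does not align with token boundaries.")
    return starts[char_span[0]], ends[char_span[1]]


context = (
    "West's perfectionism led The College Dropout to have its release "
    "postponed three times from its initial date in August 2003."
)
offsets = whitespace_offsets(context)

# A sensible annotation: the word "three" is a whole token.
good_start = context.index("three")
good_span = strict_char_span_to_token_span(offsets, (good_start, good_start + len("three")))

# The annotation from the dataset: the single character "3" inside "2003".
bad_start = context.index("2003") + 3
try:
    strict_char_span_to_token_span(offsets, (bad_start, bad_start + 1))
    bad_span_error = None
except ValueError as err:
    bad_span_error = str(err)
```

Whether a given tokenizer hits this depends on where it happens to split "2003", which may be why the same example passes with one vocabulary and fails with another.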

dirkgr · 2020-04-21T23:03:45Z

I put a fix for this at allenai/allennlp-models#24.

MaksymDel · 2020-04-22T20:29:32Z

Hi @dirkgr, thanks for the fix!

It looks like I can now successfully read the English SQuAD dataset tokenized with the XLM-R tokenizer.

However, when I try to read data in other languages (e.g. Spanish), I hit this ValueError that is still in the code (lines 107-108):

    if end_index >= len(token_offsets):
        raise ValueError(f"Character span %r outside the range of the given tokens.")

The data I use is MLQA (it comes in SQuAD format):
https://github.com/facebookresearch/MLQA

dirkgr · 2020-04-25T01:25:11Z

You are right! I put another fix here: allenai/allennlp-models#27

dirkgr · 2020-04-27T17:54:23Z

The second fix is now in too. I can train MLQA with this now.

dirkgr · 2020-04-27T17:54:52Z

I'll close this issue. Feel free to open a new one if you find more problems with it!

MaksymDel changed the title ~~transformer_qa fails for XLM-R~~ transformer_qa reader raises ValueError for some multilingual models / datasets Apr 9, 2020

schmmd transferred this issue from allenai/allennlp-models Apr 10, 2020

schmmd added the Models Issues related to the allennlp-models repo label Apr 10, 2020

dirkgr self-assigned this Apr 21, 2020

MaksymDel changed the title ~~transformer_qa reader raises ValueError for some multilingual models / datasets~~ transformer_qa reader raises ValueError for multilingual datasets Apr 22, 2020

dirkgr mentioned this issue Apr 25, 2020

Even better token annotation allenai/allennlp-models#27

Merged

dirkgr closed this as completed Apr 27, 2020
