This repository has been archived by the owner on Dec 16, 2022. It is now read-only.

transformer_qa reader raises ValueError for multilingual datasets #4052

Closed
MaksymDel opened this issue Apr 7, 2020 · 9 comments

Comments

@MaksymDel
Contributor

Hi, I am trying to run the transformer_qa model with xlm-roberta-base, but I am running into this block:
https://github.com/allenai/allennlp-models/blob/336b757fd3687e6a08676f9dd3e9f1625836c73a/allennlp_models/rc/common/reader_utils.py#L84-L85

from this block:
https://github.com/allenai/allennlp-models/blob/336b757fd3687e6a08676f9dd3e9f1625836c73a/allennlp_models/rc/transformer_qa/transformer_squad.py#L171-L178

and get a ValueError while reading instance no. 6007 (so many instances are read successfully before it).

It would be very helpful if you could give any pointers on what could be wrong. The only change I've made is that transformer_model is now set to xlm-roberta-base.

Here is an example of how XLM-R and RoBERTa tokenize the same text:

>>> from allennlp.data.tokenizers import PretrainedTransformerTokenizer
>>> xr = PretrainedTransformerTokenizer("xlm-roberta-base", add_special_tokens=False, calculate_character_offsets=True)
>>> r = PretrainedTransformerTokenizer("roberta-base", add_special_tokens=False, calculate_character_offsets=True)
>>> xr.tokenize("a am writingcode evety day")
[▁a, ▁am, ▁writing, code, ▁ev, ety, ▁day]
>>> r.tokenize("a am writingcode evety day")
[a, Ġam, Ġwriting, code, Ġeve, ty, Ġday]

Plain roberta runs without problems...
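
For readers following along: the error appears to be raised where a character-level answer span is mapped onto token indices. Below is a minimal sketch of that kind of mapping, assuming each token carries (start, end) character offsets. The helper name `char_span_to_token_span_sketch` and the hand-computed offsets for the XLM-R tokens above are illustrative assumptions, not the code in reader_utils.py.

```python
from typing import List, Tuple

# Hand-computed (start, end) character offsets, end exclusive, for the XLM-R
# tokens of "a am writingcode evety day". These are assumptions for
# illustration, not output copied from the tokenizer.
XLMR_OFFSETS: List[Tuple[int, int]] = [
    (0, 1),    # ▁a
    (2, 4),    # ▁am
    (5, 12),   # ▁writing
    (12, 16),  # code
    (17, 19),  # ▁ev
    (19, 22),  # ety
    (23, 26),  # ▁day
]


def char_span_to_token_span_sketch(
    token_offsets: List[Tuple[int, int]],
    char_span: Tuple[int, int],
) -> Tuple[int, int]:
    """Return inclusive (first, last) indices of the tokens overlapping char_span.

    Hypothetical helper: raises ValueError when no token overlaps the span.
    """
    char_start, char_end = char_span  # char_end is exclusive
    start_index = None
    end_index = None
    for i, (tok_start, tok_end) in enumerate(token_offsets):
        # A token overlaps the span if the two half-open ranges intersect.
        if tok_start < char_end and tok_end > char_start:
            if start_index is None:
                start_index = i
            end_index = i
    if start_index is None:
        raise ValueError(
            f"Character span {char_span!r} outside the range of the given tokens."
        )
    return start_index, end_index


print(char_span_to_token_span_sketch(XLMR_OFFSETS, (12, 16)))  # (3, 3)
print(char_span_to_token_span_sketch(XLMR_OFFSETS, (17, 22)))  # (4, 5)
```

A span that lies past the last offset, such as (30, 35) here, overlaps no token and triggers the ValueError.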

@MaksymDel
Contributor Author

MaksymDel commented Apr 9, 2020

UPD: both XLM-R and mBERT also raise this ValueError on the MLQA dataset. Approximately 100 instances are affected...

MaksymDel changed the title from "transformer_qa fails for XLM-R" to "transformer_qa reader raises ValueError for some multilingual models / datasets" on Apr 9, 2020
schmmd transferred this issue from allenai/allennlp-models on Apr 10, 2020
schmmd added the "Models" label (Issues related to the allennlp-models repo) on Apr 10, 2020
@MaksymDel
Contributor Author

MaksymDel commented Apr 18, 2020

So I just skipped these instances and trained the models. It did not affect the scores, so the issue is not critical to the multilingual models' performance. I assume the transformers repo also skips examples like this.

I think not much can be done here apart from just skipping corrupted instances instead of raising an exception in the code.
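
The skip-instead-of-raise approach described above could look roughly like this. It is a sketch, not the allennlp-models reader: `read_skipping_bad_instances` and `make_instance` are hypothetical names, and it assumes the per-example conversion signals an unusable span by raising ValueError.

```python
import logging
from typing import Callable, Iterable, Iterator, TypeVar

logger = logging.getLogger(__name__)

Raw = TypeVar("Raw")
Instance = TypeVar("Instance")


def read_skipping_bad_instances(
    raw_examples: Iterable[Raw],
    make_instance: Callable[[Raw], Instance],
) -> Iterator[Instance]:
    """Yield converted instances, skipping examples whose conversion fails.

    make_instance is a stand-in for whatever turns one raw example into a
    model input; it is assumed to raise ValueError on a bad answer span.
    """
    skipped = 0
    for position, example in enumerate(raw_examples):
        try:
            instance = make_instance(example)
        except ValueError as err:
            # Log and move on instead of aborting the whole read.
            skipped += 1
            logger.warning("Skipping example %d: %s", position, err)
            continue
        yield instance
    if skipped:
        logger.warning("Skipped %d examples with unusable answer spans.", skipped)
```

Logging the skipped count keeps the data loss visible, which matters when roughly 100 instances are dropped.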

dirkgr self-assigned this on Apr 21, 2020
@dirkgr
Member

dirkgr commented Apr 21, 2020

If it's a problem with the input data, it should also fail with BERT though.

@dirkgr
Member

dirkgr commented Apr 21, 2020

Haha, that is some great annotation. The question is "How many times was the album delayed?". Context is "West's perfectionism led The College Dropout to have its release postponed three times from its initial date in August 2003.". The annotation says the answer is the "3" in "2003" 🤣.
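
One way to flag annotations like this is to check whether the answer span starts or ends in the middle of a word. The sketch below does that for the example above; `span_splits_a_word` is a hypothetical helper, and the answer offset is computed from the quoted context rather than read from the dataset file.

```python
def span_splits_a_word(context: str, answer_start: int, answer_text: str) -> bool:
    """Return True if the answer span begins or ends inside an alphanumeric run."""
    answer_end = answer_start + len(answer_text)  # exclusive
    starts_mid_word = (
        answer_start > 0
        and context[answer_start - 1].isalnum()
        and context[answer_start].isalnum()
    )
    ends_mid_word = (
        answer_end < len(context)
        and context[answer_end - 1].isalnum()
        and context[answer_end].isalnum()
    )
    return starts_mid_word or ends_mid_word


context = (
    "West's perfectionism led The College Dropout to have its release "
    "postponed three times from its initial date in August 2003."
)

# Assumed offset: the final "3" of "2003", as described in the comment above.
bad_start = context.index("2003") + 3
print(span_splits_a_word(context, bad_start, "3"))  # True

# A span on word boundaries for comparison.
good_start = context.index("three times")
print(span_splits_a_word(context, good_start, "three times"))  # False
```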

@dirkgr
Member

dirkgr commented Apr 21, 2020

I put a fix for this at allenai/allennlp-models#24.

@MaksymDel
Contributor Author

Hi @dirkgr, thanks for the fix!

It looks like I can now successfully read the English SQuAD dataset tokenized by the XLM-R tokenizer.

However, when I try to read data in other languages (e.g. Spanish), I hit the ValueError that is still in the code (lines 107-108):

    if end_index >= len(token_offsets):
        raise ValueError(f"Character span %r outside the range of the given tokens.")

The data I use is this one (comes in squad format):
https://github.com/facebookresearch/MLQA

MaksymDel changed the title from "transformer_qa reader raises ValueError for some multilingual models / datasets" to "transformer_qa reader raises ValueError for multilingual datasets" on Apr 22, 2020
@dirkgr
Member

dirkgr commented Apr 25, 2020

You are right! I put another fix here: allenai/allennlp-models#27

@dirkgr
Member

dirkgr commented Apr 27, 2020

The second fix is now in too. I can train MLQA with this now.

@dirkgr
Member

dirkgr commented Apr 27, 2020

I'll close this issue. Feel free to open a new one if you find more problems with it!

dirkgr closed this as completed on Apr 27, 2020