transformer_qa reader raises ValueError for multilingual datasets #4052
UPD: both XLM-R and mBERT also raise this ValueError on the MLQA dataset. Approximately 100 instances are affected...
So I just skipped these instances and trained the models. It did not affect the scores, so the issue is not critical to multilingual model performance. I think not much can be done here apart from skipping corrupted instances instead of raising an exception in the code.
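A minimal sketch of that skip-instead-of-raise behavior (the `make_instance` callable here is a hypothetical stand-in for the reader's per-example instance construction, not an actual AllenNLP API):

```python
def read_examples(examples, make_instance):
    """Yield one instance per example, skipping examples whose answer
    spans cannot be aligned with the tokenized text."""
    skipped = 0
    for example in examples:
        try:
            yield make_instance(example)
        except ValueError:
            # Misaligned annotation: the character span falls outside
            # the tokenized range, so drop the example instead of failing.
            skipped += 1
    if skipped:
        print(f"Skipped {skipped} corrupted instance(s)")
```

In a real `DatasetReader` you would log the skip count rather than print it, so the reader stays quiet under normal operation.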
If it's a problem with the input data, it should also fail with BERT, though.
Haha, that is some great annotation. The question is "How many times was the album delayed?". The context is:
I put a fix for this at allenai/allennlp-models#24. |
Hi @dirkgr, thanks for the fix! Looks like I can now read the English SQuAD dataset tokenized by the XLM-R tokenizer successfully. However, when I try to read data in other languages (e.g. Spanish), I get the ValueError that is left in the code (lines 107-108):

```python
if end_index >= len(token_offsets):
    raise ValueError(f"Character span %r outside the range of the given tokens.")
```

The data I use is this one (comes in SQuAD format):
You are right! I put another fix here: allenai/allennlp-models#27 |
The second fix is now in too. I can train MLQA with this now. |
I'll close this issue. Feel free to open a new one if you find more problems with it! |
Hi, I am trying to run the `transformer_qa` model with `xlm_roberta_base`, but I am running into this block:

https://github.com/allenai/allennlp-models/blob/336b757fd3687e6a08676f9dd3e9f1625836c73a/allennlp_models/rc/common/reader_utils.py#L84-L85

reached from this block:

https://github.com/allenai/allennlp-models/blob/336b757fd3687e6a08676f9dd3e9f1625836c73a/allennlp_models/rc/transformer_qa/transformer_squad.py#L171-L178

and get a `ValueError` while reading instance nr. 6007 (so many of them pass successfully). It would be very helpful if you could give any pointers on what could be wrong. The only change I've made is that the `transformer_model` type now equals `xlm-roberta-base`.

So here is an example of the tokenization of XLM-R vs. RoBERTa:

Plain RoBERTa runs without problems...
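For context, the check that fires here maps the answer's character span onto the tokens' character offsets. A simplified sketch of that mapping (an illustration, not the actual AllenNLP implementation) shows how a span that no token covers runs off the end of the offset list and raises, which can happen when a tokenizer's normalization does not preserve the annotated characters:

```python
def char_span_to_token_span(token_offsets, char_start, char_end):
    """Map a character span onto inclusive (start, end) token indices.

    token_offsets: list of (start_char, end_char) pairs, one per token.
    Raises ValueError when no token covers the requested span.
    """
    start_index = 0
    # Skip tokens that end before the span starts.
    while start_index < len(token_offsets) and token_offsets[start_index][1] <= char_start:
        start_index += 1
    if start_index >= len(token_offsets):
        # The annotated characters are not covered by any token.
        raise ValueError("Character span outside the range of the given tokens.")
    end_index = start_index
    # Consume tokens that start before the span ends.
    while end_index < len(token_offsets) and token_offsets[end_index][0] < char_end:
        end_index += 1
    return start_index, end_index - 1
```

For example, with offsets `[(0, 5), (6, 11)]` for "Hello world", the span (6, 11) maps to token 1, while a span starting past character 11 raises.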