regarding #1026 pull request #1158
Conversation
Oh I see what you mean, indeed that's a more general issue with saving and loading a tokenizer with specific configuration parameters. This is actually also relevant to our work on XLM's tokenizer in #1092
Dear Thomas,
Ok, let's do that for now and I'll think about a more general way to save tokenizer configurations.
awesome. thanks
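The exchange above is about tokenizer configuration parameters such as do_lower_case not being persisted when a tokenizer is saved, so they have to be repeated on every reload. Below is a minimal sketch of that asymmetry, assuming the pytorch-transformers API of the time (BertTokenizer.from_pretrained / save_pretrained); the path and flag values are placeholders, not taken from this PR.

```python
import os
from pytorch_transformers import BertTokenizer

output_dir = "./saved_tokenizer"  # placeholder path
os.makedirs(output_dir, exist_ok=True)

# The tokenizer is created with an explicit configuration parameter ...
tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)
tokenizer.save_pretrained(output_dir)  # saves the vocabulary, but not do_lower_case

# ... so reloading without repeating the flag falls back to the class default
# (lower-casing, for BertTokenizer) and tokenizes cased text differently.
reloaded = BertTokenizer.from_pretrained(output_dir)

# Until tokenizer configurations are saved alongside the vocabulary,
# the flag has to be passed again explicitly on every reload:
reloaded = BertTokenizer.from_pretrained(output_dir, do_lower_case=False)
```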
Codecov Report
@@           Coverage Diff            @@
##           master    #1158    +/-   ##
========================================
  Coverage    80.7%    80.7%
========================================
  Files          46       46
  Lines        7411     7411
========================================
  Hits         5981     5981
  Misses       1430     1430
Continue to review full report at Codecov.
1 similar comment
Addressing this up-stream with #1092
Dear Thomas,
This is regarding my #1026 pull request; here is my understanding of the reproducibility issue I was getting:
On line 451, the tokenizer is reloaded without setting do_lower_case. If you run with both do_train and do_eval, you therefore get different results than if you run do_eval alone on the same directory, because with do_eval alone the tokenizer is loaded on line 408, where do_lower_case is taken into account.
The second issue I see is that when you run both do_train and do_eval, the tokenizer is read from output_dir, but when you run only do_eval, it is read from args.model_name_or_path, which can be a different tokenizer and can therefore give different results. It would be better to reload the tokenizer once from output_dir during evaluation and remove that reload from the training part (see the sketch below).
thanks.
Best regards,
Rabeeh
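The following is a rough sketch of the control flow described above and of the suggested change: drop the reload inside the training branch and reload the tokenizer once from output_dir, with do_lower_case repeated, before evaluation. Names such as args.model_name_or_path, args.output_dir and args.do_lower_case follow the example script being discussed, the line numbers are the ones cited above, and BertTokenizer stands in for the script's tokenizer_class; details are simplified, not the actual code of the PR.

```python
from pytorch_transformers import BertTokenizer

def main(args):
    # Around line 408: initial load, do_lower_case is respected.
    tokenizer = BertTokenizer.from_pretrained(
        args.model_name_or_path, do_lower_case=args.do_lower_case
    )

    if args.do_train:
        # ... training ...
        tokenizer.save_pretrained(args.output_dir)
        # Issue 1: around line 451 the script reloads the tokenizer here
        # without do_lower_case; the suggestion is to remove that reload.

    if args.do_eval:
        # Issue 2: with do_eval alone, the tokenizer above still comes from
        # args.model_name_or_path, which may differ from what was saved in
        # args.output_dir. Suggested fix: reload once here, flag repeated.
        tokenizer = BertTokenizer.from_pretrained(
            args.output_dir, do_lower_case=args.do_lower_case
        )
        # ... evaluation with this tokenizer ...
```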