regarding #1026 pull request #1158
Conversation
Oh I see what you mean, indeed that's a more general issue with saving and loading a tokenizer with specific configuration parameters. This is actually also relevant to our work on XLM's tokenizer in #1092
Dear Thomas,
Ok, let's do that for now and I'll think about a more general way to save tokenizer configurations.
awesome. thanks
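The exchange above is about tokenizer configuration parameters such as do_lower_case not being persisted when a tokenizer is saved, so they have to be repeated on every reload. Below is a minimal sketch of that asymmetry, assuming the pytorch-transformers API of the time (BertTokenizer.from_pretrained / save_pretrained); the path and flag values are placeholders, not taken from this PR.

```python
import os
from pytorch_transformers import BertTokenizer

output_dir = "./saved_tokenizer"  # placeholder path
os.makedirs(output_dir, exist_ok=True)

# The tokenizer is created with an explicit configuration parameter ...
tokenizer = BertTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)
tokenizer.save_pretrained(output_dir)  # saves the vocabulary, but not do_lower_case

# ... so reloading without repeating the flag falls back to the class default
# (lower-casing, for BertTokenizer) and tokenizes cased text differently.
reloaded = BertTokenizer.from_pretrained(output_dir)

# Until tokenizer configurations are saved alongside the vocabulary,
# the flag has to be passed again explicitly on every reload:
reloaded = BertTokenizer.from_pretrained(output_dir, do_lower_case=False)
```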
Codecov Report
@@           Coverage Diff            @@
##           master    #1158    +/-   ##
========================================
  Coverage    80.7%    80.7%
========================================
  Files          46       46
  Lines        7411     7411
========================================
  Hits         5981     5981
  Misses       1430     1430
Continue to review full report at Codecov.
1 similar comment
Addressing this up-stream with #1092
Dear Thomas,
This is regarding my #1026 pull request; here is my understanding of the reproducibility issue I was getting:
On line 451, the tokenizer is reloaded without setting do_lower_case. If you run with both do_train and do_eval, you therefore get different results than if you run do_eval alone on the same directory, because with do_eval alone the tokenizer is loaded on line 408, where do_lower_case is taken into account.
The second issue I see is that when you run both do_train and do_eval, the tokenizer is read from output_dir, but when you run only do_eval, it is read from args.model_name_or_path, which can be a different tokenizer and can therefore give different results. It would be better to reload the tokenizer once from output_dir during evaluation and remove that reload from the training part (see the sketch below).
thanks.
Best regards,
Rabeeh
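The following is a rough sketch of the control flow described above and of the suggested change: drop the reload inside the training branch and reload the tokenizer once from output_dir, with do_lower_case repeated, before evaluation. Names such as args.model_name_or_path, args.output_dir and args.do_lower_case follow the example script being discussed, the line numbers are the ones cited above, and BertTokenizer stands in for the script's tokenizer_class; details are simplified, not the actual code of the PR.

```python
from pytorch_transformers import BertTokenizer

def main(args):
    # Around line 408: initial load, do_lower_case is respected.
    tokenizer = BertTokenizer.from_pretrained(
        args.model_name_or_path, do_lower_case=args.do_lower_case
    )

    if args.do_train:
        # ... training ...
        tokenizer.save_pretrained(args.output_dir)
        # Issue 1: around line 451 the script reloads the tokenizer here
        # without do_lower_case; the suggestion is to remove that reload.

    if args.do_eval:
        # Issue 2: with do_eval alone, the tokenizer above still comes from
        # args.model_name_or_path, which may differ from what was saved in
        # args.output_dir. Suggested fix: reload once here, flag repeated.
        tokenizer = BertTokenizer.from_pretrained(
            args.output_dir, do_lower_case=args.do_lower_case
        )
        # ... evaluation with this tokenizer ...
```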