
regarding #1026 pull request #1158

Merged · 3 commits merged into huggingface:master on Aug 30, 2019

Conversation

@rabeehk commented Aug 30, 2019

Dear Thomas,
This is regarding my pull request #1026; here is my understanding of the reproducibility issue I was seeing:

  • On line 451, the tokenizer is reloaded without setting do_lower_case. As a result, running do_train together with do_eval gives different results than running do_eval alone on the same directory, because with do_eval only, the tokenizer is loaded on line 408, where do_lower_case is taken into account.

  • The second issue is that with both do_train and do_eval the tokenizer is read from output_dir, whereas with do_eval alone it is read from args.model_name_or_path, which can differ and therefore produce different results. It would be better to reload the tokenizer once from output_dir during evaluation and remove the reload from the training part (see the sketch below).
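To make this concrete, here is a minimal sketch of what I mean (the path and flag values are placeholders standing in for the script's arguments, not the exact code in the script):

```python
# Minimal sketch: during evaluation, reload the tokenizer from the training
# output directory and pass do_lower_case explicitly, so that do_train+do_eval
# and do_eval-only runs tokenize the input identically.
from pytorch_transformers import BertTokenizer

output_dir = "./output"   # placeholder for args.output_dir
do_lower_case = True      # placeholder for args.do_lower_case

tokenizer = BertTokenizer.from_pretrained(output_dir, do_lower_case=do_lower_case)
```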

Thanks.
Best regards,
Rabeeh

@thomwolf commented Aug 30, 2019

Oh, I see what you mean; that is indeed a more general issue with saving and loading a tokenizer that has specific configuration parameters. It is also relevant to our work on XLM's tokenizer in #1092.
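One possible shape for such a general fix (purely a sketch; the file name and helper functions below are assumptions, not the actual implementation in #1092) would be to persist the tokenizer's init kwargs next to the vocabulary when saving, and re-apply them on load:

```python
# Hypothetical sketch of persisting tokenizer init kwargs alongside the vocab,
# so a later from_pretrained-style load can restore e.g. do_lower_case.
# The file name and helper names are illustrative only.
import json
import os

def save_tokenizer_init_kwargs(save_directory, init_kwargs):
    # write the kwargs the tokenizer was created with (e.g. {"do_lower_case": True})
    with open(os.path.join(save_directory, "tokenizer_config.json"), "w") as f:
        json.dump(init_kwargs, f)

def load_tokenizer_init_kwargs(save_directory):
    # read them back, returning {} if the file is missing (older checkpoints)
    path = os.path.join(save_directory, "tokenizer_config.json")
    if not os.path.isfile(path):
        return {}
    with open(path) as f:
        return json.load(f)
```

That way a checkpoint directory would carry enough information to rebuild the tokenizer exactly as it was configured for training.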

@rabeehk commented Aug 30, 2019

Dear Thomas,
Unfortunately, pull request #1026 does not work when using eval_all_checkpoints, so I was wondering if you could undo that merge. Sorry about this; this new pull request works for me.
Thanks.

@thomwolf

OK, let's do that for now, and I'll think about a more general way to save tokenizer configurations.

@thomwolf merged commit b66e9b4 into huggingface:master on Aug 30, 2019
@rabeehk commented Aug 30, 2019

Awesome, thanks!

@codecov-io

Codecov Report

Merging #1158 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@          Coverage Diff           @@
##           master   #1158   +/-   ##
======================================
  Coverage    80.7%   80.7%           
======================================
  Files          46      46           
  Lines        7411    7411           
======================================
  Hits         5981    5981           
  Misses       1430    1430

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e0caab0...0a2fecd. Read the comment docs.


@thomwolf

Addressing this upstream in #1092.
