Added cleaned configuration properties for tokenizer with serialization - improve tokenization of XLM #1092

shijie-wu · 2019-08-24T01:52:48Z

This PR improve the tokenization of XLM. It's mostly the same as the preprocessing in the original XLM. This PR also add use_lang_emb to config of XLM model, which makes adding the newly release XLM-17 & XLM-100 easier since both of them don't have language embedding.

Details on tokenization:

Introduce API change: Changing XLMTokenizer.tokenize(self, text) to XLMTokenizer.tokenize(text, lang='en')
New dependency:
- sacremoses: port of Moses
New optional dependencies:
- pythainlp: Thai tokenizer
- kytea: Japanese tokenizer, wrapper of KyTea (Need external C++ compilation), used by the newly release XLM-17 & XLM-100
- jieba: Chinese tokenizer *

* XLM used Stanford Segmenter. However, the wrapper (nltk.tokenize.stanford_segmenter) are slow due to JVM overhead, and it will be deprecated. Jieba is a lot faster and pip-installable. But there is some mismatch with the Stanford Segmenter. A workaround could be having an argument to allow users to segment the sentence by themselves and bypass the segmenter. As a reference, I also include nltk.tokenize.stanford_segmenter in this PR.

Example of tokenization difference could be found here.

…anguages except zh, ja and th; Change API to allow specifying language in `tokenize`

thomwolf

Thanks a lot for all this work, it's great!

I've made a few comments on things to update. Mostly that we are only going to add sacremoses as required dependencies and raise error messages for the others.

I need to do a few modifications up-stream as mentioned in the comments to make it easier here.
Will do it in another PR so you can have a look.

pytorch_transformers/tokenization_xlm.py

requirements.txt

setup.py

pytorch_transformers/tokenization_xlm.py

codecov-io · 2019-08-27T12:11:18Z

Codecov Report

Merging #1092 into master will increase coverage by 0.09%.
The diff coverage is 78.2%.

@@            Coverage Diff             @@
##           master    #1092      +/-   ##
==========================================
+ Coverage   79.61%   79.71%   +0.09%     
==========================================
  Files          42       42              
  Lines        6898     7010     +112     
==========================================
+ Hits         5492     5588      +96     
- Misses       1406     1422      +16

Impacted Files	Coverage Δ
pytorch_transformers/modeling_xlm.py	`86.73% <100%> (+0.07%)`	⬆️
pytorch_transformers/tokenization_bert.py	`95.63% <100%> (+0.79%)`	⬆️
pytorch_transformers/tokenization_utils.py	`86.49% <100%> (+0.26%)`	⬆️
pytorch_transformers/tokenization_xlm.py	`83.4% <74.43%> (+0.33%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update df9d6ef...3871b8a. Read the comment docs.

thomwolf · 2019-08-30T15:12:45Z

Hi @shijie-wu,
So I've taken advantage of this PR to add a clean mechanism to set, save and reload tokenizer configurations.
This should fix in particular a recurring issue mentioned in #1158 and #1026 (failing to reload the lower casing configuration of the tokenizer) but more generally this is essential now for XLM's more complex language configuration.
Hope you don't mind me highjacking the PR.

thomwolf · 2019-08-30T21:15:33Z

Ok I think this is good to go. Let's merge it.

shijie-wu added 3 commits August 23, 2019 14:40

Tokenization behave the same as original XLM proprocessing for most l…

436ce07

…anguages except zh, ja and th; Change API to allow specifying language in `tokenize`

Add custom tokenizer for zh and ja

e85123d

Add use_lang_emb to config

f1b0187

thomwolf requested changes Aug 27, 2019

View reviewed changes

add kwargs to base encode function

a175a9d

shijie-wu and others added 2 commits August 27, 2019 20:03

Match order of casing in OSS XLM; Improve document; Clean up dependency

ca4baf8

Added option to setup pretrained tokenizer arguments

82462c5

thomwolf mentioned this pull request Aug 30, 2019

regarding #1026 pull request #1158

Merged

thomwolf added 2 commits August 30, 2019 16:26

adding 17 and 100 xlm models

8678ff8

adding xlm 17 and 100 models and config on aws

3871b8a

thomwolf changed the title ~~Improve support of XLM~~ Generalize pretrained model configuration properties for tokenizer - improve tokenization of XLM Aug 30, 2019

thomwolf added 2 commits August 30, 2019 16:55

saving and reloading tokenizer configurations

88111de

added test and debug tokenizer configuration serialization

69da972

thomwolf changed the title ~~Generalize pretrained model configuration properties for tokenizer - improve tokenization of XLM~~ Added cleaned configuration properties for tokenizer with serialization - improve tokenization of XLM Aug 30, 2019

thomwolf and others added 2 commits August 30, 2019 17:15

Merge branch 'master' into xlm-tokenization

cd65c41

fix tokenizers serialization

7044ed6

thomwolf merged commit d2f21f0 into huggingface:master Aug 30, 2019

shijie-wu deleted the xlm-tokenization branch September 4, 2019 01:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added cleaned configuration properties for tokenizer with serialization - improve tokenization of XLM #1092

Added cleaned configuration properties for tokenizer with serialization - improve tokenization of XLM #1092

shijie-wu commented Aug 24, 2019 •

edited by thomwolf

Loading

thomwolf left a comment

codecov-io commented Aug 27, 2019 •

edited

Loading

thomwolf commented Aug 30, 2019 •

edited

Loading

thomwolf commented Aug 30, 2019

Added cleaned configuration properties for tokenizer with serialization - improve tokenization of XLM #1092

Added cleaned configuration properties for tokenizer with serialization - improve tokenization of XLM #1092

Conversation

shijie-wu commented Aug 24, 2019 • edited by thomwolf Loading

thomwolf left a comment

Choose a reason for hiding this comment

codecov-io commented Aug 27, 2019 • edited Loading

Codecov Report

thomwolf commented Aug 30, 2019 • edited Loading

thomwolf commented Aug 30, 2019

shijie-wu commented Aug 24, 2019 •

edited by thomwolf

Loading

codecov-io commented Aug 27, 2019 •

edited

Loading

thomwolf commented Aug 30, 2019 •

edited

Loading