DistilBERT, GPT-2 Large, XLM multilingual models, torch.hub, bug fixes
New model architecture: DistilBERT
Huggingface's new transformer architecture, DistilBERT, is described in Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.
This new model architecture comes with two pretrained checkpoints:
- distilbert-base-uncased: the base DistilBert model
- distilbert-base-uncased-distilled-squad: DistilBert model fine-tuned with distillation on SQuAD.
New GPT2 checkpoint: GPT-2 large (774M parameters)
The third OpenAI GPT-2 checkpoint is available in the library: 774M parameters, 36 layers, and 20 heads.
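As a rough sanity check on the 774M figure, the parameter count can be reproduced from the layer count above. The hidden size (1280), vocabulary size (50257) and context length (1024) used below are not stated in these notes; they are assumptions taken from the published GPT-2 large configuration. The per-layer breakdown is a sketch of a standard GPT-2 block, not code from the library, and it assumes the output projection shares its weights with the token embeddings.

```python
# Back-of-the-envelope parameter count for GPT-2 large.
# n_layer and n_head come from the release notes; the remaining values are
# assumptions based on the published GPT-2 large configuration.
n_layer, n_head = 36, 20
n_embd, vocab_size, n_positions = 1280, 50257, 1024  # assumed, not in the notes


def gpt2_param_count(n_layer, n_embd, vocab_size, n_positions):
    """Count weights and biases of a GPT-2 style decoder (tied LM head)."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # Attention: fused QKV projection, then the output projection.
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # Feed-forward: expand to 4 * n_embd, then project back.
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # Two layer norms per block, each with a weight and a bias vector.
    layer_norms = 2 * 2 * n_embd
    # Final layer norm after the last block.
    final_layer_norm = 2 * n_embd
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_layer_norm


total = gpt2_param_count(n_layer, n_embd, vocab_size, n_positions)
head_dim = n_embd // n_head
print(f"{total:,} parameters (~{total / 1e6:.0f}M), {head_dim} dimensions per head")
```

Under these assumptions the count lands at roughly 774M, which matches the checkpoint name.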
New XLM multilingual checkpoints: 17 & 100 languages
We have added two new XLM models in 17 and 100 languages which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.
Back on torch.hub with all the architectures
Pytorch-Transformers torch.hub interface is based on Auto-Models, which are generic classes designed to be instantiated using from_pretrained() in a model architecture guessed from the pretrained checkpoint name (e.g. AutoModel.from_pretrained('bert-base-uncased') will instantiate a BertModel and load the 'bert-base-uncased' checkpoint in it). There are currently 4 classes of Auto-Models: AutoModel, AutoModelWithLMHead, AutoModelForSequenceClassification and AutoModelForQuestionAnswering.
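The idea of guessing an architecture from the checkpoint name can be illustrated with a small, self-contained sketch. The pattern table and the guess_architecture helper below are hypothetical and simplified; they are not the library's code, and the function returns class names as strings instead of instantiating models.

```python
# Simplified illustration of name-based dispatch. The table and the helper
# are hypothetical; they only mimic the idea behind Auto-Models.
ARCHITECTURE_PATTERNS = [
    # More specific patterns come first: "distilbert" and "roberta"
    # both contain "bert", so they must be checked before it.
    ("distilbert", "DistilBertModel"),
    ("roberta", "RobertaModel"),
    ("bert", "BertModel"),
    ("openai-gpt", "OpenAIGPTModel"),
    ("gpt2", "GPT2Model"),
    ("transfo-xl", "TransfoXLModel"),
    ("xlnet", "XLNetModel"),
    ("xlm", "XLMModel"),
]


def guess_architecture(checkpoint_name: str) -> str:
    """Return the name of the model class matching a checkpoint name."""
    for pattern, class_name in ARCHITECTURE_PATTERNS:
        if pattern in checkpoint_name:
            return class_name
    raise ValueError(f"Unrecognized checkpoint name: {checkpoint_name!r}")


for name in ("bert-base-uncased", "distilbert-base-uncased", "gpt2-large"):
    print(name, "->", guess_architecture(name))
```

The ordering of the table matters in this sketch: a plain substring test would otherwise map a DistilBERT or RoBERTa checkpoint to the BERT class.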
New dependency: sacremoses
Support for XLM is improved by carefully reproducing the original tokenization workflow (work by @shijie-wu in #1092). We now rely on sacremoses, a python port of Moses tokenizer, truecaser and normalizer by @alvations, for XLM word tokenization.
In a few languages (Thai, Japanese and Chinese) XLM tokenizer will require additional dependencies. These additional dependencies are optional at the library level. Using XLM tokenizer in these languages without the additional dependency will raise an error message with installation instructions. The additional optional dependencies are:
- pythainlp: Thai tokenizer
- kytea: Japanese tokenizer, wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
- jieba: Chinese tokenizer *
* XLM used the Stanford Segmenter. However, the wrapper (nltk.tokenize.stanford_segmenter) is slow due to JVM overhead, and it will be deprecated. Jieba is a lot faster and pip-installable, but there is some mismatch with the Stanford Segmenter. A workaround could be an argument that lets users segment the sentence themselves and bypass the segmenter. As a reference, I also include nltk.tokenize.stanford_segmenter in this PR.
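The optional-dependency behaviour described above (import only when a language needs it, and fail with installation instructions otherwise) could be sketched as follows. The OPTIONAL_TOKENIZERS table and the load_optional_tokenizer helper are hypothetical names used for illustration, not the library's implementation; kytea is left out because it needs a separate C++ build.

```python
import importlib

# Hypothetical mapping from language code to (module name, install hint).
OPTIONAL_TOKENIZERS = {
    "th": ("pythainlp", "pip install pythainlp"),
    "zh": ("jieba", "pip install jieba"),
}


def load_optional_tokenizer(lang, registry=OPTIONAL_TOKENIZERS):
    """Import the tokenizer module for `lang` only when it is requested."""
    module_name, install_hint = registry[lang]
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        # Re-raise with instructions so the user knows how to fix it.
        raise ImportError(
            f"Tokenizing '{lang}' requires the optional package "
            f"'{module_name}'. Install it with: {install_hint}"
        ) from err
```

Because the import happens inside the function, users who never tokenize Thai or Chinese text do not need these packages installed.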
Bug fixes and improvements to the library modules
- Bertology script has seen major improvements (@tuvuumass )
- Iterative tokenization is now faster and accepts arbitrary numbers of added tokens (@samvelyan)
- Added RoBERTa to AutoModels and AutoTokenizers (@LysandreJik )
- Added GPT-2 Large 774M model (@thomwolf )
- Added language model fine-tuning with GPT/GPT-2 (CLM), BERT/RoBERTa (MLM) (@LysandreJik @thomwolf )
- Multi-GPU training has been patched (@FeiWang96 )
- Scripts are updated to reflect Pytorch 1.1.0 changes (scheduler, optimizer) (@Morizeyao, @adai183 )
- Updated the in-depth BERT fine-tuning scripts to pytorch-transformers (@Morizeyao)
- Models saved with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (@LysandreJik @thomwolf)
- Add proxies and force_download options to from_pretrained() method to be able to use proxies and update cached models/tokenizers (@thomwolf)
- Add shortcut to each special token with _id properties (e.g. tokenizer.cls_token_id for the id in the vocabulary of tokenizer.cls_token) (@thomwolf)
- Fix GPT2 and RoBERTa tokenizer so that sentences to be tokenized always begin with at least one space (see note by fairseq authors) (@thomwolf)
- Fix and clean up byte-level BPE tests (@thomwolf)
- Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against common tests. (@LysandreJik )
- Fix a warning raised when the decode method is called for a model with no sep_token like GPT-2 (@LysandreJik)
- Updated the tokenizers saving method (@boy2000-007man)
- SpaCy tokenizers have been updated in the tokenizers (@GuillemGSubies )
- Stable EnvironmentErrors have been added to utility files (@abhishekraok)
- Fixed distributed barrier hang (@VictorSanh)
- Encoding functions now return the input tokens instead of throwing an error when not implemented in child class (@LysandreJik )
- Change layer norm code to PyTorch's native layer norm (@dhpollack)
- Improved tokenization for XLM for multilingual inputs (@shijie-wu)
- Add language input and access to language to id conversion in XLM tokenizer (@thomwolf)
- Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (@thomwolf)
- Added new AutoModels: AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForQuestionAnswering (@LysandreJik)
- Torch.hub is now based on AutoModels (@LysandreJik @thomwolf)
- Fix Transformer-XL attention mask dtype to be bool (@CrafterKolyan)
- Adding DistilBert model architecture and checkpoints (@VictorSanh @LysandreJik @thomwolf)
- Fixes to DistilBert configuration and training script (@stefan-it)
- Fix XLNet attention mask for fp16 (@ziliwang)
- Documentation auto-deploy (@LysandreJik)
- Fix to add a tuple of tokens (@epwalsh)
- Update fp16 apex implementation in scripts (@anhnt170489)
- Fix XLNet bias resizing when adding/removing tokens (@LysandreJik)
- Fix tokenizer reloading in example scripts (@rabeehk)
- Fix byte-level decoding error when using added tokens (@thomwolf @LysandreJik)
- Fix epsilon value in RoBERTa pretrained checkpoints (@julien-c)
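The _id shortcut for special tokens mentioned in the list above can be pictured with a toy class. ToyTokenizer and its three-entry vocabulary are hypothetical stand-ins used only to show the property pattern; they are not the library's tokenizer.

```python
class ToyTokenizer:
    """Minimal stand-in for a tokenizer; not the library's implementation."""

    def __init__(self, vocab, cls_token="[CLS]", sep_token="[SEP]"):
        self.vocab = vocab
        self.cls_token = cls_token
        self.sep_token = sep_token

    def convert_tokens_to_ids(self, token):
        # Look up a single token in the vocabulary.
        return self.vocab[token]

    @property
    def cls_token_id(self):
        # Shortcut: id in the vocabulary of the classification token.
        return self.convert_tokens_to_ids(self.cls_token)

    @property
    def sep_token_id(self):
        # Shortcut: id in the vocabulary of the separator token.
        return self.convert_tokens_to_ids(self.sep_token)


# Illustrative vocabulary; the ids are arbitrary.
tokenizer = ToyTokenizer({"[PAD]": 0, "[CLS]": 101, "[SEP]": 102})
print(tokenizer.cls_token, tokenizer.cls_token_id)
```

The property saves a separate token-to-id lookup each time a special token id is needed, for example when building model inputs by hand.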