
Tokenization and vocabulary


UER-py supports multiple tokenization strategies. The most commonly used strategy is BertTokenizer, which is also the default. There are two ways of using BertTokenizer: the first is to specify the vocabulary path through --vocab_path, in which case BERT's original tokenization strategy is used to segment sentences according to the vocabulary; the second is to specify a sentencepiece model path through --spm_model_path, in which case sentencepiece is imported, the sentencepiece model is loaded, and the sentence is segmented with it. In short, if the user specifies --spm_model_path, sentencepiece is used for tokenization; otherwise, the user must specify --vocab_path and BERT's original tokenization strategy is used.
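
The following is a minimal sketch of this selection logic, not UER-py's actual code: build_tokenizer, the file paths, and the whitespace fallback are illustrative assumptions, and UER-py's BertTokenizer implements BERT's full WordPiece algorithm rather than the simple lookup shown here.

```python
import sentencepiece as spm

def build_tokenizer(vocab_path=None, spm_model_path=None):
    if spm_model_path is not None:
        # Load the sentencepiece model and segment text with it.
        sp = spm.SentencePieceProcessor()
        sp.Load(spm_model_path)
        return sp.EncodeAsPieces
    if vocab_path is None:
        raise ValueError("Specify --vocab_path when --spm_model_path is not used.")
    with open(vocab_path, encoding="utf-8") as f:
        vocab = {line.strip() for line in f if line.strip()}
    # Stand-in for BERT's original tokenization: keep in-vocabulary tokens,
    # map everything else to the unknown token.
    return lambda text: [t if t in vocab else "[UNK]" for t in text.split()]

tokenize = build_tokenizer(spm_model_path="model.spm")  # hypothetical model path
print(tokenize("UER-py supports multiple tokenization strategies."))
```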

In addition, the project provides CharTokenizer and SpaceTokenizer. CharTokenizer tokenizes the text character by character. If the text consists entirely of Chinese characters, CharTokenizer and BertTokenizer are equivalent. CharTokenizer is simpler and faster than BertTokenizer. SpaceTokenizer splits the text on spaces. One can preprocess the text in advance (e.g. with word segmentation), separate the tokens by spaces, and then use SpaceTokenizer. For CharTokenizer and SpaceTokenizer, if the user specifies --spm_model_path, the vocabulary in the sentencepiece model is used. Otherwise, the user must provide a vocabulary through --vocab_path.
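
An illustrative sketch of these two strategies (not UER-py's actual classes): character-level splitting versus splitting on spaces over pre-segmented text.

```python
def char_tokenize(text):
    # CharTokenizer: every character becomes a token.
    return list(text)

def space_tokenize(text):
    # SpaceTokenizer: the text is assumed to be pre-segmented,
    # with tokens separated by spaces.
    return text.split(" ")

print(char_tokenize("深度学习"))         # ['深', '度', '学', '习']
print(space_tokenize("深度 学习 框架"))  # ['深度', '学习', '框架']
```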

The pre-processing, pre-training, and fine-tuning stages all require a vocabulary, which is provided through --vocab_path or --spm_model_path. Users who use their own vocabularies need to ensure that: 1) the ID of the padding character is 0; 2) the starting character, separator character, and mask character are "[CLS]", "[SEP]", and "[MASK]"; 3) if --vocab_path is specified, the unknown character is "[UNK]"; if --spm_model_path is specified, the unknown character is "<unk>".
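
A quick sanity check, written only for illustration, for a custom --vocab_path vocabulary following the constraints above. The literal "[PAD]" string at ID 0 and the example file path are assumptions (they match Google's BERT vocabularies shipped with UER-py examples); the hard requirement from the text is only that ID 0 is the padding character and that the special tokens exist.

```python
def check_vocab(vocab_path):
    with open(vocab_path, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f]
    # Assumed convention: the padding character "[PAD]" sits at ID 0.
    assert tokens[0] == "[PAD]", "the padding character must have ID 0"
    for special in ("[CLS]", "[SEP]", "[MASK]", "[UNK]"):
        assert special in tokens, "missing special token: " + special
    return {token: idx for idx, token in enumerate(tokens)}

vocab = check_vocab("models/google_zh_vocab.txt")  # example path
print(vocab["[PAD]"], vocab["[CLS]"], vocab["[SEP]"])
```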
