Scripts

UER-py provides a rich set of tool scripts for pre-trained models. This section first summarizes the tool scripts and their functions, and then provides usage examples for some of them.

Script Function description
average_model.py Take the average of pre-trained models (a frequently-used ensemble strategy for deep learning models)
build_vocab.py Build vocabulary (multi-processing supported)
check_model.py Check the model (single GPU or multiple GPUs)
cloze_test.py Randomly mask a word and predict it; the top n words are returned
convert_bert_from_uer_to_google.py Convert BERT from UER format to Google format (TF)
convert_bert_from_uer_to_huggingface.py Convert BERT from UER format to Huggingface format (PyTorch)
convert_bert_from_google_to_uer.py Convert BERT from Google format (TF) to UER format
convert_bert_from_huggingface_to_uer.py Convert BERT from Huggingface format (PyTorch) to UER format
diff_vocab.py Compare two vocabularies
dynamic_vocab_adapter.py Adapt a pre-trained model to a given vocabulary. This saves memory in the fine-tuning stage since the task-specific vocabulary is much smaller than the general-domain vocabulary
extract_embeddings.py Extract word embeddings from a pre-trained model
extract_features.py Extract the last-layer hidden states of a pre-trained model
topn_words_dep.py Find nearest neighbours with context-dependent word embeddings
topn_words_indep.py Find nearest neighbours with context-independent word embeddings

Cloze test

cloze_test.py uses the MLM target to predict masked words; the top n candidate words are returned. The cloze test can be used for operations such as data augmentation. Example of using cloze_test.py:

python3 scripts/cloze_test.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                              --config_path models/bert/base_config.json \
                              --test_path datasets/tencent_profile.txt --prediction_path output.txt \
                              --target bert

Notice that cloze_test.py only supports the bert, mlm, and albert targets.
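
Conceptually, the cloze test masks one position, runs the masked language model, and keeps the tokens with the highest scores at that position. The following sketch illustrates the idea with the HuggingFace transformers API rather than UER-py; the checkpoint name and example sentence are placeholders, not part of the script.

import torch
from transformers import BertForMaskedLM, BertTokenizer

# Illustrative checkpoint; any Chinese MLM checkpoint would do.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

text = "中国的首都是[MASK]京。"                      # sentence with one masked token
inputs = tokenizer(text, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()

with torch.no_grad():
    logits = model(**inputs).logits                  # [1, seq_length, vocab_size]

# Keep the top n candidate tokens for the masked position.
top_n = torch.topk(logits[0, mask_pos], k=10).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_n))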

Feature extractor

The text is encoded into a fixed-length embedding by extract_features.py (through the embedding, encoder, and pooling layers). Example of using extract_features.py:

python3 scripts/extract_features.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --config_path models/bert/base_config.json \
                                    --test_path datasets/tencent_profile.txt --prediction_path features.pt \
                                    --pooling first

The CLS embedding (--pooling first) is commonly used as the text embedding. However, when cosine similarity is used to measure the similarity between two texts, the raw CLS embedding is not a proper choice. According to recent work, it is necessary to apply a whitening operation first:

python3 scripts/extract_features.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --config_path models/bert/base_config.json \
                                    --test_path datasets/tencent_profile.txt --prediction_path features.pt \
                                    --pooling first --whitening_size 64

--whitening_size 64 indicates that the whitening operation is applied and that the transformed vectors have 64 dimensions. If --whitening_size is not specified, whitening is not used. Using whitening during feature extraction is recommended.
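
For reference, the whitening operation itself is just a linear transformation estimated from the extracted features: subtract the mean, then rotate and rescale with the SVD of the covariance matrix, keeping only the first whitening_size dimensions. A minimal sketch, assuming the features form an [N, hidden_size] matrix (this is not the UER-py implementation):

import torch

def whitening(vecs, size=64):
    # vecs: [N, hidden_size] matrix of sentence embeddings.
    mu = vecs.mean(dim=0, keepdim=True)
    cov = torch.cov((vecs - mu).T)                   # [hidden_size, hidden_size]
    u, s, _ = torch.linalg.svd(cov)                  # cov = u @ diag(s) @ u.T
    w = u @ torch.diag(1.0 / torch.sqrt(s))          # whitening matrix
    return (vecs - mu) @ w[:, :size]                 # keep the first `size` dims

features = torch.randn(1000, 768)                    # stand-in for extracted features
print(whitening(features, size=64).shape)            # torch.Size([1000, 64])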

Embedding extractor

extract_embeddings.py extracts word embeddings from the embedding layer of a pre-trained model. The embeddings here are traditional context-independent word embeddings. The extracted embeddings can be used to initialize the embedding layer of other models (e.g. a CNN). Example of using extract_embeddings.py:

python3 scripts/extract_embeddings.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                      --word_embedding_path embeddings.txt

--word_embedding_path specifies the path of the output word embedding file. The format of the embedding file follows the one used here and can be used directly by mainstream projects.
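
The extracted file can then be loaded like an ordinary word embedding file. A minimal sketch of initializing an embedding layer from it, assuming the common word2vec text format (a header line with vocabulary size and dimension, then one word and its vector per line); the exact format of embeddings.txt should be checked against the link above:

import torch
import torch.nn as nn

vocab, vectors = [], []
with open("embeddings.txt", encoding="utf-8") as f:
    vocab_size, dim = map(int, f.readline().split())     # header: "vocab_size dim"
    for line in f:
        parts = line.rstrip().split(" ")
        vocab.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])

# Use the extracted vectors to initialize the embedding layer of another model.
embedding_layer = nn.Embedding.from_pretrained(torch.tensor(vectors), freeze=False)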

Finding nearest neighbours

Pre-trained models can produce high-quality word embeddings. Traditional word embeddings such as word2vec and GloVe assign each word a fixed vector (context-independent embedding). However, polysemy is a common phenomenon in human language, and the meaning of a word depends on its context. We can therefore use the hidden states of a pre-trained model to represent a word. Notice that most Chinese pre-trained models are character-based; if word embeddings (rather than character embeddings) are needed, users should download a word-based BERT model and its vocabulary. Examples of finding nearest neighbours with context-independent word embeddings using scripts/topn_words_indep.py (character-based and word-based, respectively):

python3 scripts/topn_words_indep.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                    --test_path target_words.txt

python3 scripts/topn_words_indep.py --load_model_path models/wiki_bert_word_model.bin --vocab_path models/wiki_word_vocab.txt \
                                    --test_path target_words.txt

Context-independent word embeddings come from the model's embedding layer. The format of target_words.txt is as follows:

word-1
word-2
...
word-n
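
Since these embeddings come straight from the embedding layer, nearest-neighbour search reduces to ranking the whole vocabulary by cosine similarity against the target word's vector. A minimal sketch of the idea (the names are illustrative, not the script's internals):

import torch
import torch.nn.functional as F

def topn_words_indep(embedding, vocab, target_word, n=10):
    # embedding: [vocab_size, hidden_size] weight of the embedding layer.
    target_id = vocab.index(target_word)
    scores = F.cosine_similarity(embedding, embedding[target_id].unsqueeze(0), dim=-1)
    top = torch.topk(scores, k=n + 1).indices.tolist()   # +1 so the word itself can be skipped
    return [vocab[i] for i in top if i != target_id][:n]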

Examples of finding nearest neighbours with context-dependent word embeddings using scripts/topn_words_dep.py (character-based and word-based, respectively):

python3 scripts/topn_words_dep.py --load_model_path models/google_zh_model.bin --vocab_path models/google_zh_vocab.txt \
                                  --cand_vocab_path models/google_zh_vocab.txt --test_path target_words_with_sentences.txt --config_path models/bert/base_config.json \
                                  --batch_size 256 --seq_length 32 --tokenizer bert

python3 scripts/topn_words_dep.py --load_model_path models/bert_wiki_word_model.bin --vocab_path models/wiki_word_vocab.txt \
                                  --cand_vocab_path models/wiki_word_vocab.txt --test_path target_words_with_sentences.txt --config_path models/bert/base_config.json \
                                  --batch_size 256 --seq_length 32 --tokenizer space

We substitute the target word in the sentence with other words from the vocabulary (candidate words) and feed the resulting sequences into the network. The last-layer hidden state at the position of the target/candidate word is regarded as its context-dependent embedding. If the hidden states of two words are close in a specific context, they are likely to have similar meanings in that context.
--cand_vocab_path specifies the path of the candidate vocabulary file. Since the target word has to be replaced by every candidate word and each resulting sequence is fed into the network, a smaller candidate vocabulary can be used to speed up the search.
If a word-based model is used, the sentences in target_words_with_sentences.txt should be segmented in advance.
The format of target_words_with_sentences.txt is as follows:

word1 sent1
word2 sent2 
...
wordn sentn

The word and the sentence are separated by \t.
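
The replace-and-compare procedure described above can be summarized in a few lines. The sketch below uses a generic BERT encoder from HuggingFace transformers for illustration only (it is not the UER-py implementation, and it assumes one vocabulary token per character so that positions line up):

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def context_vector(tokens, position):
    # Last-layer hidden state at the position of the target/candidate word.
    inputs = tokenizer(" ".join(tokens), return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    return hidden[position + 1]                       # +1 skips the [CLS] token

tokens = ["他", "在", "银", "行", "取", "钱"]          # example sentence, one char per token
pos = 2                                               # position of the target word "银"
target_vec = context_vector(tokens, pos)

scores = {}
for cand in ["钱", "树", "水", "山"]:                  # small candidate vocabulary
    cand_vec = context_vector(tokens[:pos] + [cand] + tokens[pos + 1:], pos)
    scores[cand] = torch.cosine_similarity(target_vec, cand_vec, dim=0).item()

print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))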

Model average

average_models.py takes the average of multiple model weights, which usually leads to more robust performance. Example of using average_models.py:

python3 scripts/average_models.py --model_list_path models/book_review_model.bin-4000 models/book_review_model.bin-5000 \
                                  --output_model_path models/book_review_model.bin
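
Averaging is done parameter by parameter across the checkpoints. A minimal sketch of the idea (not necessarily identical to the script):

import torch

paths = ["models/book_review_model.bin-4000", "models/book_review_model.bin-5000"]
state_dicts = [torch.load(p, map_location="cpu") for p in paths]

# Element-wise average of every parameter tensor.
averaged = {key: sum(sd[key] for sd in state_dicts) / len(state_dicts)
            for key in state_dicts[0]}

torch.save(averaged, "models/book_review_model.bin")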

Text generator (language model)

We can use generate_lm.py to generate text with a language model. Given a few beginning words, generate_lm.py continues writing the text. Example of using generate_lm.py to load GPT-2 weights and continue writing:

python3 scripts/generate_lm.py --load_model_path models/gpt_model.bin --vocab_path models/google_zh_vocab.txt \
                               --test_path beginning.txt --prediction_path generated_text.txt \
                               --config_path models/gpt2/distil_config.json --seq_length 128 \
                               --embedding word_pos --remove_embedding_layernorm \
                               --encoder transformer --mask causal --layernorm_positioning pre \
                               --target lm --tie_weight

where beginning.txt contains the beginning of a text and generated_text.txt contains the text that the model writes.
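
Under the hood, generation is an autoregressive loop: predict the next token, append it to the context, and repeat. A minimal sketch of greedy decoding with a generic GPT-2 model from HuggingFace transformers (for illustration only; this is not generate_lm.py):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer.encode("Once upon a time", return_tensors="pt")
for _ in range(50):                                   # generate 50 new tokens
    with torch.no_grad():
        logits = model(ids).logits                    # [1, seq_length, vocab_size]
    next_id = torch.argmax(logits[0, -1]).view(1, 1)  # greedy decoding
    ids = torch.cat([ids, next_id], dim=1)

print(tokenizer.decode(ids[0]))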
