[Bug fixes] update chatglm tokenizer #7797
Thanks for your contribution!
Testing the chatglm tokenizers with the current code:
from paddlenlp.transformers import AutoTokenizer

def print_special_tokens(tokenizer):
    # Base special tokens plus the role tokens used in chat prompts.
    tokens = ["[MASK]", "[gMASK]", "[sMASK]", "sop", "eop"]
    role_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"]
    tokens = tokens + role_special_tokens
    for token in tokens:
        print("============================================================")
        print("token ->", token)
        # How the tokenizer splits the raw string.
        pieces = tokenizer.tokenize(token)
        print("tokens->", pieces)
        # Direct vocabulary lookup of the whole string.
        ids = tokenizer.convert_tokens_to_ids([token])
        print("ids ->", ids)

model_names = ["THUDM/chatglm-6b-v1.1", "THUDM/chatglm2-6b", "THUDM/chatglm3-6b"]
for model_name in model_names:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print_special_tokens(tokenizer)

Log:
/root/paddlejob/workspace/envs_paddle/wjj/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[2024-01-10 15:07:05,338] [ INFO] - Found /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/tokenizer_config.json
[2024-01-10 15:07:05,339] [ INFO] - We are using <class 'paddlenlp.transformers.chatglm.tokenizer.ChatGLMTokenizer'> to load 'THUDM/chatglm-6b-v1.1'.
[2024-01-10 15:07:05,339] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/ice_text.model
[2024-01-10 15:07:05,339] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/
[2024-01-10 15:07:05,388] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/added_tokens.json> not exist
[2024-01-10 15:07:05,389] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/
[2024-01-10 15:07:05,425] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/special_tokens_map.json> not exist
[2024-01-10 15:07:05,425] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/tokenizer_config.json
[2024-01-10 15:07:05,425] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids -> [130000]
============================================================
token -> [gMASK]
tokens-> ['▁[', 'g', 'MASK', ']']
ids -> [130001]
============================================================
token -> [sMASK]
tokens-> ['▁[', 's', 'MASK', ']']
ids -> [130002]
============================================================
token -> sop
tokens-> ['▁so', 'p']
ids -> [0]
============================================================
token -> eop
tokens-> ['▁e', 'op']
ids -> [0]
============================================================
token -> <|system|>
tokens-> ['▁<', '|', 'system', '|', '>']
ids -> [0]
============================================================
token -> <|user|>
tokens-> ['▁<', '|', 'user', '|', '>']
ids -> [0]
============================================================
token -> <|assistant|>
tokens-> ['▁<', '|', 'assistant', '|', '>']
ids -> [0]
============================================================
token -> <|observation|>
tokens-> ['▁<', '|', 'observation', '|', '>']
ids -> [0]
[2024-01-10 15:07:05,610] [ INFO] - Found /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer_config.json
[2024-01-10 15:07:05,613] [ INFO] - We are using <class 'paddlenlp.transformers.chatglm_v2.tokenizer.ChatGLMv2Tokenizer'> to load 'THUDM/chatglm2-6b'.
[2024-01-10 15:07:05,614] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer.model
[2024-01-10 15:07:05,614] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm2-6b/
[2024-01-10 15:07:05,649] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/added_tokens.json> not exist
[2024-01-10 15:07:05,649] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm2-6b/
[2024-01-10 15:07:05,693] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/special_tokens_map.json> not exist
[2024-01-10 15:07:05,694] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer_config.json
[2024-01-10 15:07:05,694] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids -> [64789]
============================================================
token -> [gMASK]
tokens-> ['[gMASK]']
ids -> [64790]
============================================================
token -> [sMASK]
tokens-> ['[sMASK]']
ids -> [64791]
============================================================
token -> sop
tokens-> ['sop']
ids -> [64792]
============================================================
token -> eop
tokens-> ['eop']
ids -> [64793]
============================================================
token -> <|system|>
tokens-> ['▁<', '|', 'system', '|', '>']
ids -> [0]
============================================================
token -> <|user|>
tokens-> ['▁<', '|', 'user', '|', '>']
ids -> [0]
============================================================
token -> <|assistant|>
tokens-> ['▁<', '|', 'ass', 'istant', '|', '>']
ids -> [0]
============================================================
token -> <|observation|>
tokens-> ['▁<', '|', 'ob', 'serv', 'ation', '|', '>']
ids -> [0]
[2024-01-10 15:07:05,729] [ INFO] - Found /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer_config.json
[2024-01-10 15:07:05,729] [ INFO] - We are using <class 'paddlenlp.transformers.chatglm_v2.tokenizer.ChatGLMv2Tokenizer'> to load 'THUDM/chatglm3-6b'.
[2024-01-10 15:07:05,730] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer.model
[2024-01-10 15:07:05,730] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm3-6b/
[2024-01-10 15:07:05,770] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/added_tokens.json> not exist
[2024-01-10 15:07:05,771] [ INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm3-6b/
[2024-01-10 15:07:05,806] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/special_tokens_map.json> not exist
[2024-01-10 15:07:05,806] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer_config.json
[2024-01-10 15:07:05,806] [ INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids -> [64789]
============================================================
token -> [gMASK]
tokens-> ['[gMASK]']
ids -> [64790]
============================================================
token -> [sMASK]
tokens-> ['[sMASK]']
ids -> [64791]
============================================================
token -> sop
tokens-> ['sop']
ids -> [64792]
============================================================
token -> eop
tokens-> ['eop']
ids -> [64793]
============================================================
token -> <|system|>
tokens-> ['<|system|>']
ids -> [64794]
============================================================
token -> <|user|>
tokens-> ['<|user|>']
ids -> [64795]
============================================================
token -> <|assistant|>
tokens-> ['<|assistant|>']
ids -> [64796]
============================================================
token -> <|observation|>
tokens-> ['<|observation|>']
ids -> [64797]
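
In the log above, a token the tokenizer recognizes comes back from tokenize as a single piece and converts to a non-zero id, while an unrecognized one is split into sub-pieces and converts to 0. The following is a minimal sketch of turning that observation into an automated check. It assumes that 0 is the unknown-token id (inferred from the log, not confirmed by this PR). `check_special_tokens` and `StubTokenizer` are hypothetical names used only for illustration and are not part of paddlenlp; the stub's ids are copied from the chatglm2 log lines so the example runs without downloading a model.

```python
from typing import Dict, List


def check_special_tokens(tokenizer, tokens: List[str], unk_id: int = 0) -> Dict[str, Dict[str, bool]]:
    """Report, per token, whether it is kept whole and whether it has a real id.

    `tokenizer` is any object exposing `tokenize` and `convert_tokens_to_ids`,
    the two methods used by the test script above.
    `unk_id=0` is an assumption inferred from the log output.
    """
    report = {}
    for token in tokens:
        pieces = tokenizer.tokenize(token)
        token_id = tokenizer.convert_tokens_to_ids([token])[0]
        report[token] = {
            "kept_whole": pieces == [token],  # not split into sub-pieces
            "has_id": token_id != unk_id,     # vocabulary lookup succeeded
        }
    return report


class StubTokenizer:
    """Hypothetical stand-in for a real tokenizer, for demonstration only."""

    def __init__(self, special: Dict[str, int]):
        self.special = special

    def tokenize(self, text: str) -> List[str]:
        # Known special tokens stay whole; anything else is split per character.
        return [text] if text in self.special else list(text)

    def convert_tokens_to_ids(self, tokens: List[str]) -> List[int]:
        # Unknown tokens fall back to id 0, mirroring the log above.
        return [self.special.get(t, 0) for t in tokens]


# Ids copied from the chatglm2 section of the log, where <|user|> is not registered.
stub = StubTokenizer({"[gMASK]": 64790, "sop": 64792})
report = check_special_tokens(stub, ["[gMASK]", "sop", "<|user|>"])
for token, result in report.items():
    print(token, result)
```

With a real model, the same function could be called on the object returned by `AutoTokenizer.from_pretrained(model_name)`. Reporting the two flags separately matters because they can disagree: in the chatglm-6b-v1.1 section, `[gMASK]` is split into four pieces by tokenize yet still converts to id 130001.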
Codecov Report
Additional details and impacted files
@@            Coverage Diff             @@
##           develop    #7797     +/-   ##
===========================================
- Coverage    57.42%   56.96%   -0.46%
===========================================
  Files          585      587       +2
  Lines        87976    88647     +671
===========================================
- Hits         50517    50498      -19
- Misses       37459    38149     +690
LGTM
* update chatglm tokenizer
* update chatglm2 tokenizer
* update chatglm2 tokenizer
* update max & src slider
* add chatglm2 tokenizer
PR types
Bug fixes
PR changes
Tokenizer
Description
Update the tokenizers for chatglm2 and chatglm3.