
[Bug fixes] update chatglm tokenizer #7797

Merged 5 commits into PaddlePaddle:develop on Jan 12, 2024

Conversation

wj-Mcat (Contributor) commented Jan 8, 2024

PR types

Bug fixes

PR changes

Tokenizer

Description

Update the tokenizers for chatglm2 and chatglm3.

paddle-bot bot commented Jan 8, 2024

Thanks for your contribution!

wj-Mcat (Contributor, Author) commented Jan 9, 2024

The following code exercises the chatglm tokenizers:

from paddlenlp.transformers import AutoTokenizer

def print_special_tokens(tokenizer):
    tokens = ["[MASK]", "[gMASK]", "[sMASK]", "sop", "eop"]
    role_special_tokens = ["<|system|>", "<|user|>", "<|assistant|>", "<|observation|>"]
    tokens = tokens + role_special_tokens

    for token in tokens:
        print("============================================================")
        print("token ->", token)
        # Use a distinct name so the list being iterated is not reassigned.
        sub_tokens = tokenizer.tokenize(token)
        print("tokens->", sub_tokens)
        ids = tokenizer.convert_tokens_to_ids([token])
        print("ids    ->", ids)

model_names = ["THUDM/chatglm-6b-v1.1", "THUDM/chatglm2-6b", "THUDM/chatglm3-6b"]
for model_name in model_names:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    print_special_tokens(tokenizer)

Log output:

/root/paddlejob/workspace/envs_paddle/wjj/lib/python3.8/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
[2024-01-10 15:07:05,338] [    INFO] - Found /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/tokenizer_config.json
[2024-01-10 15:07:05,339] [    INFO] - We are using <class 'paddlenlp.transformers.chatglm.tokenizer.ChatGLMTokenizer'> to load 'THUDM/chatglm-6b-v1.1'.
[2024-01-10 15:07:05,339] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/ice_text.model
[2024-01-10 15:07:05,339] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/
[2024-01-10 15:07:05,388] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/added_tokens.json> not exist
[2024-01-10 15:07:05,389] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/
[2024-01-10 15:07:05,425] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm-6b-v1.1/special_tokens_map.json> not exist
[2024-01-10 15:07:05,425] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/tokenizer_config.json
[2024-01-10 15:07:05,425] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm-6b-v1.1/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids    -> [130000]
============================================================
token -> [gMASK]
tokens-> ['▁[', 'g', 'MASK', ']']
ids    -> [130001]
============================================================
token -> [sMASK]
tokens-> ['▁[', 's', 'MASK', ']']
ids    -> [130002]
============================================================
token -> sop
tokens-> ['▁so', 'p']
ids    -> [0]
============================================================
token -> eop
tokens-> ['▁e', 'op']
ids    -> [0]
============================================================
token -> <|system|>
tokens-> ['▁<', '|', 'system', '|', '>']
ids    -> [0]
============================================================
token -> <|user|>
tokens-> ['▁<', '|', 'user', '|', '>']
ids    -> [0]
============================================================
token -> <|assistant|>
tokens-> ['▁<', '|', 'assistant', '|', '>']
ids    -> [0]
============================================================
token -> <|observation|>
tokens-> ['▁<', '|', 'observation', '|', '>']
ids    -> [0]
[2024-01-10 15:07:05,610] [    INFO] - Found /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer_config.json
[2024-01-10 15:07:05,613] [    INFO] - We are using <class 'paddlenlp.transformers.chatglm_v2.tokenizer.ChatGLMv2Tokenizer'> to load 'THUDM/chatglm2-6b'.
[2024-01-10 15:07:05,614] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer.model
[2024-01-10 15:07:05,614] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm2-6b/
[2024-01-10 15:07:05,649] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/added_tokens.json> not exist
[2024-01-10 15:07:05,649] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm2-6b/
[2024-01-10 15:07:05,693] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm2-6b/special_tokens_map.json> not exist
[2024-01-10 15:07:05,694] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/tokenizer_config.json
[2024-01-10 15:07:05,694] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm2-6b/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids    -> [64789]
============================================================
token -> [gMASK]
tokens-> ['[gMASK]']
ids    -> [64790]
============================================================
token -> [sMASK]
tokens-> ['[sMASK]']
ids    -> [64791]
============================================================
token -> sop
tokens-> ['sop']
ids    -> [64792]
============================================================
token -> eop
tokens-> ['eop']
ids    -> [64793]
============================================================
token -> <|system|>
tokens-> ['▁<', '|', 'system', '|', '>']
ids    -> [0]
============================================================
token -> <|user|>
tokens-> ['▁<', '|', 'user', '|', '>']
ids    -> [0]
============================================================
token -> <|assistant|>
tokens-> ['▁<', '|', 'ass', 'istant', '|', '>']
ids    -> [0]
============================================================
token -> <|observation|>
tokens-> ['▁<', '|', 'ob', 'serv', 'ation', '|', '>']
ids    -> [0]
[2024-01-10 15:07:05,729] [    INFO] - Found /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer_config.json
[2024-01-10 15:07:05,729] [    INFO] - We are using <class 'paddlenlp.transformers.chatglm_v2.tokenizer.ChatGLMv2Tokenizer'> to load 'THUDM/chatglm3-6b'.
[2024-01-10 15:07:05,730] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer.model
[2024-01-10 15:07:05,730] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/added_tokens.json and saved to /root/.paddlenlp/models/THUDM/chatglm3-6b/
[2024-01-10 15:07:05,770] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/added_tokens.json> not exist
[2024-01-10 15:07:05,771] [    INFO] - Downloading https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/special_tokens_map.json and saved to /root/.paddlenlp/models/THUDM/chatglm3-6b/
[2024-01-10 15:07:05,806] [ WARNING] - file<https://bj.bcebos.com/paddlenlp/models/community/THUDM/chatglm3-6b/special_tokens_map.json> not exist
[2024-01-10 15:07:05,806] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/tokenizer_config.json
[2024-01-10 15:07:05,806] [    INFO] - Already cached /root/.paddlenlp/models/THUDM/chatglm3-6b/chat_template.json
============================================================
token -> [MASK]
tokens-> ['[MASK]']
ids    -> [64789]
============================================================
token -> [gMASK]
tokens-> ['[gMASK]']
ids    -> [64790]
============================================================
token -> [sMASK]
tokens-> ['[sMASK]']
ids    -> [64791]
============================================================
token -> sop
tokens-> ['sop']
ids    -> [64792]
============================================================
token -> eop
tokens-> ['eop']
ids    -> [64793]
============================================================
token -> <|system|>
tokens-> ['<|system|>']
ids    -> [64794]
============================================================
token -> <|user|>
tokens-> ['<|user|>']
ids    -> [64795]
============================================================
token -> <|assistant|>
tokens-> ['<|assistant|>']
ids    -> [64796]
============================================================
token -> <|observation|>
tokens-> ['<|observation|>']
ids    -> [64797]
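The contrast in the output above (chatglm2 falls back to id 0 for the role tokens, while chatglm3 keeps them whole and maps them to 64794–64797) comes down to whether the tokenizer matches special tokens before handing text to the SentencePiece model. A minimal sketch of that pre-splitting step follows; the `SPECIAL_TOKENS` table, helper names, and the toy stand-in for SentencePiece are illustrative only, not PaddleNLP API:

```python
import re

# Special tokens and the reserved ids chatglm3 assigns them (from the log above).
SPECIAL_TOKENS = {
    "[MASK]": 64789, "[gMASK]": 64790, "[sMASK]": 64791,
    "sop": 64792, "eop": 64793,
    "<|system|>": 64794, "<|user|>": 64795,
    "<|assistant|>": 64796, "<|observation|>": 64797,
}

def tokenize_with_specials(text, sp_tokenize):
    """Split `text` around exact special-token matches; only the plain-text
    segments in between are passed to the underlying subword tokenizer."""
    pattern = "(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")"
    pieces = []
    for segment in re.split(pattern, text):
        if not segment:
            continue
        if segment in SPECIAL_TOKENS:
            pieces.append(segment)               # kept whole -> reserved id
        else:
            pieces.extend(sp_tokenize(segment))  # normal subword tokenization
    return pieces

# Toy stand-in for the SentencePiece model, purely for demonstration.
toy_sp = lambda s: s.split()
print(tokenize_with_specials("<|user|> hello [gMASK]", toy_sp))
# -> ['<|user|>', 'hello', '[gMASK]']
```

Without the pre-split, `<|user|>` reaches SentencePiece as ordinary text and is broken into pieces like `['▁<', '|', 'user', '|', '>']`, as the chatglm-6b-v1.1 and chatglm2 runs show.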


codecov bot commented Jan 10, 2024

Codecov Report

Attention: 2 lines in your changes are missing coverage. Please review.

Comparison is base (17acf22) 57.42% compared to head (ebe8c6d) 56.96%.
Report is 19 commits behind head on develop.

Files Patch % Lines
paddlenlp/transformers/chatglm_v2/tokenizer.py 90.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7797      +/-   ##
===========================================
- Coverage    57.42%   56.96%   -0.46%     
===========================================
  Files          585      587       +2     
  Lines        87976    88647     +671     
===========================================
- Hits         50517    50498      -19     
- Misses       37459    38149     +690     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

wj-Mcat marked this pull request as ready for review on January 12, 2024 07:35
JunnYu (Member) left a comment:
LGTM

JunnYu merged commit b44f888 into PaddlePaddle:develop on Jan 12, 2024
9 of 10 checks passed
JunnYu pushed a commit that referenced this pull request Jan 12, 2024
* update chatglm tokenizer

* update chatglm2 tokenizer

* update chatglm2 tokenizer

* update max & src slider

* add chatglm2 tokenizer