Update special token handling in conversion scripts for gpt2 derived tokenizers #3746

Galunid · 2023-10-23T14:59:45Z

It's not tested yet; I'll test it by updating #3742 later today.

Change discussed in #3730

Galunid · 2023-10-23T15:00:15Z

Baichuan does conversion differently, I'll have to look a bit more into it

goerch · 2023-10-23T15:24:43Z

> Baichuan does conversion differently, I'll have to look a bit more into it

Is there an immediate need? Tests work for me (with a rather old download) without any change.

Galunid · 2023-10-23T15:41:41Z

Not really, I took a look and I think it's best to leave it as it is, since it seems to be working.

Galunid · 2023-10-23T15:42:16Z

I converted all the other models from this PR and the tokenizer tests pass, so it's probably good to go

maddes8cht · 2023-10-25T20:47:18Z

I am getting this error:

python convert-mpt-hf-to-gguf.py e:\hf\mpt-7b-storywriter\
gguf: loading model mpt-7b-storywriter
gguf: found 2 model parts
This gguf file is for Little Endian only
gguf: get model metadata
gguf: get tokenizer metadata
gguf: get gpt2 tokenizer vocab
Traceback (most recent call last):
  File "e:\hf\llama.cpp\convert-mpt-hf-to-gguf.py", line 140, in <module>
    if tokenizer.added_tokens_decoder[i].special:
AttributeError: 'GPTNeoXTokenizerFast' object has no attribute 'added_tokens_decoder'

Before this PR the convert-mpt-hf-to-gguf.py script worked for me.

I am running on Windows 10 in a conda env with python 3.10.13

cebtenzzre · 2023-10-25T21:16:34Z

> I am getting this error:

You need to update your 'transformers' package to at least v4.34.0.
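
A minimal sketch of how a conversion script could fail with a clearer message in this situation, assuming (as the traceback suggests) that `added_tokens_decoder` maps token ids to objects carrying a `special` flag. The helper `is_special_token` and the stand-in tokenizer objects are hypothetical and are not part of the conversion scripts:

```python
from types import SimpleNamespace

MIN_TRANSFORMERS = "4.34.0"  # minimum version named in the reply above


def is_special_token(tokenizer, token_id):
    """Return True if token_id is an added token flagged as special.

    Hypothetical helper for illustration only.
    """
    decoder = getattr(tokenizer, "added_tokens_decoder", None)
    if decoder is None:
        # Older tokenizer objects lack the attribute, which is what
        # produced the AttributeError in the report above.
        raise RuntimeError(
            f"{type(tokenizer).__name__} has no 'added_tokens_decoder'; "
            f"update 'transformers' to at least v{MIN_TRANSFORMERS}"
        )
    token = decoder.get(token_id)
    return token is not None and bool(token.special)


# Stand-ins for real tokenizers, so the sketch runs without 'transformers'.
new_style = SimpleNamespace(added_tokens_decoder={
    0: SimpleNamespace(content="<|endoftext|>", special=True),
    1: SimpleNamespace(content="<custom>", special=False),
})
old_style = SimpleNamespace()  # mimics a tokenizer without the attribute
```

With a real tokenizer, the same check would be called once per added token id; ids that are not in `added_tokens_decoder` are treated as not special.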

Galunid added 2 commits October 23, 2023 16:55

Update special token handling

217f82e

Add mpt

c04ddb6

goerch approved these changes Oct 23, 2023

goerch merged commit 69a6735 into ggerganov:master Oct 23, 2023
6 checks passed

Galunid deleted the update-conversion-scripts branch October 23, 2023 19:50

maddes8cht mentioned this pull request Oct 25, 2023

#3746 introduces error in convert-mpt-hf-to-gguf.py #3783

Closed

cebtenzzre mentioned this pull request Oct 31, 2023

convert : restore Falcon vocab padding #3864

Closed
