Add phi-2 tokenizer #7300
Conversation
NotImplementedError: BPE pre-tokenizer was not recognized - update get_vocab_base_pre()
```diff
@@ -469,6 +469,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "27949a2493fc4a9f53f5b9b029c82689cfbe5d3a1929bb25e043089e28466de6":
             # ref: https://huggingface.co/jinaai/jina-embeddings-v2-base-de
             res = "jina-v2-de"
+        if chkhsh == "fcace8b9cac38ce847670c970cd5892031a753a1ef381abd1d9af00f713da085":
+            # ref: https://huggingface.co/microsoft/phi-2
+            res = "phi-2"
```
This new pre-tokenizer has to be handled in llama.cpp as well (lines 4414 to 4475 at e18bc6a):
```cpp
    // for now, only BPE models have pre-tokenizers
    if (vocab.type == LLAMA_VOCAB_TYPE_BPE) {
        if (tokenizer_pre.empty()) {
            LLAMA_LOG_WARN("%s: missing pre-tokenizer type, using: 'default'\n", __func__);
            LLAMA_LOG_WARN("%s: \n", __func__);
            LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
            LLAMA_LOG_WARN("%s: GENERATION QUALITY WILL BE DEGRADED! \n", __func__);
            LLAMA_LOG_WARN("%s: CONSIDER REGENERATING THE MODEL \n", __func__);
            LLAMA_LOG_WARN("%s: ************************************ \n", __func__);
            LLAMA_LOG_WARN("%s: \n", __func__);
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
        } else if (
                tokenizer_pre == "default") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
        } else if (
                tokenizer_pre == "llama3" ||
                tokenizer_pre == "llama-v3" ||
                tokenizer_pre == "llama-bpe") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_LLAMA3;
        } else if (
                tokenizer_pre == "deepseek-llm") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM;
        } else if (
                tokenizer_pre == "deepseek-coder") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER;
        } else if (
                tokenizer_pre == "falcon") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_FALCON;
        } else if (
                tokenizer_pre == "mpt") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_MPT;
        } else if (
                tokenizer_pre == "starcoder") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_STARCODER;
        } else if (
                tokenizer_pre == "gpt-2" ||
                tokenizer_pre == "jina-es" ||
                tokenizer_pre == "jina-de" ||
                tokenizer_pre == "jina-v2-es" ||
                tokenizer_pre == "jina-v2-de") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_GPT2;
        } else if (
                tokenizer_pre == "refact") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_REFACT;
        } else if (
                tokenizer_pre == "command-r") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_COMMAND_R;
        } else if (
                tokenizer_pre == "qwen2") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_QWEN2;
        } else if (
                tokenizer_pre == "olmo") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_OLMO;
        } else if (
                tokenizer_pre == "dbrx") {
            vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DBRX;
        } else {
            throw std::runtime_error(format("unknown pre-tokenizer type: '%s'", tokenizer_pre.c_str()));
        }
    } else {
        vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_DEFAULT;
    }
```
@ggerganov That's not necessary. I solved this already in #7219 and #7117.
Hi @BramVanroy, I was encouraging you in #7022 to test that the HF and llama.cpp tokenizations are identical. Here is a Colab you could modify to try: https://colab.research.google.com/drive/1RYlEj2UhylYWyaASFo-LLATzZ8d29Z0T?usp=sharing
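Roughly what such a check looks like; a sketch, not the Colab's exact code. It assumes the llama-cpp-python bindings are installed, and "phi-2.gguf" is a hypothetical path to a model produced by convert-hf-to-gguf.py:

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

text = "Hello world! 123\t\n"

# reference tokenization from the Hugging Face tokenizer
hf_tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
hf_ids = hf_tok.encode(text)

# llama.cpp tokenization via the Python bindings; vocab_only skips
# loading the model weights since we only need the tokenizer
llm = Llama(model_path="phi-2.gguf", vocab_only=True)
ll_ids = llm.tokenize(text.encode("utf-8"), add_bos=False)

assert hf_ids == ll_ids, f"tokenization mismatch: {hf_ids} vs {ll_ids}"
print("HF and llama.cpp tokenizations match")
```

In practice you would want to run this over many strings covering whitespace, digits, code, and non-Latin scripts, since pre-tokenizer differences often only show up on edge cases.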
I'm unsure what has changed, but it seems that phi-2 models are working again, so that's good news. Will close this one for now.
Well, convert-hf-to-gguf-update.py still doesn't have a "phi-2" entry. The models should work with the default tokenizer, though (but that has been the case for a long time).
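For reference, such an entry would presumably mirror the shape of the existing ones in that script's `models` list. A sketch, with the enum stubbed out approximately as the script defines it:

```python
from enum import IntEnum, auto

class TOKENIZER_TYPE(IntEnum):  # stub of the enum defined in the update script
    SPM = auto()
    BPE = auto()
    WPM = auto()

models = [
    # ... existing entries elided ...
    {"name": "phi-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/microsoft/phi-2"},
]
```

The script downloads each listed tokenizer, computes its hash as shown earlier, and regenerates the `chkhsh` checks in convert-hf-to-gguf.py.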
The proposed changes add support for phi-2, which uses CodeGenTokenizer, a BPE tokenizer. Without them, converting the model fails with the NotImplementedError quoted at the top of this PR.

closes #7022