Viking tokenizer support #7328
Conversation
The proper way is to update `convert-hf-to-gguf-update.py` and validate that the tokenization tests pass.
Force-pushed from ab842e3 to 69f815d
@ggerganov Thanks for the pointers. I think I'm doing things more correctly in this iteration, but the tokenizer tests still fail, printing [...] among other output. cc @jonabur (of the Viking team)
If "pre_tokenizer": {
"type": "Sequence",
"pretokenizers": [
{
"type": "Split",
"pattern": {
"Regex": " ?[^(\\s|[.,!?…。,、।۔،])]+"
},
"behavior": "Isolated",
"invert": false
},
{
"type": "Digits",
"individual_digits": true
},
{
"type": "ByteLevel",
"add_prefix_space": false,
"trim_offsets": true,
"use_regex": false
}
]
}, So you have to implement use the respective regexes in Lines 12287 to 12378 in 27b0406
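For illustration, a sketch of what such a case might look like inside `llm_tokenizer_bpe` (assumptions: `LLAMA_VOCAB_PRE_TYPE_VIKING` is a new enum value this PR would add, and the `Digits` pretokenizer with `individual_digits: true` is approximated by a bare `\p{N}` pattern):

```cpp
// Sketch only, not the final PR code: a new case in the pre-tokenizer
// dispatch inside llm_tokenizer_bpe (llama.cpp).
case LLAMA_VOCAB_PRE_TYPE_VIKING: // hypothetical new enum value
    word_collection = unicode_regex_split(text, {
        // the "Split" pattern, copied verbatim from tokenizer.json
        " ?[^(\\s|[.,!?…。,、।۔،])]+",
        // "Digits" with individual_digits=true: split each digit separately
        "\\p{N}",
    });
    break;
```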
We've also run into this same issue providing a gguf'd version of Poro, using the same tokenizer regex. @akx have you made any progress converting the regex format?
@jonabur I'm working on it (again) now :)
EDIT: evidently it is, same
I'm not sure what to make of the failing tokenizer tests even after adding in the regexp (and I suppose "digits" means another regexp should be splitting all digits into separate tokens?). The actual detokenized output matches the expected output, but the token sequence doesn't. The first commit in this PR now improves the output to

```diff
-expected tokens:
+got tokens:
 746 '
'
 2392 '
'
 55899 '
@@ -169,37 +169,43 @@
 3395 '""'
 30917 '......'
 17846 '!!!!'
 2420 '!!'
 13728 '????'
 3963 '??'
- 9873 ' I've'
+ 383 ' I'
+ 7029 ''ve'
 1912 ' been'
- 37493 ' 't'
+ 630 ' ''
+ 107 't'
 733 'old'
- 17600 ' he's'
+ 627 ' he'
+ 689 ''s'
 1923 ' there'
 35 ','
 630 ' ''
 1417 'RE'
 791 ' you'
 6189 ' sure'
 54 '?'
- 23586 ' 'M'
+ 630 ' ''
+ 68 'M'
 835 ' not'
 6189 ' sure'
- 18068 ' I'll'
+ 383 ' I'
+ 6704 ''ll'
 2463 ' make'
 590 ' it'
 35 ','
- 35018 ' 'D'
+ 630 ' ''
+ 59 'D'
 791 ' you'
 1647 ' like'
 2032 ' some'
 22940 ' tea'
 54 '?'
 2221 ' We'
 30 '''
 6815 'Ve'
 279 ' a'
 79905 ''l'
 67 'L'
```

and

```diff
-expected tokens:
+got tokens:
 348 ' '
 40540 ' Hello'
- 472 '
+ 209 '
'
+ 348 ' '
 40540 ' Hello'
```
I'm not familiar with the tokenizer regex, unfortunately; we inherited it from Bloom (maybe it should be called the bloom tokenizer instead?). But it looks like it's using some possibly non-standard regex features, and it may not translate directly to the regex format supported in llama.cpp. In particular, capturing groups inside character classes, or embedding character classes inside character classes, both seem possibly non-standard to me, though it's been a long time since I've done regexes anywhere near this complicated.
tokenizer_pre == "llama3" || | ||
tokenizer_pre == "llama-v3" || | ||
tokenizer_pre == "llama-bpe" || | ||
tokenizer_pre == "viking-7b") { |
This needs two changes:

- `tokenizer_pre` needs to be updated to match the `"viking"` in `convert-hf-to-gguf.py`
- this needs its own `if`-statement block which sets `vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_VIKING` (see the sketch below)
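A minimal sketch of that second change, slotting into the existing dispatch chain (the string `"viking"` is whatever the conversion script writes, so it must match exactly):

```cpp
// Sketch: a separate branch instead of piggybacking on the llama-bpe one;
// LLAMA_VOCAB_PRE_TYPE_VIKING is the new enum value this review asks for.
} else if (tokenizer_pre == "viking") {
    vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_VIKING;
}
```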
```diff
@@ -12580,6 +12581,11 @@ struct llm_tokenizer_bpe {
             "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
         });
         break;
+    case LLAMA_VOCAB_PRE_TYPE_VIKING:
+        word_collection = unicode_regex_split(text, {
+            " ?[^(\\s|[.,!?…。,、।۔،])]+",
```
I think this regex works for the first test that fails, but I'm still left with one failing test I don't understand:

```cpp
" ?[^\\s.,!?…。,、।۔،]+",
```

I suggested two changes which enable the code to work and the tests to pass, but now a different test is failing and I don't understand why. It looks like a Unicode character is being split? Any idea what's going on here? I'm also not confident in the updated regex, because I'm not sure what the expected behavior is for embedding a character class inside a character class, or for a capturing group within a character class (`[^(\\s|[...])]`); I've never seen either done before. But it seems like the simpler statement should work?
Here are a couple of updates related to this issue:

If you remove this part (the `Digits` block in `tokenizer.json`) and regenerate the vocab files and run the tests again, the tests will pass. This is the failing test:

After we remove the digits block in `tokenizer.json`, regenerate the vocab files, and run the tests, we get:
Try to add the following regex to the viking pre-tokenizer in
Thanks, that worked! So now we need to separate the pre-tokenizers for Poro and Viking. In `llama.cpp`, this needs to be added:
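The snippet itself didn't survive in this thread, so, purely as a hypothetical illustration, here is a sketch of what the separation might look like (the `LLAMA_VOCAB_PRE_TYPE_PORO` enum value and the regex list are assumptions, not the original snippet):

```cpp
// Hypothetical sketch: give Poro its own pre-tokenizer type instead of
// sharing the Viking case; enum value and regexes are assumptions.
case LLAMA_VOCAB_PRE_TYPE_PORO:
    word_collection = unicode_regex_split(text, {
        " ?[^(\\s|[.,!?…。,、।۔،])]+",
        "\\p{N}",
    });
    break;
```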
Closing in favor of #8135 :)
LumiOpen/Viking-7B has a variant tokenizer. Just using `llama-bpe` had the model generate sensible Finnish, but this PR attempts to do things a bit more correctly. See the earlier conversion (made with `llama-bpe`): https://huggingface.co/akx/Viking-7B-gguf