Viking tokenizer support #7328

akx · 2024-05-16T13:22:11Z

LumiOpen/Viking-7B has a variant tokenizer

Just using llama-bpe had the model generate sensible Finnish, but this is attempting to do things a bit more correctly.

See

https://huggingface.co/LumiOpen/Viking-7B/discussions/2#664602c8b88f8519c3d50113 for discussion.
converted model (with just llama-bpe): https://huggingface.co/akx/Viking-7B-gguf

ggerganov

The proper way is to update convert-hf-to-gguf-update.py and validate that the tokenization tests pass

akx · 2024-05-17T11:38:35Z

@ggerganov Thanks for the pointers. I think I'm doing things more correctly in this iteration, but the test-tokenizer-0 test is failing (output gist here), and I'm not quite sure where to go from there, even after trying to follow #6920...

test-tokenizer-1-bpe doesn't fail, but prints

llm_load_vocab: mismatch in special tokens definition ( 11/131072 vs 24/131072 ).

among other output.

cc @jonabur (of the Viking team)

ggerganov · 2024-05-17T12:07:52Z

If test-tokenizer-0 fails, this likely means that the pre-tokenizer config is not exactly the same as LLaMA3. I just checked the tokenizer.json file of that model and here is the relevant section:

  "pre_tokenizer": {
    "type": "Sequence",
    "pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": " ?[^(\\s|[.,!?…。，、।۔،])]+"
        },
        "behavior": "Isolated",
        "invert": false
      },
      {
        "type": "Digits",
        "individual_digits": true
      },
      {
        "type": "ByteLevel",
        "add_prefix_space": false,
        "trim_offsets": true,
        "use_regex": false
      }
    ]
  },

So you have to implement use the respective regexes in llama.cpp instead of re-using LLaMA3:

llama.cpp/llama.cpp

Lines 12287 to 12378 in 27b0406

    
           switch (vocab.type) { 
        
               case LLAMA_VOCAB_TYPE_BPE: 
        
                   switch (vocab.type_pre) { 
        
                       case LLAMA_VOCAB_PRE_TYPE_LLAMA3: 
        
                           ignore_merges = true; 
        
                           word_collection = unicode_regex_split(text, { 
        
                               // original regex from tokenizer.json 
        
                               //"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+", 
        
                               // adapted: https://github.com/ggerganov/llama.cpp/pull/6920#issuecomment-2080233989 
        
                               "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+", 
        
                           }); 
        
                           break; 
        
                       case LLAMA_VOCAB_PRE_TYPE_DBRX: 
        
                           word_collection = unicode_regex_split(text, { 
        
                               // same as llama3 
        
                               "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+", 
        
                           }); 
        
                           break; 
        
                       case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM: 
        
                           word_collection = unicode_regex_split(text, { 
        
                               "[\r\n]", 
        
                               "\\s?[A-Za-zµÀ-ÖØ-öø-ƺƼ-ƿǄ-ʓʕ-ʯͰ-ͳͶͷͻ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-ՖႠ-ჅᎠ-Ᏽᏸ-ᏽᲐ-ᲺᲽ-Ჿᴀ-ᴫᵫ-ᵷᵹ-ᶚḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℴℹℼ-ℿⅅ-ⅉⅎↃↄⰀ-ⱻⱾ-ⳤⳫ-ⳮⳲⳳꙀ-ꙭꚀ-ꚛꜢ-ꝯꝱ-ꞇꞋ-ꞎꭰ-ꮿﬀ-ﬆﬓ-ﬗＡ-Ｚａ-ｚ𐐀-𐑏𐒰-𐓓𐓘-𐓻𐲀-𐲲𐳀-𐳲𑢠-𑣟𞤀-𞥃]+", 
        
                               "\\s?[!-/:-~！-／：-～‘-‟　-。]+", 
        
                               "\\s+$", 
        
                               "[一-龥ࠀ-一가-퟿]+", 
        
                               "\\p{N}+", 
        
                           }); 
        
                           break; 
        
                       case LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER: 
        
                           word_collection = unicode_regex_split(text, { 
        
                               "[\r\n]", 
        
                               "\\s?\\p{L}+", 
        
                               "\\s?\\p{P}+", 
        
                               "[一-龥ࠀ-一가-퟿]+", 
        
                               "\\p{N}", 
        
                           }); 
        
                           break; 
        
                       case LLAMA_VOCAB_PRE_TYPE_FALCON: 
        
                           word_collection = unicode_regex_split(text, { 
        
                               "[\\p{P}\\$\\+<=>\\^~\\|]+", 
        
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)", 
        
                               "[0-9][0-9][0-9]", 
        
                           }); 
        
                           break; 
        
                       case LLAMA_VOCAB_PRE_TYPE_MPT: 
        
                           // TODO: MPT pre-tokenization regexes are unknown 
        
                           //       the following are close, but not exact. run the following: 
        
                           //       ./bin/test-tokenizer-0 ../models/ggml-vocab-mpt.gguf 
        
                           GGML_ASSERT("MPT pre-tokenization regexes are unknown - fixes needed"); 
        
                           word_collection = unicode_regex_split(text, { 
        
                               "\\s?\\p{L}+", 
        
                               "\\s?\\p{P}+", 
        
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)", 
        
                           }); 
        
                           break; 
        
                       case LLAMA_VOCAB_PRE_TYPE_STARCODER: 
        
                       case LLAMA_VOCAB_PRE_TYPE_REFACT: 
        
                       case LLAMA_VOCAB_PRE_TYPE_COMMAND_R: 
        
                           word_collection = unicode_regex_split(text, { 
        
                               "\\p{N}", 
        
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)", 
        
                           }); 
        
                           break; 
        
                       case LLAMA_VOCAB_PRE_TYPE_GPT2: 
        
                       case LLAMA_VOCAB_PRE_TYPE_OLMO: 
        
                           word_collection = unicode_regex_split(text, { 
        
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)", 
        
                           }); 
        
                           break; 
        
                       case LLAMA_VOCAB_PRE_TYPE_QWEN2: 
        
                           word_collection = unicode_regex_split(text, { 
        
                               // original regex from tokenizer.json 
        
                               // "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+" 
        
                               "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+", 
        
                           }); 
        
                           break; 
        
                       default: 
        
                           // default regex for BPE tokenization pre-processing 
        
                           word_collection = unicode_regex_split(text, { 
        
                               "[\\p{P}\\$\\+<=>\\^~\\|]+", 
        
                               "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)", 
        
                               "\\p{N}+", 
        
                               "[0-9][0-9][0-9]", 
        
                           }); 
        
                           break; 
        
                   } 
        
                   break; 
        
               default: 
        
                   GGML_ASSERT(false); 
        
                   break; 
        
           }

jonabur · 2024-05-27T08:32:40Z

We've also run into this same issue providing a gguf'd version of Poro, using the same tokenizer regex. @akx have you made any progress converting the regex format?

akx · 2024-05-27T17:07:26Z

@jonabur I'm working on it (again) now :)

~~Btw, is the tokenizer the same for the other Vikings?~~

EDIT: evidently it is, same chkhsh!

akx · 2024-05-27T17:46:08Z

I'm not sure what to make of the failing tokenizer tests even after adding in the regexp (and I suppose "digits" means another regexp should be splitting all digits into separate tokens?)

The actual detokenized output matches the expected output, but the token sequence doesn't. The first commit in this PR now improves the output of test-tokenizer-0 for easier diffing...

-expected tokens:
+got tokens:
    746 '
  '
   2392 '
 
  '
  55899 '
@@ -169,37 +169,43 @@
   3395 '""'
  30917 '......'
  17846 '!!!!'
   2420 '!!'
  13728 '????'
   3963 '??'
-  9873 ' I've'
+   383 ' I'
+  7029 ''ve'
   1912 ' been'
- 37493 ' 't'
+   630 ' ''
+   107 't'
    733 'old'
- 17600 ' he's'
+   627 ' he'
+   689 ''s'
   1923 ' there'
     35 ','
    630 ' ''
   1417 'RE'
    791 ' you'
   6189 ' sure'
     54 '?'
- 23586 ' 'M'
+   630 ' ''
+    68 'M'
    835 ' not'
   6189 ' sure'
- 18068 ' I'll'
+   383 ' I'
+  6704 ''ll'
   2463 ' make'
    590 ' it'
     35 ','
- 35018 ' 'D'
+   630 ' ''
+    59 'D'
    791 ' you'
   1647 ' like'
   2032 ' some'
  22940 ' tea'
     54 '?'
   2221 ' We'
     30 '''
   6815 'Ve'
    279 ' a'
  79905 ''l'
     67 'L'

and

-expected tokens:
+got tokens:
    348 '   '
  40540 ' Hello'
-   472 '
+   209 '
    '
+   348 '   '
  40540 ' Hello'

github-actions · 2024-05-27T20:22:37Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 550 iterations 🚀

Expand details for performance related PR only

Concurrent users: 8, duration: 10m
HTTP request : avg=8502.32ms p(95)=20008.53ms fails=, finish reason: stop=502 truncated=48
Prompt processing (pp): avg=98.22tk/s p(95)=429.3tk/s
Token generation (tg): avg=35.91tk/s p(95)=46.57tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=viking-8b-b commit=2c8f62fd408bb6118561ff1d24423e8151925cc5

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 550 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1716840726 --> 1716841352
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 402.85, 402.85, 402.85, 402.85, 402.85, 528.5, 528.5, 528.5, 528.5, 528.5, 542.77, 542.77, 542.77, 542.77, 542.77, 625.46, 625.46, 625.46, 625.46, 625.46, 649.16, 649.16, 649.16, 649.16, 649.16, 655.62, 655.62, 655.62, 655.62, 655.62, 676.61, 676.61, 676.61, 676.61, 676.61, 694.72, 694.72, 694.72, 694.72, 694.72, 713.12, 713.12, 713.12, 713.12, 713.12, 716.73, 716.73, 716.73, 716.73, 716.73, 735.72, 735.72, 735.72, 735.72, 735.72, 780.19, 780.19, 780.19, 780.19, 780.19, 779.96, 779.96, 779.96, 779.96, 779.96, 781.35, 781.35, 781.35, 781.35, 781.35, 791.92, 791.92, 791.92, 791.92, 791.92, 796.38, 796.38, 796.38, 796.38, 796.38, 793.92, 793.92, 793.92, 793.92, 793.92, 819.08, 819.08, 819.08, 819.08, 819.08, 823.43, 823.43, 823.43, 823.43, 823.43, 831.4, 831.4, 831.4, 831.4, 831.4, 832.03, 832.03, 832.03, 832.03, 832.03, 836.31, 836.31, 836.31, 836.31, 836.31, 830.72, 830.72, 830.72, 830.72, 830.72, 833.2, 833.2, 833.2, 833.2, 833.2, 832.08, 832.08, 832.08, 832.08, 832.08, 841.14, 841.14, 841.14, 841.14, 841.14, 841.13, 841.13, 841.13, 841.13, 841.13, 842.05, 842.05, 842.05, 842.05, 842.05, 845.61, 845.61, 845.61, 845.61, 845.61, 845.19, 845.19, 845.19, 845.19, 845.19, 844.75, 844.75, 844.75, 844.75, 844.75, 851.75, 851.75, 851.75, 851.75, 851.75, 857.33, 857.33, 857.33, 857.33, 857.33, 861.68, 861.68, 861.68, 861.68, 861.68, 862.18, 862.18, 862.18, 862.18, 862.18, 855.28, 855.28, 855.28, 855.28, 855.28, 853.75, 853.75, 853.75, 853.75, 853.75, 854.13, 854.13, 854.13, 854.13, 854.13, 856.14, 856.14, 856.14, 856.14, 856.14, 856.43, 856.43, 856.43, 856.43, 856.43, 864.87, 864.87, 864.87, 864.87, 864.87, 867.15, 867.15, 867.15, 867.15, 867.15, 866.01, 866.01, 866.01, 866.01, 866.01, 863.74, 863.74, 863.74, 863.74, 863.74, 860.52, 860.52, 860.52, 860.52, 860.52, 855.15, 855.15, 855.15, 855.15, 855.15, 858.56, 858.56, 858.56, 858.56, 858.56, 858.32, 858.32, 858.32, 858.32, 858.32, 863.47, 863.47, 863.47, 863.47, 863.47, 862.15, 862.15, 862.15, 862.15, 862.15, 863.52, 863.52, 863.52, 863.52, 863.52, 866.01, 866.01, 866.01, 866.01, 866.01, 864.87, 864.87, 864.87, 864.87, 864.87, 870.15, 870.15, 870.15, 870.15, 870.15, 871.53, 871.53, 871.53, 871.53, 871.53, 871.02, 871.02, 871.02, 871.02, 871.02, 871.72, 871.72, 871.72, 871.72, 871.72, 872.69, 872.69, 872.69, 872.69, 872.69, 873.18, 873.18, 873.18, 873.18, 873.18, 874.91, 874.91, 874.91, 874.91, 874.91, 875.27, 875.27, 875.27, 875.27]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 550 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1716840726 --> 1716841352
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 42.62, 42.62, 42.62, 42.62, 42.62, 42.63, 42.63, 42.63, 42.63, 42.63, 32.58, 32.58, 32.58, 32.58, 32.58, 35.45, 35.45, 35.45, 35.45, 35.45, 34.73, 34.73, 34.73, 34.73, 34.73, 35.75, 35.75, 35.75, 35.75, 35.75, 36.52, 36.52, 36.52, 36.52, 36.52, 36.26, 36.26, 36.26, 36.26, 36.26, 35.63, 35.63, 35.63, 35.63, 35.63, 35.31, 35.31, 35.31, 35.31, 35.31, 35.43, 35.43, 35.43, 35.43, 35.43, 34.8, 34.8, 34.8, 34.8, 34.8, 33.55, 33.55, 33.55, 33.55, 33.55, 33.08, 33.08, 33.08, 33.08, 33.08, 31.65, 31.65, 31.65, 31.65, 31.65, 30.77, 30.77, 30.77, 30.77, 30.77, 30.41, 30.41, 30.41, 30.41, 30.41, 30.43, 30.43, 30.43, 30.43, 30.43, 30.02, 30.02, 30.02, 30.02, 30.02, 30.2, 30.2, 30.2, 30.2, 30.2, 30.28, 30.28, 30.28, 30.28, 30.28, 30.46, 30.46, 30.46, 30.46, 30.46, 30.26, 30.26, 30.26, 30.26, 30.26, 30.39, 30.39, 30.39, 30.39, 30.39, 30.69, 30.69, 30.69, 30.69, 30.69, 30.59, 30.59, 30.59, 30.59, 30.59, 30.55, 30.55, 30.55, 30.55, 30.55, 30.85, 30.85, 30.85, 30.85, 30.85, 31.11, 31.11, 31.11, 31.11, 31.11, 31.16, 31.16, 31.16, 31.16, 31.16, 31.31, 31.31, 31.31, 31.31, 31.31, 31.42, 31.42, 31.42, 31.42, 31.42, 31.18, 31.18, 31.18, 31.18, 31.18, 31.15, 31.15, 31.15, 31.15, 31.15, 30.67, 30.67, 30.67, 30.67, 30.67, 30.34, 30.34, 30.34, 30.34, 30.34, 30.48, 30.48, 30.48, 30.48, 30.48, 30.71, 30.71, 30.71, 30.71, 30.71, 30.85, 30.85, 30.85, 30.85, 30.85, 30.88, 30.88, 30.88, 30.88, 30.88, 30.99, 30.99, 30.99, 30.99, 30.99, 30.79, 30.79, 30.79, 30.79, 30.79, 30.42, 30.42, 30.42, 30.42, 30.42, 30.09, 30.09, 30.09, 30.09, 30.09, 29.11, 29.11, 29.11, 29.11, 29.11, 28.79, 28.79, 28.79, 28.79, 28.79, 28.76, 28.76, 28.76, 28.76, 28.76, 28.82, 28.82, 28.82, 28.82, 28.82, 28.98, 28.98, 28.98, 28.98, 28.98, 29.0, 29.0, 29.0, 29.0, 29.0, 29.07, 29.07, 29.07, 29.07, 29.07, 29.05, 29.05, 29.05, 29.05, 29.05, 28.98, 28.98, 28.98, 28.98, 28.98, 29.05, 29.05, 29.05, 29.05, 29.05, 29.13, 29.13, 29.13, 29.13, 29.13, 29.23, 29.23, 29.23, 29.23, 29.23, 29.37, 29.37, 29.37, 29.37, 29.37, 29.45, 29.45, 29.45, 29.45, 29.45, 29.51, 29.51, 29.51, 29.51, 29.51, 29.57, 29.57, 29.57, 29.57, 29.57, 29.58, 29.58, 29.58, 29.58]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 550 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1716840726 --> 1716841352
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.13, 0.13, 0.13, 0.13, 0.13, 0.4, 0.4, 0.4, 0.4, 0.4, 0.12, 0.12, 0.12, 0.12, 0.12, 0.17, 0.17, 0.17, 0.17, 0.17, 0.16, 0.16, 0.16, 0.16, 0.16, 0.11, 0.11, 0.11, 0.11, 0.11, 0.16, 0.16, 0.16, 0.16, 0.16, 0.19, 0.19, 0.19, 0.19, 0.19, 0.26, 0.26, 0.26, 0.26, 0.26, 0.09, 0.09, 0.09, 0.09, 0.09, 0.21, 0.21, 0.21, 0.21, 0.21, 0.32, 0.32, 0.32, 0.32, 0.32, 0.25, 0.25, 0.25, 0.25, 0.25, 0.41, 0.41, 0.41, 0.41, 0.41, 0.3, 0.3, 0.3, 0.3, 0.3, 0.32, 0.32, 0.32, 0.32, 0.32, 0.17, 0.17, 0.17, 0.17, 0.17, 0.29, 0.29, 0.29, 0.29, 0.29, 0.14, 0.14, 0.14, 0.14, 0.14, 0.2, 0.2, 0.2, 0.2, 0.2, 0.23, 0.23, 0.23, 0.23, 0.23, 0.16, 0.16, 0.16, 0.16, 0.16, 0.12, 0.12, 0.12, 0.12, 0.12, 0.1, 0.1, 0.1, 0.1, 0.1, 0.14, 0.14, 0.14, 0.14, 0.14, 0.27, 0.27, 0.27, 0.27, 0.27, 0.11, 0.11, 0.11, 0.11, 0.11, 0.18, 0.18, 0.18, 0.18, 0.18, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.2, 0.2, 0.2, 0.2, 0.2, 0.27, 0.27, 0.27, 0.27, 0.27, 0.29, 0.29, 0.29, 0.29, 0.29, 0.23, 0.23, 0.23, 0.23, 0.23, 0.16, 0.16, 0.16, 0.16, 0.16, 0.17, 0.17, 0.17, 0.17, 0.17, 0.12, 0.12, 0.12, 0.12, 0.12, 0.14, 0.14, 0.14, 0.14, 0.14, 0.13, 0.13, 0.13, 0.13, 0.13, 0.34, 0.34, 0.34, 0.34, 0.34, 0.5, 0.5, 0.5, 0.5, 0.5, 0.59, 0.59, 0.59, 0.59, 0.59, 0.55, 0.55, 0.55, 0.55, 0.55, 0.27, 0.27, 0.27, 0.27, 0.27, 0.19, 0.19, 0.19, 0.19, 0.19, 0.17, 0.17, 0.17, 0.17, 0.17, 0.12, 0.12, 0.12, 0.12, 0.12, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.18, 0.18, 0.18, 0.18, 0.18, 0.25, 0.25, 0.25, 0.25, 0.25, 0.1, 0.1, 0.1, 0.1, 0.1, 0.09, 0.09, 0.09, 0.09, 0.09, 0.18, 0.18, 0.18, 0.18, 0.18, 0.23, 0.23, 0.23, 0.23, 0.23, 0.13, 0.13, 0.13, 0.13, 0.13, 0.17, 0.17, 0.17, 0.17, 0.17, 0.16, 0.16, 0.16, 0.16, 0.16, 0.22, 0.22, 0.22, 0.22, 0.22, 0.25, 0.25, 0.25, 0.25]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 550 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1716840726 --> 1716841352
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 8.0, 8.0, 8.0, 8.0, 8.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0]

jonabur · 2024-05-30T06:43:15Z

I'm not familiar with the tokenizer regex unfortunately, we inherited it from Bloom (maybe it should be called bloom tokenizer, instead?) but it looks like it's using some possibly non-standard regex features, and may not translate directly to the regex format supported in llama.cpp.

In particular capturing groups inside character classes, or embedding character classes inside character classes both seem possibly non-standard to me, though it's been a long time since I've done regexes anywhere near this complicated.

jonabur · 2024-05-30T13:50:12Z

llama.cpp

+                    tokenizer_pre == "llama3"    ||
+                    tokenizer_pre == "llama-v3"  ||
+                    tokenizer_pre == "llama-bpe" ||
+                    tokenizer_pre == "viking-7b") {


This needs two changes:

tokenizer_pre needs to be updated to match the "viking" in convert-hf-to-gguf.py

this needs its own if statement block which sets vocab.type_pre = LLAMA_VOCAB_PRE_TYPE_VIKING

jonabur · 2024-05-30T14:02:04Z

llama.cpp

@@ -12580,6 +12581,11 @@ struct llm_tokenizer_bpe {
                            "(?:'[sS]|'[tT]|'[rR][eE]|'[vV][eE]|'[mM]|'[lL][lL]|'[dD])|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+",
                        });
                        break;
+                    case LLAMA_VOCAB_PRE_TYPE_VIKING:
+                        word_collection = unicode_regex_split(text, {
+                            " ?[^(\\s|[.,!?…。，、।۔،])]+",


I think this regex works for the first test that fails, but I'm still left with one failing test I don't understand.

" ?[^\\s.,!?…。，、।۔،]+",

jonabur · 2024-05-30T14:06:01Z

I suggested two changes which enable the code to work and the tests to pass, but now a different test is failing and I don't understand why. It looks like a unicode character is being split? Any idea what's going on here?

I'm not confident on the updated regex, because I'm not sure what the expected behavior is for embedding a character class inside a character class is supposed to be, or for a capturing group within a character class--I've never seen that done before. "[^(\\s|[...])]" but it seems like the simpler statement should work?

src: 'ied 4 ½ months'
res: 'ied 4 ½ months'
tok: 1502 231 43 882 145 9290
failed test:    'ied 4 ½ months'
detokenized to: 'ied 4 ½ months'
(which matches the expected output)
expected tokens:
  1502 'ied'
   231 ' '
    43 '4'
   231 ' '
  1177 '½'
  9290 ' months'

got tokens:
  1502 'ied'
   231 ' '
    43 '4'
   882 ' �'
   145 '�'
  9290 ' months'

ezosa · 2024-06-20T07:44:59Z

Here’s a couple of updates related to this issue:

Poro-34b-chat pre-tokenizer support is now merged (Poro-34B-chat tokenizer support #7713 ). Since Poro and Viking share the same pre-tokenizer regex, you can use poro-chat as the pre-tokenizer type for Viking. So the changes to convert-hf-to-gguf.py will be:

        if chkhsh == "7fc505bd3104ca1083b150b17d088b59534ede9bde81f0dd2090967d7fe52cee":
            # ref: https://huggingface.co/LumiOpen/Viking-7B
            res = "poro-chat"

We looked further into the failed test for Viking and found that this block in tokenizer.json is related to the issue, under the pre_tokenizer attribute (this does not exist in Poro's tokenizer so we didn't get this problem there):

      {
        "type": "Digits",
        "individual_digits": true
      },

If you remove this part and regenerate the vocab files and run the tests again, the tests will pass.
The tokenization of the test strings with and without the digits block is the same. Therefore we think that removing that block will not cause any issues in the downstream performance of the quantised model.

This is the failed test:

src: 'ied 4 ½ months'
res: 'ied 4 ½ months'
tok: 1502 231 43 882 145 9290 
main : failed test:    'ied 4 ½ months'
main : detokenized to: 'ied 4 ½ months' instead of 'ied 4 ½ months'
main : expected tokens:   1502 'ied',    231 ' ',     43 '4',    231 ' ',   1177 '½',   9290 ' months', 
main : got tokens:        1502 'ied',    231 ' ',     43 '4',    882 ' �',    145 '�',   9290 ' months',

After we remove the digits block in tokenizer.json and regenerate the vocab files and run the tests, we get:

src: 'ied 4 ½ months'
res: 'ied 4 ½ months'
tok: 1502 231 43 882 145 9290

ggerganov · 2024-06-20T07:51:16Z

Try to add the following regex to the viking pre-tokenizer in llama.cpp:

"\\p{N}",

ezosa · 2024-06-20T09:07:26Z

Thanks, that worked! So now we need to separate the pre-tokenizers for Poro and Viking. In llama.cpp, this needs to be added:

            case LLAMA_VOCAB_PRE_TYPE_VIKING:
                regex_exprs = {
                    "\\p{N}",
                    " ?[^(\\s|.,!?…。，、।۔،)]+",
                };
                break;

akx · 2024-07-01T08:17:31Z

Closing in favor of #8135 :)

mofosyne added model Model specific python python script changes Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix labels May 16, 2024

ggerganov reviewed May 17, 2024

View reviewed changes

akx mentioned this pull request May 17, 2024

add Viking tokenizer support #7329

Closed

akx marked this pull request as draft May 17, 2024 10:52

akx force-pushed the viking-8b-b branch from 582491a to e69ffba Compare May 17, 2024 11:07

akx mentioned this pull request May 17, 2024

convert-hf-to-gguf-update improvements #7340

Merged

akx force-pushed the viking-8b-b branch 2 times, most recently from ab842e3 to 69f815d Compare May 17, 2024 11:32

akx changed the title ~~Add LumiOpen/Viking-7B to converter script~~ Viking-7B tokenizer support May 17, 2024

akx mentioned this pull request May 26, 2024

Readme: add akx/ggify to tools #1484

Merged

akx force-pushed the viking-8b-b branch from 69f815d to a89252a Compare May 27, 2024 17:24

github-actions bot added the testing Everything test related label May 27, 2024

akx added 2 commits May 27, 2024 20:45

test-tokenizer-0: improve output, show how many tests failed

c28c996

Add Viking-7B tokenizer support

2c8f62f

akx force-pushed the viking-8b-b branch from a89252a to 2c8f62f Compare May 27, 2024 17:45

akx changed the title ~~Viking-7B tokenizer support~~ Viking tokenizer support May 27, 2024

jonabur reviewed May 30, 2024

View reviewed changes

ezosa mentioned this pull request Jun 3, 2024

Poro-34B-chat tokenizer support #7713

Merged

kustaaya mentioned this pull request Jun 26, 2024

Added support for Viking pre-tokenizer #8135

Merged

4 tasks

akx closed this Jul 1, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Viking tokenizer support #7328

Viking tokenizer support #7328

akx commented May 16, 2024 •

edited

Loading

ggerganov left a comment

akx commented May 17, 2024

ggerganov commented May 17, 2024

jonabur commented May 27, 2024

akx commented May 27, 2024 •

edited

Loading

akx commented May 27, 2024

github-actions bot commented May 27, 2024

jonabur commented May 30, 2024

jonabur May 30, 2024

jonabur May 30, 2024

jonabur commented May 30, 2024 •

edited

Loading

ezosa commented Jun 20, 2024

ggerganov commented Jun 20, 2024

ezosa commented Jun 20, 2024

akx commented Jul 1, 2024

Viking tokenizer support #7328

Viking tokenizer support #7328

Conversation

akx commented May 16, 2024 • edited Loading

ggerganov left a comment

Choose a reason for hiding this comment

akx commented May 17, 2024

ggerganov commented May 17, 2024

jonabur commented May 27, 2024

akx commented May 27, 2024 • edited Loading

akx commented May 27, 2024

github-actions bot commented May 27, 2024

jonabur commented May 30, 2024

jonabur May 30, 2024

Choose a reason for hiding this comment

jonabur May 30, 2024

Choose a reason for hiding this comment

jonabur commented May 30, 2024 • edited Loading

ezosa commented Jun 20, 2024

ggerganov commented Jun 20, 2024

ezosa commented Jun 20, 2024

akx commented Jul 1, 2024

akx commented May 16, 2024 •

edited

Loading

akx commented May 27, 2024 •

edited

Loading

jonabur commented May 30, 2024 •

edited

Loading