
llama : add Deepseek support #5981 #6252

Closed
wants to merge 12 commits into from

Conversation

dragnil1
Contributor

ref #5981

unicode.h Outdated
Comment on lines 29 to 31
std::vector<std::wstring> get_gpt2_regex();
std::vector<std::wstring> get_deepseek_coder_regex();
std::vector<std::wstring> get_deepseek_llm_regex();
Owner

@ggerganov ggerganov Mar 23, 2024


I'm thinking the interface here should be:

std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regexes);

The implementation should be something like what regex_bpe_preprocess currently is. It loops through the regex strings and if we have a known unicode representation (e.g. "\s?\p{L}+" -> std::wstring) - we apply it with std::wregex. Else, if we have a custom implementation (for example see the GPT2 preprocess function) then we apply that.

The unicode module should not have any kind of notion about GPT2, Deepseek or other model-related stuff. This information should be in llama.cpp

llama.cpp Outdated
Comment on lines 10097 to 10099
std::vector<std::string> bpe_deepseek_coder_preprocess(const std::string & text) {
return regex_bpe_preprocess(text, get_deepseek_coder_regex());
}
Owner


Following my previous comment, this should eventually become:

Suggested change
std::vector<std::string> bpe_deepseek_coder_preprocess(const std::string & text) {
    return regex_bpe_preprocess(text, get_deepseek_coder_regex());
}
std::vector<std::string> bpe_deepseek_coder_preprocess(const std::string & text) {
    return unicode_regex_split(text, {
        "[\\p{P}\\$\\+<=>\\^~\\|]+",
        "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)",
        "[0-9][0-9][0-9]",
        "\\s?\\p{L}+",
        "\\s?\\p{P}+",
        "\\p{N}",
    });
}

llama.cpp Outdated
const llama_vocab & vocab;

std::vector<llm_symbol> symbols;
std::vector<llm_symbol> symbols_final;

llm_bigram_bpe::queue work_queue;

const std::vector<std::wstring> gpt2_regex = {
Owner

@ggerganov ggerganov Apr 1, 2024


I had a different idea about this - let me try to explain again:

In llama.cpp, we want to keep the original regex strings as they have been specified by the model creators. For example:

  • 's|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)
  • \s?\p{L}+
  • etc.

Now, my understanding is that in C++ we cannot simply perform some of those regex matches due to the lack of support for some of the regex patterns in the standard library. So to solve this issue, we create the unicode module, which takes the regex strings from above as they are and applies a few different strategies to split the target string:

  • If we have a known unicode representation generated in some way, we apply that using std::wregex. I.e. we check a constant std::map<std::string, std::wstring> for the presence of the regex
  • If not, we then check if we have a custom implementation of the regex via a function call (see bpe_gpt2_preprocess() on master which is a custom implementation of regex split with 's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)
  • Else, we just apply std::regex and hope for the best, or throw an error
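
A minimal sketch of how that dispatch could look, assuming hypothetical helpers (to_wstring/to_utf8 conversions, a custom_gpt2_split stand-in) and a placeholder pattern table - this is not the actual llama.cpp interface, only an illustration of the three strategies:

// unicode.cpp (sketch) - names, map contents and fallback policy are illustrative assumptions
#include <codecvt>
#include <locale>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>
#include <vector>

// minimal UTF-8 <-> wstring helpers for the sketch (std::wstring_convert is deprecated
// in C++17 but still available; the real module would have its own conversion)
static std::wstring to_wstring(const std::string & s) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(s);
}

static std::string to_utf8(const std::wstring & ws) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// stand-in for the custom GPT2-style split (bpe_gpt2_preprocess on master)
static std::vector<std::string> custom_gpt2_split(const std::string & text) {
    return { text }; // placeholder: the real function splits according to the GPT2 pattern
}

std::vector<std::string> unicode_regex_split(const std::string & text, const std::vector<std::string> & regexes) {
    // original regex string -> known std::wregex-compatible equivalent (placeholder entry)
    static const std::map<std::string, std::wstring> k_known = {
        { "\\s?\\p{L}+", L"\\s?[[:alpha:]]+" },
    };

    std::vector<std::string> pieces = { text };
    for (const auto & re : regexes) {
        std::vector<std::string> next;
        const auto it = k_known.find(re);
        if (it != k_known.end()) {
            // 1) known unicode representation -> apply it with std::wregex
            const std::wregex wre(it->second);
            for (const auto & piece : pieces) {
                const std::wstring wpiece = to_wstring(piece);
                for (std::wsregex_token_iterator m(wpiece.begin(), wpiece.end(), wre), end; m != end; ++m) {
                    next.push_back(to_utf8(m->str()));
                }
            }
        } else if (re == "'s|'t|'re|'ve|'m|'ll|'d| ?\\p{L}+| ?\\p{N}+| ?[^\\s\\p{L}\\p{N}]+|\\s+(?!\\S)") {
            // 2) custom C++ implementation of this specific regex
            for (const auto & piece : pieces) {
                const auto parts = custom_gpt2_split(piece);
                next.insert(next.end(), parts.begin(), parts.end());
            }
        } else {
            // 3) no known equivalent and no custom implementation -> std::regex or an error
            throw std::runtime_error("unsupported regex: " + re);
        }
        pieces = std::move(next);
    }
    return pieces;
}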

Contributor Author


Hello, in unicode.cpp I have implemented the unicode_regex_split() function, which iterates through the given regexes and, if a match is found, applies that regex. Otherwise, it uses the modified bpe_gpt2_preprocess() function, renamed unicode_custom_preprocess(). Now I have some questions regarding bpe_gpt2_preprocess(). Can it handle the input of DeepSeek Coder and DeepSeek LLM? If not, do I have to write custom functions for them for the case when no regex is found?

Owner


> Can it handle the input of DeepSeek Coder and DeepSeek LLM?

No, AFAIK it is not compatible with the deepseek regex. The way I understand it is that bpe_gpt2_preprocess() (i.e. unicode_custom_preprocess()) works only for the following regex (based on the comment in the code):

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

So the logic in unicode_regex_split() has to check if this is the input regex and only apply this specific custom implementation in that case. For other regexes, we might want to add more custom implementations in other functions and use them in unicode_regex_split() in the future.

Note that this is my understanding of how this part of the tokenizer is supposed to work. I could be wrong, so don't take all of these suggestions for granted.

In any case, the huge unicode constants like gpt2_regex should not be located in llama.cpp, but instead should be in unicode.cpp.

Contributor

github-actions bot commented Apr 16, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 424 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=11118.74ms p(95)=30395.38ms fails=, finish reason: stop=368 truncated=56
  • Prompt processing (pp): avg=123.28tk/s p(95)=545.28tk/s
  • Token generation (tg): avg=25.14tk/s p(95)=34.55tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=unicode-refactor-regex commit=d58d9d80f8152edb5ac913d4f97fea129e3c4d93

Charts (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 424 iterations): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing

@ggerganov
Owner

@dragnil1 Thank you for the help. With LLaMA v3 now switching to BPE tokenizer, this functionality becomes very important (see #6914).

I'll focus on finalizing it ASAP and will likely use the work in this PR as a starting point. Will probably move the branch in the llama.cpp repo so that we can run ggml-ci as well.

The tokenization and unicode handling are definitely not my strongest or favourite part of the codebase, so if you or anyone else has any insights, don't hesitate to share or help out. I think I have an understanding of how to implement BPE pre-processing support, but I could very well be missing something.

@dragnil1
Contributor Author

dragnil1 commented Apr 26, 2024

> @dragnil1 Thank you for the help. With LLaMA v3 now switching to BPE tokenizer, this functionality becomes very important (see #6914).
>
> I'll focus on finalizing it ASAP and will likely use the work in this PR as a starting point. Will probably move the branch in the llama.cpp repo so that we can run ggml-ci as well.
>
> The tokenization and unicode handling are definitely not my strongest or favourite part of the codebase, so if you or anyone else has any insights, don't hesitate to share or help out. I think I have an understanding of how to implement BPE pre-processing support, but I could very well be missing something.

Thanks for letting me work on this PR. Sorry for the delay; I was working on getting it to run on Windows. The recent commit passed the tests on Ubuntu, but the tokenizer tests failed on Windows. I had the idea of using std::wregex when the wchar_t size is 32 bits and std::regex when the wchar_t size is 16 bits. But using regex gives a SEGFAULT in the tokenizer tests on Windows. This is probably because the regex pattern from the ReFlex library is much larger than the regex pattern used for wregex. While researching this, I found that the most efficient approach would be to use the Boost library with ICU support, but that would hamper llama.cpp's minimal-dependency goal. Otherwise, we could use the standalone Boost regex library. I was also thinking of converting the regex pattern produced by the ReFlex library into a UTF-32 pattern and a UTF-16 pattern, which respectively work on Ubuntu and probably on Windows.
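
Roughly, the dispatch I had in mind looks like this (sketch only - the two helpers are placeholder stubs standing in for real UTF-32 / UTF-16 implementations):

#include <string>
#include <vector>

// would convert text and pattern to UTF-32 std::wstring and apply std::wregex (Linux/macOS path)
static std::vector<std::string> split_with_wregex(const std::string & text, const std::string & /*pattern*/) {
    return { text }; // placeholder
}

// would apply std::regex (or a UTF-16 aware pattern) - the path that currently crashes on Windows
static std::vector<std::string> split_with_regex(const std::string & text, const std::string & /*pattern*/) {
    return { text }; // placeholder
}

std::vector<std::string> split_portable(const std::string & text, const std::string & pattern) {
    if constexpr (sizeof(wchar_t) == 4) {
        return split_with_wregex(text, pattern); // 32-bit wchar_t: one codepoint per wchar_t
    } else {
        return split_with_regex(text, pattern);  // 16-bit wchar_t (Windows)
    }
}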

@ggerganov
Owner

ggerganov commented Apr 26, 2024

> But using regex gives a SEGFAULT in the tokenizer tests on Windows. This is probably because the regex pattern from the ReFlex library is much larger than the regex pattern used for wregex.

In the latest version #6920, I changed the order of the regexes: first look for a custom implementation and then look for a known equivalent regex. I also disabled the DeepSeek code paths temporarily until we get the tests running as they were on master. So with these changes, I don't think we apply large regexes, but it still crashes (based on the Windows CIs).

I don't have a Windows environment to work on, so it's gonna take me some time to figure out where it goes wrong

Edit: apparently the Windows build failures were unrelated - hopefully we have a baseline that works now

@dragnil1
Contributor Author

dragnil1 commented Apr 26, 2024

Ok. I have found the reason for the tests failing on Windows. Some regex ranges are not valid on Windows. Here is an example that will run:

#include <iostream>
#include <string>
#include <regex>

int main() {
    std::wregex pattern2(L"[\U00000041-\U0000005A]"); // will run
    return 0;
}

Here is an example that will not run:

#include <iostream>
#include <string>
#include <regex>

int main() {
    std::wregex pattern1(L"[\U00011700-\U0001171A]"); // will not run
    return 0;
}

Both regex ranges are taken from the GPT2 regex.

@ggerganov
Owner

Yes, I just noticed the error in the CI:

6: llama_model_load: error loading model: error loading model vocabulary: regex_error(error_range): The expression contained an invalid character range, such as [b-a] in most encodings.

https://github.com/ggerganov/llama.cpp/actions/runs/8850389799/job/24304574061?pr=6920#step:12:1392

Any ideas how to resolve?

@dragnil1
Contributor Author

dragnil1 commented Apr 26, 2024

> Yes, I just noticed the error in the CI:
>
> 6: llama_model_load: error loading model: error loading model vocabulary: regex_error(error_range): The expression contained an invalid character range, such as [b-a] in most encodings.
>
> https://github.com/ggerganov/llama.cpp/actions/runs/8850389799/job/24304574061?pr=6920#step:12:1392
>
> Any ideas how to resolve?

We cannot use ranges that contain codepoints that require more than 2 bytes. We have to convert the 3-byte or 4-byte ranges to individual values. But this may cause the regex pattern to become large enough to result in a SEGFAULT. I will let you know if I can find a viable solution.
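
For example, a range above U+FFFF could be expanded into an explicit alternation of UTF-16 surrogate pairs so that std::wregex on Windows can match it (rough sketch, helper name made up; escaping of regex metacharacters and the pattern-size problem are ignored here):

#include <string>

// expand [first-last] into (?:...|...) where each codepoint above U+FFFF is
// written as its UTF-16 surrogate pair (for 16-bit wchar_t on Windows)
static std::wstring range_to_utf16_alternation(char32_t first, char32_t last) {
    std::wstring out = L"(?:";
    for (char32_t cp = first; cp <= last; ++cp) {
        if (cp != first) {
            out += L'|';
        }
        if (cp <= 0xFFFF) {
            out += static_cast<wchar_t>(cp);
        } else {
            const char32_t v = cp - 0x10000;
            out += static_cast<wchar_t>(0xD800 + (v >> 10));   // high surrogate
            out += static_cast<wchar_t>(0xDC00 + (v & 0x3FF)); // low surrogate
        }
    }
    out += L')';
    return out;
}

// e.g. [\U00011700-\U0001171A] -> range_to_utf16_alternation(0x11700, 0x1171A)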

@mofosyne mofosyne added the Review Complexity : Medium and enhancement labels May 10, 2024
@Galunid Galunid added the obsolete? label Jun 15, 2024
@Galunid Galunid closed this Jun 15, 2024