add refact model #3329
Conversation
Force-pushed from a7f46fd to 81c00c2
did not test.
on a side note: we end up with more and more duplicated code as we add more models, at least for the HF model loading in Python.
I tested it out and it's working as expected. Any idea on how you'd like it refactored to reduce the duplicates down to a single …? If not, I could probably mock up some ideas once I have some resources and time. As long as I have a bit of an idea (e.g. a track) of how to go about it or what might be expected, it shouldn't be much of a problem. Merging multiple interfaces into a single coherent interface in Python is kind of my thing.
@teleprint-me it is probably worth changing after the special token issue (#2820) is resolved. I think we have a few bugs in all of the current converters when there is no added_token.json but there is a special_tokens_map.json. wdyt?
I think that attempting to support a variety of variable special token types is a challenging task. Probably not impossible, but also probably not worth pursuing. A standard interface like ChatML would be better suited than attempting to adapt to a variety of model-specific special tokens. There really isn't a "holy grail" solution to this; it's up to the dataset creators as well as the fine-tuners. It's easier to create a generally abstracted interface that can accommodate and adapt to the variable structures that creators and developers want to implement. I mentioned something similar on an issue in llama-cpp-python because the bos and eos tokens are hard-coded into the method for creating chat completions. This isn't an issue specific to llama.cpp; it's an issue that requires a generally agreed-upon specification for everyone to operate under. I don't see that happening for a while though, so we'll see.

Creating a general interface for handling the variable conversions would face a similar issue, but it would be manageable and generally worth pursuing, if only to reduce the number of front-end scripts and the code duplication. You could create a factory that abstracts this and creates the instance for converting the tensors, with a single front-end CLI interface as a result. This would be modular, approachable, maintainable, as well as extensible.
@teleprint-me i totally agree with your point that we should probably modularize the HuggingFace converters. do you think we could do it in another PR? i guess merging the HuggingFace falcon, baichuan, refact and starcoder converters would be an easier place to start, since convert.py also includes the original llama pt version.
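For what it's worth, a minimal sketch of the factory/registry idea from the comments above; the names here (ModelConverter, register, the per-architecture functions) are hypothetical and not existing llama.cpp code:

```python
import argparse
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class ModelConverter:
    name: str
    convert: Callable[[str, str], None]  # (hf_model_dir, gguf_out_path) -> None

_REGISTRY: Dict[str, ModelConverter] = {}

def register(name: str):
    """Decorator that adds an architecture-specific converter to the registry."""
    def wrap(fn: Callable[[str, str], None]) -> Callable[[str, str], None]:
        _REGISTRY[name] = ModelConverter(name, fn)
        return fn
    return wrap

@register("refact")
def convert_refact(hf_dir: str, out_path: str) -> None:
    # only the refact-specific tensor renaming / hparam mapping would live here
    ...

@register("falcon")
def convert_falcon(hf_dir: str, out_path: str) -> None:
    ...

def main() -> None:
    parser = argparse.ArgumentParser(description="single front-end for the HF converters")
    parser.add_argument("arch", choices=sorted(_REGISTRY))
    parser.add_argument("hf_dir")
    parser.add_argument("out_path")
    args = parser.parse_args()
    _REGISTRY[args.arch].convert(args.hf_dir, args.out_path)

if __name__ == "__main__":
    main()
```

Each per-architecture function would then only carry what actually differs (tensor names, hparams, special tokens), while vocab handling and GGUF writing could live in shared helpers.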
I am getting a bunch of …
benchmarks:
cpu only with openblas: …
gpu only cuda: …
This sounds like a good place to start. I usually prefer using a template to build off of.
Sure, we can do it in another PR. I don't mind at all.
I'm open to looking into this and creating a skeleton or writing up an outline to plan it out. Let me know.
@Green-Sky that adding is expected in order to match the vocab size; it is similar to the falcon converter code. btw, the CI is not passing due to this one: https://github.com/ggerganov/llama.cpp/actions/runs/6309006961/job/17136638816?pr=3329.
Rebasing did not help. Does anyone know how we can fix it? Thanks. cc @Green-Sky @ggerganov
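For context on the "adding is expected" part, a rough sketch of the kind of vocab padding the falcon-style converters do (this assumes a HuggingFace tokenizer and a placeholder token name; it is not the exact code in convert-refact-hf-to-gguf.py):

```python
from transformers import AutoTokenizer

def padded_token_list(model_dir: str, target_vocab_size: int) -> list:
    tok = AutoTokenizer.from_pretrained(model_dir)
    by_id = {i: t for t, i in tok.get_vocab().items()}        # id -> token text
    tokens = [by_id[i].encode("utf-8") for i in sorted(by_id)]
    # pad with placeholder entries so the token list length matches the vocab
    # size from the model hparams, similar to the falcon converter
    while len(tokens) < target_vocab_size:
        tokens.append(f"<dummy{len(tokens):05}>".encode("utf-8"))
    return tokens
```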
It is a known issue; you can safely ignore the FreeBSD CI failures.
looks good to me.
did some very basic testing, but no FIM. still waiting on #2934
Need to adapt to the new llama_batch API and also replace ggml_alibi with ggml_add, as we did for ggml_diag_mask_inf in #3228:
Lines 3250 to 3262 in 0e76a89

    switch (model.type) {
        case MODEL_7B:
            KQ_masked = ggml_add(ctx0, KQ_scaled, KQ_mask);
            break;
        case MODEL_13B:
            // TODO: replace with ggml_add()
            KQ_scaled_alibi = ggml_alibi(ctx0, KQ_scaled, /*n_past*/ 0, n_head, 8);
            ggml_set_name(KQ_scaled_alibi, "KQ_scaled_alibi");
            KQ_masked = ggml_add(ctx0, KQ_scaled_alibi, KQ_mask);
            break;
        default:
            GGML_ASSERT(false);
    }
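As a rough illustration of what "replace ggml_alibi with ggml_add" means here: the ALiBi bias depends only on the head index and the key/query positions, so it can be precomputed once per batch together with the causal mask and simply added to the scaled KQ scores. A minimal numpy sketch, assuming max_bias = 8 and a power-of-two head count (the non-power-of-two slope interpolation and the actual graph code are not shown):

```python
import numpy as np

def alibi_kq_bias(n_head: int, n_past: int, n_q: int, n_kv: int,
                  max_bias: float = 8.0) -> np.ndarray:
    """Bias of shape (n_head, n_q, n_kv) that can be added to the scaled KQ scores."""
    # per-head slope: 2^(-max_bias * h / n_head) for h = 1..n_head
    slopes = 2.0 ** (-max_bias * np.arange(1, n_head + 1) / n_head)

    q_pos = n_past + np.arange(n_q)          # absolute query positions
    k_pos = np.arange(n_kv)                  # absolute key positions
    rel   = k_pos[None, :] - q_pos[:, None]  # (n_q, n_kv), <= 0 for visible keys

    bias = slopes[:, None, None] * rel[None, :, :]       # ALiBi part
    bias = np.where(rel[None, :, :] > 0, -np.inf, bias)  # causal mask on future keys
    return bias
```

In the graph this would correspond to building one KQ_mask-style tensor outside the per-layer loop and doing KQ_masked = ggml_add(ctx0, KQ_scaled, KQ_mask) for every model type, instead of calling ggml_alibi per layer.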
@ggerganov i am getting incorrect results for refact (alibi) when switching to the new KQ_mask construction outside the for loop and ggml_add. only …
No, baichuan 13B is known to not work at the moment. The …
@ggerganov i took my word back. it doesn't have a problem. i will push a new PR based on the latest branch |
@ggerganov @Green-Sky i have pushed the new commit to rebase onto the latest one. it passes with the Metal GPU, however it fails in CPU-only mode with this error from …
i added another fix to remove the assert since we pass …
Does this implementation produce correct results? I think the …
@ggerganov i am getting correct results. i followed the example code here. Line 3218 in 40e07a6
do you mean we should actually set n_past instead of changing it to 0? i tested it by setting it to kv_head (the old n_past), and it looks like the result is identical with no difference in speed.
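For what it's worth, one reason the two settings could give identical results (a hedged observation, not a claim about what ggml_alibi does internally): if changing that argument only shifts every score in a row of the attention matrix by the same constant, softmax is unaffected. A tiny numpy check of that invariance:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

row   = np.array([0.3, -1.2, 2.0, 0.7])  # one row of attention scores
shift = 5.0                              # a constant added to the whole row
print(np.allclose(softmax(row), softmax(row + shift)))  # -> True
```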
I see Refact is using a GPT2-based tokenizer. Would you care to check the impact of #3252 on this conversion (although I didn't have time to consider …)
Ah interesting. I just realized that … Might look into merging #3252 before that.
Let's resolve conflicts and merge
Hope I didn't break something with that last merge
Hi @ggerganov,
Thanks for reporting this - there have been some ongoing tokenizer changes recently and things seem to be unstable. A similar issue was reported in #3484 - not sure if it is related.
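If it helps with debugging, a rough spot check of the GPT-2/BPE conversion against HuggingFace; this is a sketch that assumes llama-cpp-python is installed and the model paths below, and the token IDs should only line up if the converter and the #3252 tokenizer work agree with HF:

```python
from llama_cpp import Llama
from transformers import AutoTokenizer

hf = AutoTokenizer.from_pretrained("./Refact-1_6B-fim")
gg = Llama(model_path="./Refact-1_6B-fim/ggml-model-f16.gguf", vocab_only=True)

tests = [
    "def multiply(a, b):",
    "Hello, world!",
    "    indented code\n\twith a tab",
]
for text in tests:
    hf_ids = hf.encode(text, add_special_tokens=False)
    gg_ids = gg.tokenize(text.encode("utf-8"), add_bos=False)
    status = "OK" if hf_ids == gg_ids else "MISMATCH"
    print(f"{status}: {text!r}\n  hf: {hf_ids}\n  gg: {gg_ids}")
```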
…example

* 'master' of github.com:ggerganov/llama.cpp: (24 commits)
  convert : fix Baichuan2 models by using vocab size in config.json (ggerganov#3299)
  readme : add project status link
  ggml : fix build after ggerganov#3329
  llm : add Refact model (ggerganov#3329)
  sync : ggml (conv 1d + 2d updates, UB fixes) (ggerganov#3468)
  finetune : readme fix typo (ggerganov#3465)
  ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggerganov#3453)
  main : consistent prefix/suffix coloring (ggerganov#3425)
  llama : fix session saving/loading (ggerganov#3400)
  llama : expose model's rope_freq_scale in the API (ggerganov#3418)
  metal : alibi for arbitrary number of heads (ggerganov#3426)
  cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggerganov#3273)
  Work on the BPE tokenizer (ggerganov#3252)
  convert : fix vocab size when not defined in hparams (ggerganov#3421)
  cmake : increase minimum version for add_link_options (ggerganov#3444)
  CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402)
  gguf : add BERT, MPT, and GPT-J arch info (ggerganov#3408)
  gguf : general usability improvements (ggerganov#3409)
  cmake : make CUDA flags more similar to the Makefile (ggerganov#3420)
  finetune : fix ggerganov#3404 (ggerganov#3437)
  ...
@martell I think you need to re-convert the model using the updated python script and it should work.
I should have shared a checksum in the original comment. I converted it at llama.cpp commit 0d152b3.
Refact-1_6B-fim: …
converted using: …
* add refact model
* resolve comments
* rebase to the latest
* solve alibi cpu error

---------

Co-authored-by: Georgi Gerganov <[email protected]>
@ggerganov Can confirm that it now runs with 42833bc
It seems to terminate early sometimes, but I presume that is due to the NaNs before alibi being discussed there.
This should be resolved in 42833bc
example commands (greedy) to test against huggingface:
python3 convert-refact-hf-to-gguf.py ./Refact-1_6B-fim 1
./main -m ./Refact-1_6B-fim/ggml-model-f16.gguf -n 300 -p "write a function to multiple two integers in python" --temp 1.0 --top-p 1.0 --top-k 1 --repeat_penalty 1.0
resolves #3061