
Option to split during conversion #6942

Merged: 73 commits into ggerganov:master on Jun 24, 2024

Conversation

christianazinn
Contributor

@christianazinn commented Apr 27, 2024

This PR introduces additional options to convert.py that let users split a model into shards while converting, rather than having to split it after conversion, including a small metadata-only first shard by default as outlined in #6463.

Other functionality we ought to have includes --split-max-size (so far it's just --split-max-tensors), displaying estimated shard sizes, dry running, and sharding support for the other convert-*-to-*.py scripts. This will be considered a draft until those are worked out. It also needs considerable testing, but since this only touches the Python scripts, it can be tested easily.
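For orientation, here is a minimal argparse sketch of the flags used in the examples below; the defaults and help strings are illustrative assumptions, not the PR's actual code.

```python
# Sketch only: the command-line surface described above. Defaults and help
# text are assumptions for illustration.
import argparse

def add_split_args(parser: argparse.ArgumentParser) -> None:
    parser.add_argument("--split", action="store_true",
                        help="split the converted model into multiple shards")
    parser.add_argument("--split-max-tensors", type=int, default=None,
                        help="maximum number of tensors per shard")
    parser.add_argument("--split-max-size", type=str, default=None,
                        help="maximum shard size, e.g. 64M or 4G")
    parser.add_argument("--dry-run", action="store_true",
                        help="only print the planned shards, do not write files")
    parser.add_argument("--large-first-shard", action="store_true",
                        help="put tensors in the first shard instead of metadata only")

parser = argparse.ArgumentParser("convert.py")
add_split_args(parser)
args = parser.parse_args(["--split", "--split-max-tensors", "20", "--dry-run"])
```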

Usage

(examples are using zephyr-smol_llama-100m-sft-full)

Example, --split-max-size

python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-size 64M

Output: the same as what master prints to stdout, followed by:

Writing the following files:
    /path/to/outfile-00001-of-00005.gguf: n_tensors = 0, total_size = negligible - metadata only
    /path/to/outfile-00002-of-00005.gguf: n_tensors = 1, total_size = 47.1M
    /path/to/outfile-00003-of-00005.gguf: n_tensors = 11, total_size = 63.6M
    /path/to/outfile-00004-of-00005.gguf: n_tensors = 32, total_size = 63.4M
    /path/to/outfile-00005-of-00005.gguf: n_tensors = 13, total_size = 19.1M

Writing shard 2/5 with 1/57 tensors remaining (of 57 total)
[1/1] Writing tensor output.weight                          | size  32128 x    768  | type F16  | T+   2

Writing shard 3/5 with 11/56 tensors remaining (of 57 total)
[ 1/11] Writing tensor token_embd.weight                      | size  32128 x    768  | type F16  | T+   2
[ 2/11] Writing tensor blk.0.attn_norm.weight                 | size    768           | type F32  | T+   3
[ 3/11] Writing tensor blk.0.ffn_down.weight                  | size    768 x   3072  | type F16  | T+   3
[ 4/11] Writing tensor blk.0.ffn_gate.weight                  | size   3072 x    768  | type F16  | T+   3
[ 5/11] Writing tensor blk.0.ffn_up.weight                    | size   3072 x    768  | type F16  | T+   3
[ 6/11] Writing tensor blk.0.ffn_norm.weight                  | size    768           | type F32  | T+   3
[ 7/11] Writing tensor blk.0.attn_k.weight                    | size    256 x    768  | type F16  | T+   3
[ 8/11] Writing tensor blk.0.attn_output.weight               | size    768 x    768  | type F16  | T+   3
[ 9/11] Writing tensor blk.0.attn_q.weight                    | size    768 x    768  | type F16  | T+   3
[10/11] Writing tensor blk.0.attn_v.weight                    | size    256 x    768  | type F16  | T+   3
[11/11] Writing tensor blk.1.attn_norm.weight                 | size    768           | type F32  | T+   3

Writing shard 4/5 with 32/45 tensors remaining (of 57 total)
[ 1/32] Writing tensor blk.1.ffn_down.weight                  | size    768 x   3072  | type F16  | T+   0
[etc...]

With --split-max-size 200M (or any number greater than the total resultant size), it gives:

Model has smaller size than the split threshold, not splitting

Writing the following files:
    /path/to/outfile.gguf: n_tensors = 57, total_size = 193.2M

[the rest of output is the same as in master]
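The shard filenames above follow the same `-NNNNN-of-NNNNN.gguf` pattern gguf-split produces; below is a rough sketch of that naming and of parsing a human-readable size limit. The helper names are illustrative, not the PR's actual code (base-1000 units per a later commit in this PR).

```python
# Sketch of the shard naming and size-string parsing implied by the output
# above; function names are illustrative, not the PR's actual helpers.
SHARD_NAME_FORMAT = "{stem}-{n:05d}-of-{total:05d}.gguf"

def shard_path(outfile: str, index: int, total: int) -> str:
    # /path/to/outfile.gguf -> /path/to/outfile-00002-of-00005.gguf
    stem = outfile.removesuffix(".gguf")
    return SHARD_NAME_FORMAT.format(stem=stem, n=index, total=total)

def parse_split_max_size(limit: str) -> int:
    # "64M" -> 64_000_000, "4G" -> 4_000_000_000 (base-1000)
    units = {"K": 10**3, "M": 10**6, "G": 10**9}
    suffix = limit[-1].upper()
    return int(limit[:-1]) * units[suffix] if suffix in units else int(limit)

print(shard_path("/path/to/outfile.gguf", 2, 5))  # .../outfile-00002-of-00005.gguf
print(parse_split_max_size("64M"))                # 64000000
```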

Example, --split-max-tensors with --dry-run, --large-first-shard

python3 convert.py --outfile /path/to/outfile.gguf --outtype f16 /path/to/safetensors --split --split-max-tensors 20 --dry-run --large-first-shard

Output: the same as what master prints to stdout, followed by:

Writing the following files:
    /path/to/outfile-00001-of-00003.gguf: n_tensors = 20, total_size = 127.1M
    /path/to/outfile-00002-of-00003.gguf: n_tensors = 20, total_size = 37.5M
    /path/to/outfile-00003-of-00003.gguf: n_tensors = 17, total_size = 28.5M

Dry run, not writing files

With --split-max-tensors 64 (or any number greater than the total tensor count), it gives:

Model has fewer tensors than the split threshold, not splitting

Writing the following files:
    /path/to/outfile.gguf: n_tensors = 57, total_size = 193.2M

Dry run, not writing files
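To show roughly where the dry-run numbers come from, here is an illustrative sketch that plans shards by a tensor-count limit and sums tensor byte counts per shard without writing anything; the helper names are assumptions, not the PR's code.

```python
# Illustrative dry-run estimate; helper names are assumptions.
import numpy as np

def format_n_bytes(n: float) -> str:
    # 49348608 -> "49.3M" (base-1000)
    for unit in ("", "K", "M", "G"):
        if abs(n) < 1000.0:
            return f"{n:.1f}{unit}"
        n /= 1000.0
    return f"{n:.1f}T"

def plan_shards(tensors: dict[str, np.ndarray], max_tensors: int) -> None:
    names = list(tensors)
    shards = [names[i:i + max_tensors] for i in range(0, len(names), max_tensors)]
    for i, shard in enumerate(shards, start=1):
        size = sum(tensors[name].nbytes for name in shard)
        print(f"outfile-{i:05d}-of-{len(shards):05d}.gguf: "
              f"n_tensors = {len(shard)}, total_size = {format_n_bytes(size)}")

plan_shards({"output.weight": np.zeros((32128, 768), np.float16),
             "blk.0.attn_norm.weight": np.zeros((768,), np.float32)},
            max_tensors=1)
```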

@christianazinn marked this pull request as draft on April 27, 2024 at 04:24
@christianazinn
Contributor Author

I've added support for --split-max-size and --dry-run, taking a page out of gguf-split.cpp. Faced with adding split functionality to the convert-*-to-*.py scripts, I wonder whether this should be added to the GGUFWriter class itself rather than to the convert scripts, since it would be tedious to rewrite every write_tensors method in convert-hf-to-gguf.py.

The counterpoint I can see to doing this is that GGUFWriter should only write one file, since it's GGUFWriter and not GGMLWriter. It would also be very annoying to rewrite GGUFWriter, and I'm hesitant to touch the gguf package as a novice. But it's also likely nobody thought of this scenario when creating the file, so perhaps there's good reason to make these changes in the GGUFWriter class. @phymbert thoughts?

@phymbert
Collaborator

This is already a good start. Could you add an end-to-end usage example to the summary?

@christianazinn
Contributor Author

christianazinn commented Apr 28, 2024

Sure thing (I assume you mean examples of usage and expected outputs).

I also plan to rework the implementation by consolidating code into a new GGUFManager class that handles multiple file writes via multiple GGUFWriter instances, so GGUFWriter itself still only writes to one file. This is because each Model in convert-hf-to-gguf.py has only one GGUFWriter instance, so splitting would be nearly impossible there. Usage should remain the same, but the code will be fundamentally altered. (I also imagine this could affect memory usage, so that will need to be heavily tested.)
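Roughly, the idea is a thin wrapper that owns one writer per shard and routes tensor writes to the current shard, something like the sketch below. Class and method names here are illustrative, not the gguf package's real API (and per the later commit history, the functionality ultimately landed in gguf_writer.py rather than a separate class).

```python
# Hedged sketch of the "manager owns several writers" idea; names are illustrative.
from __future__ import annotations
import numpy as np

class ShardWriter:
    """Stand-in for a single-file writer (a real GGUFWriter would go here)."""
    def __init__(self, path: str) -> None:
        self.path = path
        self.tensors: list[tuple[str, np.ndarray]] = []

    def add_tensor(self, name: str, data: np.ndarray) -> None:
        self.tensors.append((name, data))

class GGUFManagerSketch:
    """Opens a new shard whenever the current one hits the tensor limit."""
    def __init__(self, outfile: str, max_tensors: int) -> None:
        self.outfile = outfile
        self.max_tensors = max_tensors
        self.shards: list[ShardWriter] = []

    def add_tensor(self, name: str, data: np.ndarray) -> None:
        if not self.shards or len(self.shards[-1].tensors) >= self.max_tensors:
            self.shards.append(ShardWriter(f"{self.outfile}.shard{len(self.shards) + 1}"))
        self.shards[-1].add_tensor(name, data)
```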

@christianazinn
Contributor Author

I'll need to implement for convert-llama-ggml-to-gguf.py and convert-persimmon-to-gguf.py soon - what are some models that require those scripts for conversion, so I can test? Also, I see convert-lora-to-ggml.py doesn't even use GGUFWriter - is that just for converting LoRA adapters? Is that something we should even add splitting for, considering the small size of LoRA adapters?

Anyway, GGUFManager is implemented as a near drop-in replacement for GGUFWriter that supports file splitting, so far only in convert.py (migrated from my previous commits); support for convert-hf-to-gguf.py is next up.

@slaren
Collaborator

slaren commented Apr 28, 2024

convert-llama-ggml-to-gguf.py is for conversion of pre-gguf models. At this point it could be removed. convert-lora-to-ggml.py doesn't export to gguf format. convert-persimmon-to-gguf.py should probably be integrated into convert-hf-to-gguf.py, but I don't think it needs to be updated.

@christianazinn
Contributor Author

Got it - will only implement for convert-hf-to-gguf.py. Remind me to watch memory usage while converting. Since I'm making changes to the gguf package, how will I push those?

@slaren
Collaborator

slaren commented Apr 29, 2024

You can modify the gguf package in the gguf-py directory in this repository. There are instructions for publishing new releases in https://github.com/ggerganov/llama.cpp/blob/master/gguf-py/README.md.

@christianazinn
Contributor Author

> You can modify the gguf package in the gguf-py directory in this repository

That's what I've been doing so far; will check out instructions to contribute, thanks!

@christianazinn
Contributor Author

Testing on Mistral 7B Instruct, this branch's convert.py uses approximately the same amount of memory as master's. Will need to check larger models, since the discrepancy was around 6% (3.6G vs. 3.4G at peak). Obviously memory plays a major role in splitting larger files, which is the entire point of this PR.
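For what it's worth, one way to sanity-check peak memory of a conversion run is to read peak RSS at the end of the script; this is only a suggested methodology, not how the figures above were collected.

```python
# Peak-RSS check; an assumed methodology, not how the figures above were measured.
import resource
import sys

def peak_rss_gib() -> float:
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux, bytes on macOS
    return peak / 1024**3 if sys.platform == "darwin" else peak / 1024**2

print(f"peak RSS: {peak_rss_gib():.2f} GiB")
```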

@christianazinn
Contributor Author

Running tests on my side for all convert-hf-to-gguf.py supported model architectures. What models fall under QWenLMHeadModel - is that just plain QWen 1?

@christianazinn
Contributor Author

christianazinn commented May 2, 2024

Will keep track of tests here as I go. Picking one model from each architecture in convert-hf-to-gguf.py as it exists in my branch and testing; will need assistance testing, say, vision models, which I'm not as familiar with. Also note that I went with smaller models to test the architecture; larger models should act the same, but again, tests will be needed.

It also seems like the current convert-hf-to-gguf.py doesn't print tensor status as it goes, which I intend to change.

  • GPTNeoX: EleutherAI/gpt-neox-20b - FAILED LOADING with "unknown architecture" (failed on master as well)
  • Bloom: bigscience/bloom-7b1 - WORKS
  • MPT: mosaicml/mpt-7b - WORKS
  • Orion: OrionStarAI/Orion-14B-Chat - WORKS
  • Baichuan: baichuan-inc/Baichuan2-7B-Chat - WORKS
  • Xverse: xverse/XVERSE-7B-Chat - FAILED CONVERSION with "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 78 column 3" (failed on master as well)
  • Falcon: tiiuae/falcon-7b-instruct - WORKS
  • GPTBigCode: bigcode/gpt_bigcode-santacoder - FAILED LOADING with "tensor output.weight not found" (failed on master as well)
  • GPTRefact: smallcloudai/Refact-1_6B-fim - WORKS (incoherent code but I assume that's what it's used for)
  • Persimmon: adept/persimmon-8b-chat - Strictly "WORKS" but is incoherent - I assume this has to do with prompt formatting on master so I won't look further. It loads and generates.
  • StableLM: stabilityai/stablelm-2-1_6b-chat - WORKS
  • Mistral: mistralai/Mistral-7B-Instruct-v0.2
  • Llama2: meta-llama/Llama-2-7b-chat-hf
  • DBRX: databricks/dbrx-instruct
  • MiniCPM: openbmb/MiniCPM-V-2
  • Qwen1: Qwen/Qwen-1_8B
  • Qwen2: Qwen/Qwen1.5-1.8B
  • Qwen MoE: Qwen/Qwen1.5-MoE-A2.7B-Chat
  • GPT2: openai-community/gpt2
  • Phi2: microsoft/phi-2
  • Phi3: microsoft/Phi-3-mini-4k-instruct
  • Plamo: pfnet/plamo-13b-instruct
  • CodeShell: WisdomShell/CodeShell-7B-Chat
  • InternLM: internlm/internlm2-chat-7b
  • BERT: avsolatorio/GIST-Embedding-v0
  • NomicBERT: nomic-ai/nomic-embed-text-v1.5
  • Gemma: google/gemma-1.1-2b-it
  • StarCoder2: bigcode/starcoder2-3b
  • Mamba: TRI-ML/mamba-7b-rw
  • Cohere: CohereForAI/c4ai-command-r-v01
  • OLMo: allenai/OLMo-7B-Instruct

@christianazinn
Contributor Author

Leaving a note for myself to watch merge conflicts with #6511. Development on this branch has slowed down as I'm pretty busy.

@christianazinn
Contributor Author

Noting time to convert baichuan-inc/Baichuan2-7B-Chat.

| Configuration | real | user | sys |
| --- | --- | --- | --- |
| New branch, `--split --split-max-size 4G` | 6m27.788s | 1m15.914s | 0m46.017s |
| New branch, no split | 7m17.661s | 1m18.516s | 0m44.285s |
| master | 5m57.387s | 1m14.567s | 0m48.403s |

Note that these conversions were done writing the outfile over 2.5GbE, so there was considerable time spent just saving the file. Will test more later, but it doesn't seem like the change increases conversion time too significantly.

@mofosyne added the "Review Complexity : Medium", "python", and "enhancement" labels on May 9, 2024
@mofosyne
Collaborator

mofosyne commented May 9, 2024

Merge attempted. Some ambiguous lines, so @christianazinn should give this a lookover to make sure the intent is still correct.

@christianazinn
Contributor Author

christianazinn commented May 9, 2024

I'll check in a few hours and fix conflicts.

@christianazinn
Contributor Author

The new get-vocab-base-pre functionality introduced to convert-hf-to-gguf.py by #6920 is throwing me off, but things look fine for the most part. Push incoming for conflict resolution; testing Refact with convert-hf-to-gguf.py worked, and no fundamental changes are required to convert.py. This will remain mostly dormant for another two weeks or so while I focus on finals, but since the code is already almost all implemented, if other people want to pick this up and take the PR to the finish line, I'd more than appreciate it.

Collaborator

@compilade left a comment


I'm satisfied with how this turned out. I did not test this extensively, but from the conversions I tried (with --split-max-size and with no split, both with q8_0 and f16), this worked well.

A future PR to add split model support to GGUFReader would be nice.
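Purely as a hypothetical sketch of what split-aware reading might involve (not an existing GGUFReader capability), the first step would be enumerating sibling shards from the `-NNNNN-of-NNNNN` suffix:

```python
# Hypothetical sketch only: locate all shards of a split model from any one
# shard's filename. GGUFReader does not do this as of this PR.
import re
from pathlib import Path

def find_shards(shard: str) -> list[Path]:
    m = re.match(r"(.*)-(\d{5})-of-(\d{5})\.gguf$", shard)
    if m is None:
        return [Path(shard)]  # not a split model
    stem, total = m.group(1), int(m.group(3))
    return [Path(f"{stem}-{i:05d}-of-{total:05d}.gguf") for i in range(1, total + 1)]

print(find_shards("/path/to/outfile-00002-of-00005.gguf"))
```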

@christianazinn marked this pull request as ready for review on June 15, 2024 at 15:29
@christianazinn
Contributor Author

Forgot to mark as ready for review. Can probably be merged.

@compilade added the "merge ready" label and removed the "help wanted" and "examples" labels on Jun 15, 2024
@mofosyne
Collaborator

A few days have passed with the merge-ready label; CI passed and the PR has approval.

Consensus achieved? I'll presume so by the end of the week.

@christianazinn
Contributor Author

It's been about a week and I see no dissent so far.

@mofosyne merged commit 52fc870 into ggerganov:master on Jun 24, 2024; 18 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Jun 30, 2024
* support splits in convert.py

* Support split by size and dry run to write estimated shards/filesizes

* Move split functionality to new GGUFManager class

* fix improper function signature

* tentative push of convert-hf-to-gguf support

* resolve merge + SplitArguments for easier parsing

* Fix eager tensor memory leak and remove convert.py changes

Removed a memory leak caused by unexpected reference retention to eager tensors.

Also removed GGUFManager functionality in convert.py in favor of specializing for convert-hf-to-gguf.py.

* refactor SplitStrategy to be a deque

Instead of having SplitStrategy have a `data` field that is a deque, just have SplitStrategy be a subclass of deque itself.

* fix Q8 quantization

* remove unnecessary imports in gguf_manager

* fix final? merge issue

* fix gguf_writer placement and remove comments

* oops, actually fix gguf_writer placement

* reduce duplicated code from gguf_writer

* further simplify GGUFManager

* simplify even further and standardize with GGUFWriter

* reduce diffs with master

* form shards while adding tensors, SHA256 sums agree with master

* re-add type hint

Co-authored-by: compilade <[email protected]>

* GGUFWriter compatibility fix

Co-authored-by: compilade <[email protected]>

* Shard dataclass and un-negative dont_add_architecture

* type consistency in format_n_bytes_to_str

* move kv keys to constants.py

* make pathlib explicit

* base-1024 bytes to base-1000

* rename GGUFManager to GGUFWriterSplit

* Update gguf-py/gguf/constants.py

Co-authored-by: compilade <[email protected]>

* fix convert-hf-to-gguf.py permissions

* fix line endings

* Update gguf-py/gguf/gguf_writer_split.py

Co-authored-by: compilade <[email protected]>

* convert-hf : restore executable file permission

* examples/convert-legacy-llama.py: restore executable file permission

* reinstate original gguf package import and fix type annotation

* attempt to appease the linter

* attempt 2 to appease the linter

* attempt 3 to appease the linter

* comma consistency

* Update convert-hf-to-gguf.py

Co-authored-by: compilade <[email protected]>

* edit cmd line args

* use simplification from ggerganov#7827

* kv/ti data are still wrong

* try to refactor kv data (still fails)

* fix ti data messiness

* tidy up

* fix linting

* actually make the linter happy

* cleanup round 1

* remove SplitStrategy, SplitArguments

* appease linter

* fix typing and clean up

* fix linting

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* progress bar, fix split logic

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* catch oversights

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* swap bar orders

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* compatibility fix

* Update gguf-py/gguf/gguf_writer.py

Co-authored-by: compilade <[email protected]>

* Update convert-hf-to-gguf.py

Co-authored-by: compilade <[email protected]>

---------

Co-authored-by: Brian <[email protected]>
Co-authored-by: compilade <[email protected]>
MagnusS0 pushed a commit to MagnusS0/llama.cpp-normistral-tokenizer that referenced this pull request Jul 1, 2024