split: include the option in ./convert.py and quantize #6260

Open
phymbert opened this issue Mar 23, 2024 · 9 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed split GGUF split model sharding

Comments

@phymbert
Collaborator

phymbert commented Mar 23, 2024

Context

At the moment it is only possible to split after conversion or quantization. Mentioned by @Artefact2 in this [comment](https://github.com/ggerganov/llama.cpp/pull/6135#issuecomment-2003942162):

as an alternative, add the splitting logic directly to tools that produce ggufs, like convert.py and quantize.

Proposition

Include split options in convert*.py, and support splits in quantize.

@phymbert phymbert added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers need feedback Testing and feedback with results are needed split GGUF split model sharding labels Mar 23, 2024
@phymbert
Collaborator Author

@ggerganov not urgent at all, but we might keep this in mind. I have added the good first issue label; feel free to remove it.

@phymbert phymbert removed the need feedback Testing and feedback with results are needed label Mar 23, 2024
@ggerganov
Owner

Yes, creating good first issues is encouraged so more people can get involved in the project.

@christianazinn
Contributor

I'd like to work on this as a first issue; can I be assigned? And how much has already been implemented in resolving #6548? It looks like that's just adding support for writing to shards when quantizing existing shards, rather than writing to shards in general, but even so, some of that implementation could probably be reused.

@phymbert
Collaborator Author

Hello, I believe that for quantize, the new --keep-split option is enough, thanks to @zj040045.

But yes, it would be nice to generate shards at convert time.

Feel free to submit a PR.
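
For context, a minimal sketch of how --keep-split might be driven from Python; the binary path, file names, and quantization type are placeholders, and the exact flag behavior should be checked against the quantize --help output:

```python
# Sketch: quantize a split F16 model while preserving its shard layout.
# Assumes a local llama.cpp build with the new --keep-split flag; the
# file names and the Q4_K_M type below are illustrative placeholders.
import subprocess

subprocess.run(
    [
        "./quantize",
        "--keep-split",                   # keep the input's number of shards
        "model-00001-of-00003-f16.gguf",  # first shard of the split input
        "model-q4_k_m.gguf",              # output name; per-shard names derive from it
        "Q4_K_M",                         # target quantization type
    ],
    check=True,
)
```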

@christianazinn
Contributor

The implementation of --keep-split currently keeps the number of shards constant, but I imagine there's a use case for quantizing an unsplit high-precision file into multiple shards. Once splitting is implemented at convert time this will be less of an issue, but it is perhaps still desirable. Thoughts?

@christianazinn
Contributor

Preliminary observations after some attempts: This is considerably harder to implement for convert*.py than for quantize, since the conversion scripts are in Python, not C++. I've gotten at least a dozen different errors so far and can conclude that using the GGUFWriter class is not likely to work.

I figure I'll need to write the conversion method so that it writes the tensors to shards as it converts them, but to do that I need to know how to format those shards, and it's faster to ask than to parse the sparsely commented code. It appears that the naive implementation, where every shard after the first is composed purely of tensors, isn't what's going on, so I'd like some clarification. @phymbert, I believe you wrote the gguf-split code?

When the files are split, does llama.cpp expect each shard to:

  • have its own copy of the header, or should that all be in just the first shard? What about other metadata (kv entries)?
  • have gguf_tensor_info only for the tensors it contains, or should the first shard contain all tensors' info?
  • have any kv entries other than the default? (e.g. I see LLM_KV_SPLIT_NO and so on.)

In general, how does llama.cpp expect to see the data formatted within the shards?

Apologies for the questions, just catching up to speed.
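
One way to answer these questions empirically is to open shards produced by ./gguf-split with the gguf Python package and dump what each one actually contains. A minimal inspection sketch, assuming a three-way split with placeholder file names:

```python
# Dump the kv entries and tensor infos of each shard to see how
# ./gguf-split lays them out. File names are placeholders.
from gguf import GGUFReader

for i in range(1, 4):
    path = f"model-{i:05d}-of-00003.gguf"
    reader = GGUFReader(path)
    print(path)
    for key in reader.fields:        # kv entries present in this shard
        print("  kv:", key)
    for tensor in reader.tensors:    # tensor infos stored in this shard
        print("  tensor:", tensor.name, tensor.shape)
```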

@phymbert
Collaborator Author

In general, how does llama.cpp expect to see the data formatted within the shards?

Each shard is a valid GGUF. The first approach is to create a GGUF per batch of tensors.
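
A minimal sketch of that batch-per-shard approach using the gguf Python package. The split.* key names mirror the LLM_KV_SPLIT_* constants mentioned above, but the value types and the zero-based shard index are assumptions, and a real converter would also write the full model kv metadata (at least into the first shard):

```python
# Write one complete, valid GGUF per batch of tensors. `tensors` is a list
# of (name, numpy array) pairs; arch and batch_size are placeholders.
from gguf import GGUFWriter

def write_shards(tensors, arch="llama", batch_size=128):
    batches = [tensors[i:i + batch_size] for i in range(0, len(tensors), batch_size)]
    for no, batch in enumerate(batches):
        path = f"model-{no + 1:05d}-of-{len(batches):05d}.gguf"
        writer = GGUFWriter(path, arch)
        # Split bookkeeping, named after LLM_KV_SPLIT_NO / _COUNT /
        # _TENSORS_COUNT; exact value types are assumptions.
        writer.add_uint16("split.no", no)
        writer.add_uint16("split.count", len(batches))
        writer.add_int32("split.tensors.count", len(tensors))
        for name, data in batch:     # each shard only describes its own tensors
            writer.add_tensor(name, data)
        writer.write_header_to_file()
        writer.write_kv_data_to_file()
        writer.write_tensors_to_file()
        writer.close()
```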

@christianazinn
Contributor

I see, thanks. (I had actually solved my own problem not long after posting the question and now I feel foolish. PR forthcoming.)

@phymbert
Collaborator Author

No worries, keep trying.
