
cuBLAS: keep the weights in VRAM when possible #1269

Closed
slaren opened this issue May 1, 2023 · 11 comments
Labels: enhancement (New feature or request), performance (Speed related topics)

Comments

@slaren
Collaborator

slaren commented May 1, 2023

Currently, the weights are uploaded to VRAM every time that they are used. This usually represents ~30% of the time spent in most mat muls with cuBLAS:

[screenshot: profile of a cuBLAS mat mul showing the time spent uploading weights]

This could be improved by keeping the weights in VRAM; however, there are two issues blocking this:

  1. We need to be able to identify with certainty what tensors are constant / weights
  2. We need to be able to identify when the weights change. With llama.cpp, this can happen after applying a LoRA

Issue 1 can be solved in a not very clean way with #1268 by looking at the matrices that have names ending in ".weight". We could also add some kind of is_weight flag to ggml_tensor, but that would be more intrusive.
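
A minimal sketch of the name-based check, assuming #1268 makes the tensor name available as a null-terminated string (the helper name here is hypothetical):

#include <stdbool.h>
#include <string.h>

// Hypothetical helper: treat a tensor as a weight purely based on its name.
static bool name_is_weight(const char * name) {
    const char * suffix = ".weight";
    size_t name_len   = strlen(name);
    size_t suffix_len = strlen(suffix);
    return name_len >= suffix_len &&
           strcmp(name + name_len - suffix_len, suffix) == 0;
}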

Issue 2 is more complicated. We would need to either add some kind of is_dirty flag to ggml_tensor that is set automatically by the operations that modify a tensor (such as ggml_cpy and the _inplace ops), or add a global flag to ggml-cuda that triggers a full weight re-upload and would need to be set by the application whenever the weights change.

The is_dirty flag would be more intrusive to ggml (essentially we would be adding GPU-specific details to ggml), but it would be automatic for downstream users. The global CUDA-only flag would be less intrusive to ggml, but would force users to deal with this manually.

Any thoughts about this? What would be the best way to add this to ggml, while not interfering with the general goal of not adding GPU-specific details to ggml?

@ggerganov
Owner

Issue 1 should be solved by checking if the op is GGML_OP_NONE:

llama.cpp/ggml.h, lines 322 to 325 at c0335b5:

// compute data
enum ggml_op op;

This indicates that the tensor is not the result of some operator.

In theory, it can be an optimization parameter (this is not relevant for inference, but we should check anyway).
A tensor becomes an optimization parameter by marking it with the ggml_set_param() function:

llama.cpp/ggml.h, lines 738 to 742 at c0335b5:

GGML_API void ggml_set_param(
        struct ggml_context * ctx,
        struct ggml_tensor * tensor);

In this case, the grad member becomes non-NULL.

Therefore, the CUDA code should check: tensor->op == GGML_OP_NONE && tensor->grad == NULL to determine if the tensor is a constant.
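
As a sketch, that check could be wrapped in a small helper on the CUDA side (illustrative only, not existing code):

#include <stdbool.h>
#include "ggml.h"

// A tensor is treated as constant if it is not produced by any op and is not
// an optimization parameter (i.e. ggml_set_param() never allocated a gradient for it).
static bool ggml_cuda_tensor_is_constant(const struct ggml_tensor * t) {
    return t->op == GGML_OP_NONE && t->grad == NULL;
}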

Will think more about issue 2

@slaren
Collaborator Author

slaren commented May 1, 2023

Issue 1 should be solved by checking if the op is GGML_OP_NONE

I think this would work for llama.cpp, but it may fail with tensors that are out-of-graph but not constant, for example the k/v cache. In our case it should still work because the k/v tensors go through some view/reshape/permute operations, so technically they are not interpreted as constants by the time they reach the mat mul. But I think it wouldn't be reliable in every case.

In theory, it can be an optimization parameter (this is not relevant for inference, but we should check anyway). A tensor becomes an optimization parameter by marking it with the ggml_set_param() function

This sounds exactly like what we need, but from what I see, ggml_set_param duplicates the tensor in tensor->grad. Wouldn't this double memory usage?

@ggerganov
Owner

For example, the k/v cache

Good point. We can add an enum value GGML_NOP or GGML_OP_INP or GGML_OP_VAR (not sure what the best name is) and you would set the op member of such tensors to that, just to mark that its value can change externally.
But probably we need some better interface for that, so the developer does not forget to do it.

The ggml_set_param is not suitable because, as you noted, it creates the grad tensor, which doubles the memory.
Maybe we can instead introduce ggml_set_var() or ggml_set_inp() that just sets the op member to that new enum value.


An alternative solution is adding a bool is_var flag to the tensor instead of the enum value. This has the advantage of not having to handle the new enum value in the graph functions. The GPU code can then check:

tensor->op == GGML_OP_NONE && tensor->grad == NULL && !tensor->is_var

to determine if the tensor is constant. And we can update ggml_set_param() to also set tensor->is_var = true, so the GPU check would become just: tensor->op == GGML_OP_NONE && !tensor->is_var
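
A rough illustration with a stand-in struct, since ggml_tensor has no is_var field today and ggml_set_var() is only a proposal:

#include <stdbool.h>

// Stand-in types: only the fields relevant to the check are shown.
enum ex_op { EX_OP_NONE, EX_OP_MUL_MAT /* ... */ };

struct ex_tensor {
    enum ex_op op;      // EX_OP_NONE plays the role of GGML_OP_NONE
    bool       is_var;  // proposed flag: value can change externally (e.g. the k/v cache)
};

// proposed ggml_set_var(); ggml_set_param() would set the same flag
static void ex_set_var(struct ex_tensor * t) {
    t->is_var = true;
}

// the GPU-side constant check then reduces to:
static bool ex_tensor_is_constant(const struct ex_tensor * t) {
    return t->op == EX_OP_NONE && !t->is_var;
}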


Regarding issue 2:
The is_dirty flag + update of all ggml functions does not seem like a good solution. It is indeed intrusive and error-prone.

I am thinking about adding the is_dirty flag to the tensor, but requiring the dev to set it manually for each tensor (e.g. ggml_set_dirty()). The flags will be cleared on every ggml_graph_compute() call by the main implementation. The GPU code will make use of this as you specified. This will allow more fine-grained control compared to the global flag, where you would have to update all tensors even if only one has changed.
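
A minimal sketch of that manual flow, again with stand-in types since neither the is_dirty field nor ggml_set_dirty() exists in ggml:

#include <stdbool.h>
#include <stddef.h>

struct ex_weight {
    bool   is_dirty;  // proposed per-tensor flag
    void * data;
};

// proposed ggml_set_dirty(): called by the app after writing data directly,
// for example after applying a LoRA to this weight
static void ex_set_dirty(struct ex_weight * t) {
    t->is_dirty = true;
}

// conceptually done by ggml_graph_compute() once the backends have re-uploaded
// whatever was flagged
static void ex_clear_dirty(struct ex_weight ** tensors, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        tensors[i]->is_dirty = false;
    }
}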

@gjmulder added the enhancement (New feature or request) and performance (Speed related topics) labels on May 2, 2023
@slaren
Collaborator Author

slaren commented May 2, 2023

What if we set is_dirty automatically in all the dst tensors in ggml_graph_compute or ggml_compute_forward? This may fail with tensors that reuse the data of a different tensor, like views or reshapes, but that could be solved by adding a ggml_tensor * owner field that would be set by these operations.
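
A sketch of that automatic variant with the proposed owner field (stand-in types, nothing here exists in ggml as-is):

#include <stdbool.h>
#include <stddef.h>

struct ex_node {
    bool             is_dirty;
    struct ex_node * owner;  // set by view/reshape/permute ops to the tensor that owns the data
};

// conceptually called from ggml_compute_forward() for each op's dst tensor;
// marking the owning tensor covers views that merely alias its data
static void ex_mark_dst_dirty(struct ex_node * dst) {
    struct ex_node * t = (dst->owner != NULL) ? dst->owner : dst;
    t->is_dirty = true;
}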

@ggerganov
Owner

Could work, but you will still need the ggml_set_dirty() because it is likely that we will want to sometimes modify a tensor by writing to the data directly, instead of going through ggml ops.

Also, it is not obvious when you would clear the flag.
You cannot clear it at the end of ggml_graph_compute() because some tensor could have just become dirty.
You cannot clear the srcs' dirty flags in ggml_compute_forward() because they might be part of later ops.
So I think we will need to maintain a separate set of "new dirty" flags for each compute.
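
One way to realize that "separate set of new-dirty flags" (illustrative only, none of these fields exist):

#include <stdbool.h>

struct ex_dirty_flags {
    bool is_dirty;   // what the GPU code sees during this compute
    bool new_dirty;  // raised by ops while this compute is running
};

// at the end of ggml_graph_compute(): anything the backend has not already
// handled, plus whatever was dirtied during this compute, carries over
static void ex_dirty_flags_carry_over(struct ex_dirty_flags * f) {
    f->is_dirty  = f->is_dirty || f->new_dirty;
    f->new_dirty = false;
}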

@slaren
Collaborator Author

slaren commented May 2, 2023

The flag would be cleared in the GPU code after uploading the weights to VRAM. The logic would be something like this:

// re-upload if not cached yet, or if the tensor (or the tensor that owns its data) changed
if (!cached || t->is_dirty || (t->owner && t->owner->is_dirty)) {
    upload_tensor(t);      // copy the weights to VRAM again
    t->is_dirty = false;   // the cached copy is now up to date
}

@ggerganov
Owner

The owner could probably be avoided by traversing the sources of a tensor, along with their sources, etc. The logic is that if a tensor is a result of an operation involving a dirty tensor, then it is also dirty.
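
A sketch of that recursive check with stand-in source pointers (ggml_tensor keeps its operands in src0/src1, but the types here are illustrative):

#include <stdbool.h>
#include <stddef.h>

struct ex_src_node {
    bool                 is_dirty;
    struct ex_src_node * src0;   // stand-ins for a tensor's source operands
    struct ex_src_node * src1;
};

// a tensor is dirty if it was flagged itself or if anything it was computed
// from is (transitively) dirty
static bool ex_is_dirty_recursive(const struct ex_src_node * t) {
    if (t == NULL) {
        return false;
    }
    if (t->is_dirty) {
        return true;
    }
    return ex_is_dirty_recursive(t->src0) || ex_is_dirty_recursive(t->src1);
}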

Let's start with manually setting is_dirty via ggml API and clearing it after ggml_graph_compute().
The automatic is_dirty detection could be added after that, but at the moment I feel it is an unnecessary complication.
Will think about it more though

@SlyEcho
Sponsor Collaborator

SlyEcho commented May 5, 2023

Could it be done at a higher level, from llama.cpp? Just like it manages the scratch buffers or the KV cache. It also knows exactly in what order the weights are used, so it could start loading the next layer's weights ahead of time, or even do that at the end: load the first ones again for the next evaluation.
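
To illustrate the overlap idea with plain CUDA runtime calls (this is not how llama.cpp or ggml is structured, just a sketch; the function name is hypothetical):

#include <stddef.h>
#include <cuda_runtime.h>

// Enqueue the next layer's weights on a dedicated copy stream so the transfer
// overlaps with the current layer's compute. For the copy to actually overlap,
// the host buffer must be pinned (cudaMallocHost / cudaHostRegister).
static void prefetch_weights(void * d_dst, const void * h_src, size_t size,
                             cudaStream_t copy_stream) {
    cudaMemcpyAsync(d_dst, h_src, size, cudaMemcpyHostToDevice, copy_stream);
}

// Before the compute stream uses those weights, wait for the copy to finish,
// e.g. with cudaStreamSynchronize(copy_stream) or a cudaEvent_t dependency.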

@slaren
Collaborator Author

slaren commented May 5, 2023

llama.cpp builds a graph which is then executed by ggml; it is not synchronous, so these instructions to "prefetch" the next layer's weights would have to be in the graph, which would require adding GPU-specific ops to ggml. Eventually, I think the fastest option would be hybrid GPU/CPU inference, in which some layers are processed entirely on the GPU (as much as the available VRAM allows) and the rest on the CPU. But currently that would have to be in a fork.

@SlyEcho
Sponsor Collaborator

SlyEcho commented May 5, 2023

It could be useful to prefetch data in other environments as well: low-RAM devices could fetch from disk, and machines low on disk space could fetch from the network or from a database.

It could also be used to dequantize some formats where the whole tensor has to be processed instead of one row.

It's a thought I had.

@slaren
Collaborator Author

slaren commented May 5, 2023

I think it's a good idea; I am just not sure it would fit in ggml as long as the goal is to keep GPU stuff separated.
