ggml: fix gradient allocation logic #966
base: master
Conversation
I forgot: there is a similar issue when replacing the original gradient tensors during backwards graph construction when not using gradient accumulation. The original gradient tensors with […]

Would it be simpler to add a flag to […]

Do you mean skip their creation or skip their allocation?

The creation. It would be a flag such as […]

That would work for eliminating the need for a tensor flag, but it would still require a change in the function for each GGML op, and I personally think it would be preferable not to add state to […]

Generally I would agree that it is preferable to have pure functions that have no state, but this is a fairly simple state. I have some issues with this approach: […]

I think that adding a […]

How about this: remove the gradient logic from the forward pass construction completely and instead replace it with a pass over the forward graph in […]

That sounds good to me. I am assuming that not too many operations would require specific handling (to exclude some of their parameters, I imagine), but either way that could be refactored in the future into objects (or types) that have all the details of an operation.
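The separate-pass idea discussed above could be sketched roughly as follows. This is a minimal, self-contained illustration with simplified stand-in types, not the actual ggml implementation; `mark_grads`, `MAX_SRC`, and `FLAG_PARAM` are illustrative names:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SRC    4
#define FLAG_PARAM 1 /* trainable parameter */

struct tensor {
    int            flags;
    struct tensor *src[MAX_SRC];
    bool           needs_grad; /* filled in by mark_grads() */
};

/* Visit the forward graph in topological order (sources before consumers)
 * and decide per tensor whether it needs a gradient: parameters always do,
 * and so does any tensor with at least one source that does. */
static void mark_grads(struct tensor **nodes, int n) {
    for (int i = 0; i < n; ++i) {
        struct tensor *t = nodes[i];
        t->needs_grad = (t->flags & FLAG_PARAM) != 0;
        for (int j = 0; j < MAX_SRC && !t->needs_grad; ++j) {
            if (t->src[j] != NULL && t->src[j]->needs_grad) {
                t->needs_grad = true;
            }
        }
    }
}
```

Because the decision happens in one pass over the finished forward graph instead of inside each op's construction function, no per-op changes are needed.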
On master, the general logic for determining whether a tensor should receive gradients is as follows: parameters are given gradients, and tensors where at least one source has gradients are also given gradients. This works correctly for the forward pass. For the backward pass, however, this logic is incorrect, because many operations reuse the same operations as the forward pass, with forward-pass tensors as sources. As a consequence, gradients are themselves determined to need gradients, and this propagates through the rest of the backward pass. The result is that with the code on master a lot of extra tensors are created and allocated that are not actually needed for anything. With code making use of `ggml_backend_sched` there is no excessive memory allocation, because only the tensors in a specific graph are allocated, but the correctly allocated tensors have pointers to unallocated tensors, which then causes problems with `ggml_graph_reset`.

I think the correct way to fix these problems is to change the logic for determining whether or not a tensor should receive gradients upon creation. First, explicitly mark gradients as such with a tensor flag. During the backward pass at least one of the source tensors will be the gradient of another tensor, so in those cases gradients for the newly created tensor are never added. Otherwise, use the same logic as on master, where gradients are added if at least one source has gradients.
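The decision rule described above could look roughly like this. This is a hypothetical sketch with simplified stand-in types; `FLAG_GRAD`, `should_have_grad`, and the field names are illustrative, not the real ggml API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SRC    4
#define FLAG_PARAM 1 /* trainable parameter */
#define FLAG_GRAD  2 /* tensor is itself the gradient of another tensor */

struct tensor {
    int            flags;
    struct tensor *src[MAX_SRC];
    struct tensor *grad; /* non-NULL if this tensor has a gradient */
};

/* Decide at creation time whether a tensor should receive a gradient:
 * if any source is itself a gradient we are inside the backward pass,
 * so never add one; otherwise add one if the tensor is a parameter or
 * at least one source already has a gradient (the master logic). */
static bool should_have_grad(const struct tensor *t) {
    bool src_has_grad = false;
    for (int i = 0; i < MAX_SRC; ++i) {
        if (t->src[i] == NULL) {
            continue;
        }
        if (t->src[i]->flags & FLAG_GRAD) {
            return false; /* built from a gradient: backward-pass tensor */
        }
        if (t->src[i]->grad != NULL) {
            src_has_grad = true;
        }
    }
    return (t->flags & FLAG_PARAM) != 0 || src_has_grad;
}
```

The early `return false` is what stops gradients from propagating into the backward pass, fixing the over-allocation described above.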
Unfortunately, the logic on master is currently duplicated for each GGML op, so the above change requires changing a large number of lines in `ggml.c`. I wrote a small utility function `ggml_set_grad` that can be applied after tensor creation to add gradients, since the logic should be the same regardless of the specific GGML op. This function also asserts that the operation is not in-place, since this is currently not handled correctly (on master the combination of in-place operations and gradients sometimes causes a failed assert and sometimes just discards the gradients). Note that even without the changes in this PR, a function like `ggml_set_grad` will likely become necessary in the future anyway for specifying different data types for gradients and weights.

While going through the code I also fixed the formatting as best as I could.