
ggml: fix gradient allocation logic #966

Open · wants to merge 1 commit into base: master
Conversation

JohannesGaessler
Collaborator

On master the general logic for determining whether a tensor should receive gradients is as follows: parameters are given gradients, and tensors where at least one source has gradients are also given gradients. This works correctly for the forward pass. For the backward pass, however, this logic is incorrect: many operations implement their backward pass using the same operations as the forward pass, with tensors from the forward pass as sources. As a consequence the gradient tensors themselves are determined to need gradients, and this propagates through the rest of the backward pass. The result is that with the code on master a lot of extra tensors are created and allocated that are not actually needed for anything. Code that uses ggml_backend_sched does not allocate this excess memory because only the tensors in a specific graph are allocated, but the correctly allocated tensors then hold pointers to unallocated tensors, which causes problems with ggml_graph_reset.
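
For reference, the per-op pattern that produces this behavior looks roughly like the following. This is a paraphrased sketch of e.g. ggml_add on master, not verbatim ggml.c code, and the helper name is made up for illustration:

```c
#include "ggml.h"

// Paraphrased sketch of the per-op gradient logic on master (e.g. ggml_add).
// During the backward pass `a` and `b` can themselves be tensors that already
// carry gradients, so `result` gets a gradient too and the problem propagates
// through the rest of the backward graph.
static struct ggml_tensor * add_like_on_master(struct ggml_context * ctx,
                                               struct ggml_tensor * a,
                                               struct ggml_tensor * b) {
    bool is_node = false;
    if (a->grad || b->grad) {   // "at least one source has gradients"
        is_node = true;
    }

    struct ggml_tensor * result = ggml_new_tensor(ctx, a->type, GGML_MAX_DIMS, a->ne);

    result->op     = GGML_OP_ADD;
    result->grad   = is_node ? ggml_dup_tensor(ctx, result) : NULL;
    result->src[0] = a;
    result->src[1] = b;

    return result;
}
```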

I think the correct way to fix these problems is to change the logic that determines whether a tensor should receive gradients upon creation. First, explicitly mark gradient tensors as such with a tensor flag. During the backward pass at least one of the source tensors will be the gradient of another tensor, so in those cases gradients are never added to the newly created tensor. Otherwise, use the same logic as on master where gradients are added if at least one source has gradients.
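
As a sketch (the flag name and value below are placeholders I'm using for illustration, not necessarily what ends up in the PR):

```c
// Hypothetical flag for marking gradient tensors; the value is arbitrary and
// only meant to avoid clashing with the existing GGML_TENSOR_FLAG_* values.
#define GGML_TENSOR_FLAG_GRAD (1 << 7)

// Proposed rule for deciding whether a freshly created tensor gets gradients.
static bool ggml_should_create_grad(const struct ggml_tensor * t) {
    bool src_has_grad = false;

    for (int i = 0; i < GGML_MAX_SRC; ++i) {
        if (t->src[i] == NULL) {
            continue;
        }
        if (t->src[i]->flags & GGML_TENSOR_FLAG_GRAD) {
            // at least one source is itself a gradient -> we are inside the
            // backward pass, never attach gradients to this tensor
            return false;
        }
        src_has_grad = src_has_grad || t->src[i]->grad != NULL;
    }

    // otherwise fall back to the logic on master:
    // parameters and tensors with at least one grad-carrying source get grads
    return (t->flags & GGML_TENSOR_FLAG_PARAM) || src_has_grad;
}
```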

Unfortunately the logic on master is currently duplicated for each GGML op, so the above change requires touching a large number of lines in ggml.c. I wrote a small utility function ggml_set_grad that can be applied after tensor creation to add gradients, since the logic should be the same regardless of the specific GGML op. This function also asserts that the operation is not in-place, since that case is currently not handled correctly (on master the combination of in-place operations and gradients sometimes causes a failed assert and sometimes just silently discards the gradients). Note that even without the changes in this PR a function like ggml_set_grad will likely become necessary in the future anyway for specifying different data types for gradients and weights.
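
Roughly what I have in mind for the helper (sketch only, the exact signature and assert in the PR may differ):

```c
// Sketch of a ggml_set_grad-style helper applied after tensor creation.
// Uses the hypothetical GGML_TENSOR_FLAG_GRAD from the sketch above.
static void ggml_set_grad_sketch(struct ggml_context * ctx, struct ggml_tensor * tensor) {
    // in-place ops share data with their source; combining them with gradients
    // is not handled correctly on master, so reject them outright
    GGML_ASSERT(tensor->view_src == NULL);

    if (tensor->grad == NULL) {
        tensor->grad = ggml_dup_tensor(ctx, tensor);
        tensor->grad->flags |= GGML_TENSOR_FLAG_GRAD;
    }
}
```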

While going through the code I also fixed the formatting as best I could.

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Sep 23, 2024

I forgot: there is a similar issue when the original gradient tensors are replaced during backward graph construction when not using gradient accumulation. The original gradient tensors with GGML_OP_NONE are replaced with tensors that actually calculate the gradients, but the original gradient tensors are never freed and thus waste memory (unless you're using ggml_backend_sched). This is not addressed in this PR.

@slaren
Collaborator

slaren commented Sep 23, 2024

Would it be simpler to add a flag to ggml_context to skip gradients?

@JohannesGaessler
Collaborator Author

Do you mean skip their creation or skip their allocation?

@slaren
Collaborator

slaren commented Sep 23, 2024

The creation. It would be a flag such as no_grad that ops would check before creating the grad tensor. During backward expansion it would be set automatically.
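
Something like this (the no_grad member is hypothetical, just to illustrate the idea):

```c
// Hypothetical no_grad member on ggml_context, toggled around backward expansion.
ctx->no_grad = true;    // set automatically when backward expansion starts
// ... ops created here see the flag and skip creating tensor->grad ...
ctx->no_grad = false;   // cleared again once the backward graph is built

// inside each op the existing check would then become something like:
result->grad = (is_node && !ctx->no_grad) ? ggml_dup_tensor(ctx, result) : NULL;
```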

@JohannesGaessler
Collaborator Author

That would work for eliminating the need for a tensor flag, but it would still require a change in the function for each GGML op, and I personally think it would be preferable not to add state to ggml_context if it can be avoided. Currently the creation of tensors is fully determined by the inputs to the corresponding functions (except for resource allocation), and I think that is a design that is easier to debug than one where the results also depend on the internal state of ggml_context.

@slaren
Collaborator

slaren commented Sep 23, 2024

Generally I would agree that it is preferable to have pure functions with no state, but this would be a fairly simple piece of state. I have some issues with this approach:

  • It will add overhead even during inference, which is something we need to reduce
  • "To ignore tensors for gradients add them after calling this function" seems like a hack, and it will prevent refactoring in the future (eg. I would like to make the sources a parameter to ggml_new_tensor, but this will make that harder)
  • The logic that "if any source is a gradient the newly created tensor must also be a gradient" is not very obvious to me

I think that adding a no_grad state would be simpler overall, but maybe there are better options.

@JohannesGaessler
Collaborator Author

How about this: remove the gradient logic from the forward pass construction completely and replace it with a pass over the forward graph in ggml_build_backward. That way gradients will not be added to gradients, there is no need for an internal ggml_context state, and inference should become faster because it would eliminate the gradient checks on master. The downside would be that the code for determining which tensors should get gradients would no longer be in the same place as the code that defines the tensor parameterization.
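
Roughly, ggml_build_backward would do something along these lines before expanding the graph (illustrative sketch, not the eventual implementation):

```c
// Single pass over the finished forward graph `gf`: attach gradients to
// parameters and to tensors that depend on a tensor with gradients.
// The nodes are in topological order, so sources are visited before their uses.
for (int i = 0; i < gf->n_nodes; ++i) {
    struct ggml_tensor * node = gf->nodes[i];

    bool needs_grad = (node->flags & GGML_TENSOR_FLAG_PARAM) != 0;
    for (int j = 0; j < GGML_MAX_SRC && !needs_grad; ++j) {
        needs_grad = node->src[j] != NULL && node->src[j]->grad != NULL;
    }

    if (needs_grad && node->grad == NULL) {
        node->grad = ggml_dup_tensor(ctx, node);
    }
}
```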

@slaren
Collaborator

slaren commented Sep 23, 2024

That sounds good to me. I am assuming that not too many operations will require specific handling (to exclude some of their parameters, I imagine), but either way that could be refactored in the future into objects (or types) that hold all the details of an operation.
