Questions about defining a subnet by disabling gradients #217
```python
if self.subnetwork_indices is not None:
    Ji = Ji[:, self.subnetwork_indices]
```
It computes the full Jacobian (very large!) and just slices it with the subnet mask.
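To make the memory point concrete, here's a small illustrative sketch (the shapes and indices are made up): slicing keeps only the selected columns, but the full Jacobian must still be materialized first.

```python
import torch

n_outputs, n_params = 3, 10
Ji = torch.randn(n_outputs, n_params)  # stand-in for the full per-input Jacobian

# Keep only the columns belonging to the subnet.
subnetwork_indices = torch.tensor([0, 4, 7])
Ji_sub = Ji[:, subnetwork_indices]  # shape (n_outputs, len(subnetwork_indices))
```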
Of course, for more sophisticated subnet selection, `SubnetLaplace` is still desirable due to its many helper functions, see https://github.com/aleximmer/Laplace/blob/main/laplace/utils/subnetmask.py. But this is quite orthogonal to the implementation of `SubnetLaplace` itself, i.e. one can implement it by taking a subnet mask and switching off the grads of the params not in the mask.
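A minimal sketch of that idea (the model and the choice of "subnet = last layer" are just assumptions for illustration, since `requires_grad` is a per-tensor flag):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.ReLU(), torch.nn.Linear(20, 3)
)

# Treat only the last layer as the subnet: switch off grads everywhere else.
subnet_modules = {model[2]}
for module in model:
    in_subnet = module in subnet_modules
    for p in module.parameters():
        p.requires_grad = in_subnet
```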
Last-layer Laplace
For last-layer Laplace, it's still preferable to use `LLLaplace` since it's highly optimized. E.g. the Jacobian is computed in a special way, unlike in `SubnetLaplace`. The example bit you referred to is just for intuition purposes :)
Sampling
The `sample` method of Laplace takes the disabled gradients into account. First, the parameters held by Laplace, and hence `self.n_params`, are just those with `requires_grad = True`:
Laplace/laplace/baselaplace.py
Lines 114 to 123 in 553cf7c
```python
# Only do Laplace on params that require grad
self.params: list[torch.Tensor] = []
self.is_subset_params: bool = False
for p in model.parameters():
    if p.requires_grad:
        self.params.append(p)
    else:
        self.is_subset_params = True

self.n_params: int = sum(p.numel() for p in self.params)
```
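For intuition, here's how that filtering plays out on a small stand-in model (a sketch, not the library's own code):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.ReLU(), torch.nn.Linear(20, 3)
)
# Freeze the first layer.
for p in model[0].parameters():
    p.requires_grad = False

# Same filtering as in the snippet above.
params = [p for p in model.parameters() if p.requires_grad]
n_params = sum(p.numel() for p in params)
# Only Linear(20, 3) remains: 20 * 3 weights + 3 biases = 63 params
```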
Then, in `self.sample`, Laplace generates samples for those `self.n_params` only, e.g.:
Laplace/laplace/baselaplace.py
Lines 1495 to 1503 in 553cf7c
```python
def sample(
    self, n_samples: int = 100, generator: torch.Generator | None = None
) -> torch.Tensor:
    samples = torch.randn(
        n_samples, self.n_params, device=self._device, generator=generator
    )
    # (n_samples, n_params) x (n_params, n_params) -> (n_samples, n_params)
    samples = samples @ self.posterior_scale
    return self.mean.reshape(1, self.n_params) + samples
```
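The shapes work out as follows; this toy sketch uses a made-up diagonal `posterior_scale` in place of the actual posterior Cholesky factor:

```python
import torch

n_params, n_samples = 5, 100
mean = torch.zeros(n_params)
posterior_scale = 0.1 * torch.eye(n_params)  # stand-in for the real scale factor

eps = torch.randn(n_samples, n_params)
# (n_samples, n_params) x (n_params, n_params) -> (n_samples, n_params)
samples = mean.reshape(1, n_params) + eps @ posterior_scale
```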
Then, in `self._nn_predictive_samples`, Laplace simply does
Laplace/laplace/baselaplace.py
Line 1168 in 553cf7c
```python
vector_to_parameters(sample, self.params)
```
Note that `self.params` is a reference to the subset of `model.parameters()`. Calling the above is equivalent to updating that subset of params with the sampled params.
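A tiny sketch of that in-place update (the model and values are hypothetical):

```python
import torch
from torch.nn.utils import vector_to_parameters

model = torch.nn.Linear(2, 3)
subset = [model.bias]  # plays the role of self.params: a subset of model.parameters()

# Writing the sampled vector into `subset` updates model.bias in place;
# model.weight is left untouched.
sample = torch.tensor([1.0, 2.0, 3.0])
vector_to_parameters(sample, subset)
```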
I might have missed some of your questions; please just repeat them below in that case, or if you have any follow-up questions.
Thanks for the detailed answer, that's highly appreciated. The explanation of last layer by disabling grads is very helpful. I wasn't aware of the fact that
That's a good point. However, it looks as if this doesn't work for helpers that operate on individual weights across param tensors, for instance:

```python
import torch as T
from laplace.utils import subnetmask as su

model = T.nn.Sequential(T.nn.Linear(2, 20), T.nn.ReLU(), T.nn.Linear(20, 3))
params = T.nn.utils.parameters_to_vector(model.parameters())
subnetmask = su.LargestMagnitudeSubnetMask(
    model=model, n_params_subnet=int(len(params) * 0.8)
)
fixed_mask = T.ones(len(params), dtype=bool)
fixed_mask[subnetmask.select()] = False

# RuntimeError: you can only change requires_grad flags of leaf variables.
params[fixed_mask].requires_grad = False
```

This is because one can, from my understanding, only disable grads on a tensor level and not for single entries. There is probably an obvious solution, but at the moment I can't think of any. I'd appreciate any hints here. Thanks.
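Not an answer from the library itself, but one common workaround for per-entry freezing is to mask gradients with a tensor hook instead of toggling `requires_grad`. Note this only keeps entries fixed during optimization; it does not change `requires_grad`, so the per-tensor check in Laplace would still treat the whole tensor as part of the subnet. A sketch:

```python
import torch

w = torch.nn.Parameter(torch.randn(4))
mask = torch.tensor([1.0, 0.0, 1.0, 0.0])  # 1 = entry belongs to the subnet

# Zero the gradient of fixed entries after every backward pass.
w.register_hook(lambda grad: grad * mask)

loss = (w ** 2).sum()
loss.backward()
# w.grad equals 2 * w * mask, so fixed entries receive zero gradient
```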
Another question I wanted to ask again is: Since
Good point on disabling grad on the "tensor" level (or more accurately, on the
I haven't tested this in-depth, but my hunch is that last-layer Laplace via disabling grads is more universally applicable than
However, again, there are some trade-offs here. For instance,
Very valuable feedback again, thanks a bunch. So to summarize
Is this about right? I feel that this list could go into the documentation, and I'm happy to add it somewhere. If you think this is useful, let me know and I'll keep this issue open until then. However, if things are being tested and in flux at the moment, such that documenting them is not worth it, then I'll close this issue and keep it as a temporary reference. Thanks.
I agree that documentation would be good. What would be useful is to distinguish disable-grad &
The current documentation website with its single-page layout is not very good UX-wise. So if you want to document this discussion, feel free to do so in the README. I just rewrote my personal site & blog using Astro and had a very good experience. I might think of finding a way to migrate Laplace's documentation to Starlight soon. |
I have some questions regarding the method of defining a subnetwork by disabling grads, termed here "subnet-by-disable-grads".
Please excuse the long-ish text below. If there is another place where questions rather than bug reports should be discussed, please let me know (e.g. github discussions).
From #213:
I guess you are referring to this part of the docs and these tests. I assumed that this is just another way of selecting a subset of weights, but it seems that is not the case.
1. When using `SubnetLaplace`, I do see a speedup when doing `la.fit(...)` and in hyperparameter optimization. So I assume the calculation of the full Jacobian you refer to happens during calls to `_glm_predictive_distribution()`, where one won't save compute and memory?
2. Is this also true for `subset_of_weights="last_layer"`, i.e. `LLLaplace`? I guess not, since the code looks as if it cuts out the last layer first and then operates only on that. If so, then what would be the benefit of using "subnet-by-disable-grads" over `LLLaplace`?
3. There is `laplace.utils.subnetmask.LastLayerSubnetMask`, but I guess that is only for testing things, given that there are two other methods for defining a last-layer subnet (`LLLaplace` and subnet-by-disable-grads).
4. The examples for "subnet-by-disable-grads" I have seen so far seem to focus on the last layer, or more generally on disabling grads layer-wise. Since one cannot use any of the `Largest*SubnetMask` or `RandomSubnetMask` helpers to disable individual weights' grads within a tensor via a `parameters_to_vector()` + `vector_to_parameters()` round trip, the method seems to be limited to `last_layer`-type subnet selection. Is this correct?
5. The test in #216 checks that, in case of using `SubnetLaplace`, only non-fixed parameters vary across samples. I think this behavior is unique to `SubnetLaplace`-based selections, since the `sample()` method of `LLLaplace` and its subclasses returns only samples of the last-layer weights, which is corrected for in `_nn_predictive_samples()` as far as I can see. I wonder if `sample()` in case of subnet-by-disable-grads is aware of disabled grads, since all methods which are not in subclasses of `SubnetLaplace` or `LLLaplace` seem to generate "just" vectors of length `n_params`, such that parameter samples would also vary for fixed parameters.

Thanks.