
Questions about defining a subnet by disabling gradients #217

Closed
elcorto opened this issue Aug 5, 2024 · 7 comments · Fixed by #218

Comments

@elcorto
Contributor

elcorto commented Aug 5, 2024

I have some questions regarding the method of defining a subnetwork by disabling grads, termed here "subnet-by-disable-grads".
Please excuse the long-ish text below. If there is another place where questions rather than bug reports should be discussed (e.g. GitHub Discussions), please let me know.

From #213:

I have to admit that we don't pay too much attention to the SubnetLaplace implementation anymore since the same can be done more intuitively by switching off the unwanted subset's grads. The benefit of the latter is that the Jacobian for the linearized predictive will only be computed over the selected subset. Meanwhile, the former computes the full Jacobian first and then slices it using the subnet indices.

I guess you are referring to this part of the docs and these tests. I assumed that this is just another way of selecting a subset of weights, but it seems that is not the case.

When using SubnetLaplace, I do see a speedup when doing la.fit(...) and in hyperparameter optimization. So I assume the calculation of the full Jacobian you refer to happens during calls to _glm_predictive_distribution(), where one therefore doesn't save compute and memory?

Is this also true for subset_of_weights="last_layer", i.e. LLLaplace? I guess not since the code looks as if it cuts out the last layer first and then operates only on that. If so, then what would be the benefit of using "subnet-by-disable-grads" over LLLaplace?

There is laplace.utils.subnetmask.LastLayerSubnetMask, but I guess that is only for testing things given that there are two other methods for defining a last-layer subnet (LLLaplace and subnet-by-disable-grads).

The examples for "subnet-by-disable-grads" I have seen so far seem to focus on the last layer, or, more generally, on disabling grads layer-wise. Since one cannot use any of the Largest*SubnetMask or RandomSubnetMask helpers to disable individual weights' grads in a tensor via a parameters_to_vector() + vector_to_parameters() round trip, the method seems to be limited to last_layer-type subnet selection. Is this correct?

The test in #216 checks that, when using SubnetLaplace, only non-fixed parameters vary across samples. I think this behavior is unique to SubnetLaplace-based selections, since the sample() method of LLLaplace and its subclasses returns only samples of the last-layer weights, which is corrected for in _nn_predictive_samples() as far as I can see. I wonder whether sample() in the subnet-by-disable-grads case is aware of disabled grads, since all methods which are not in subclasses of SubnetLaplace or LLLaplace seem to generate "just" vectors of length n_params, such that parameter samples would also vary for fixed parameters.

Thanks.

@wiseodd
Collaborator

wiseodd commented Aug 5, 2024

SubnetLaplace vs disabling grads

The way I see it, disabling gradients is another way to implement subnet Laplace. You can just switch off the grads of the parameters you don't want (instead of providing a subnet mask), and the backend will automatically compute the Hessian and Jacobian only for the parameters which have requires_grad = True. This is why Laplace is applicable to LLMs at all (switch off all grads other than the LoRA params' and do Laplace as usual), since it's basically emulating SubnetLaplace.
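
To make this concrete, here is a minimal sketch of the disable-grads route, assuming a small toy model and an existing regression train_loader (not shown); it follows the pattern from the docs but is not meant as the one canonical recipe:

import torch
from laplace import Laplace

model = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.ReLU(), torch.nn.Linear(20, 3)
)

# Switch off grads for everything except the last layer's parameters.
for p in model.parameters():
    p.requires_grad = False
for p in model[-1].parameters():
    p.requires_grad = True

# The backend now computes the Hessian/Jacobian only over the last layer.
la = Laplace(model, "regression", subset_of_weights="all", hessian_structure="full")
# la.fit(train_loader)  # train_loader assumed to exist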

The main problem of the current implementation of SubnetLaplace is that the Jacobian computation for the GLM predictive is done as I said in #213:

if self.subnetwork_indices is not None:
    Ji = Ji[:, self.subnetwork_indices]

It computes the full Jacobian (very large!) and then just slices it with the subnet mask.

Of course, for more sophisticated subnet selection, SubnetLaplace is still desirable due to the existence of many helper functions, see https://github.com/aleximmer/Laplace/blob/main/laplace/utils/subnetmask.py
But this is quite orthogonal to the implementation of SubnetLaplace itself, i.e. one can implement it by taking a subnet mask and switching off the grads of the params not in the mask.
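
As a sketch of that idea, assuming the mask happens to align with whole parameter tensors and reusing the toy model from the sketch above (the keep set of parameter names is hypothetical):

# Keep only the named parameters in the subnet; freeze everything else.
keep = {"2.weight", "2.bias"}
for name, p in model.named_parameters():
    p.requires_grad = name in keep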

Last-layer Laplace

For last-layer Laplace, it's still preferable to use LLLaplace since it's highly optimized, e.g. the Jacobian is computed in a special way, unlike in SubnetLaplace. The example bit you referred to is just for intuition purposes :)

Sampling

The sample method of Laplace takes into account the disabled gradients. First, the parameters held by Laplace and hence self.n_params are just those with requires_grad = True:

# Only do Laplace on params that require grad
self.params: list[torch.Tensor] = []
self.is_subset_params: bool = False
for p in model.parameters():
    if p.requires_grad:
        self.params.append(p)
    else:
        self.is_subset_params = True
self.n_params: int = sum(p.numel() for p in self.params)

Then, in self.sample, Laplace generates samples for those self.n_params only, e.g.:

Laplace/laplace/baselaplace.py, lines 1495 to 1503 at 553cf7c:

def sample(
    self, n_samples: int = 100, generator: torch.Generator | None = None
) -> torch.Tensor:
    samples = torch.randn(
        n_samples, self.n_params, device=self._device, generator=generator
    )
    # (n_samples, n_params) x (n_params, n_params) -> (n_samples, n_params)
    samples = samples @ self.posterior_scale
    return self.mean.reshape(1, self.n_params) + samples

Then in self._nn_predictive_samples Laplace simply does

vector_to_parameters(sample, self.params)

Note that self.params holds references to that subset of model.parameters(). Calling the above is equivalent to updating that subset of params in place with the sampled values.
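
A small self-contained illustration of that last point (a toy model again, not the library internals verbatim):

import torch
from torch.nn.utils import vector_to_parameters

model = torch.nn.Sequential(
    torch.nn.Linear(2, 20), torch.nn.ReLU(), torch.nn.Linear(20, 3)
)
for p in model[0].parameters():
    p.requires_grad = False  # freeze the first layer

# References to the active (requires_grad=True) parameters only.
active = [p for p in model.parameters() if p.requires_grad]
n_active = sum(p.numel() for p in active)

# Writing a sampled vector back modifies exactly those tensors in place;
# the frozen first-layer parameters stay untouched.
sample = torch.randn(n_active)
vector_to_parameters(sample, active)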

@wiseodd
Collaborator

wiseodd commented Aug 5, 2024

I might have missed some of your questions. In that case, please just repeat them below, along with any follow-up questions.

@elcorto
Contributor Author

elcorto commented Aug 6, 2024

Thanks for the detailed answer, that's highly appreciated.

The explanation of last layer by disabling grads is very helpful. I wasn't aware that self.params only contains the active params. Now it is clear why sample() produces vectors of active params only, as in the LLLaplace case. This is different from SubnetLaplace, which always produces samples of active + fixed params (which is what the test in #216 checks for).

SubnetLaplace vs disabling grads

[...]

Of course, for more sophisticated subnet selection, SubnetLaplace is still desirable due to the existence of many helper functions, see https://github.com/aleximmer/Laplace/blob/main/laplace/utils/subnetmask.py But this is quite orthogonal to the implementation to SubnetLaplace itself, i.e. one can implement it by taking a subnet mask and switching off the grad of the params not in the mask.

That's a good point. However, it looks as if this doesn't work for helpers that operate on individual weights across param tensors, for instance:

import torch as T
from laplace.utils import subnetmask as su

model = T.nn.Sequential(T.nn.Linear(2, 20), T.nn.ReLU(), T.nn.Linear(20, 3))

params = T.nn.utils.parameters_to_vector(model.parameters())

# Select the 80% largest-magnitude weights; the rest should be fixed.
subnetmask = su.LargestMagnitudeSubnetMask(
    model=model, n_params_subnet=int(len(params) * 0.8)
)
fixed_mask = T.ones(len(params), dtype=T.bool)
fixed_mask[subnetmask.select()] = False

# RuntimeError: you can only change requires_grad flags of leaf variables.
params[fixed_mask].requires_grad = False

This is because one can, from my understanding, only disable grads on a tensor level and not for single entries.

There is probably an obvious solution, but at the moment I can't think of any. I'd appreciate any hints here. Thanks.

@elcorto
Contributor Author

elcorto commented Aug 8, 2024

Another question I wanted to ask again is: Since LLLaplace and disabling all but the last layer's grads seem to do the same in effect, which of the methods would you recommend?

@wiseodd
Collaborator

wiseodd commented Aug 8, 2024

Good point on disabling grad on the "tensor" level (or more accurately, on the torch.nn.Parameter level). In this case, SubnetLaplace is more flexible.
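
For completeness, a sketch of the per-weight route via SubnetLaplace, following the documented pattern (model and train_loader assumed to exist):

from laplace import Laplace
from laplace.utils import LargestMagnitudeSubnetMask

# Per-weight selection: pick the 128 largest-magnitude weights.
subnetwork_mask = LargestMagnitudeSubnetMask(model, n_params_subnet=128)
subnetwork_indices = subnetwork_mask.select()

la = Laplace(
    model,
    "regression",
    subset_of_weights="subnetwork",
    hessian_structure="full",
    subnetwork_indices=subnetwork_indices,
)
# la.fit(train_loader)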

Since LLLaplace and disabling all but the last layer's grads seem to do the same in effect, which of the methods would you recommend?

I haven't tested this in depth, but my hunch is that last-layer Laplace via disabling grads is more universally applicable than LLLaplace. Notice that in the implementation of LLLaplace we have to do many extra steps like creating a FeatureExtractor, inferring the last layer, dealing with feature reduction, etc. When done by disabling grads, one doesn't have to worry about any of that.

However, again, there are some trade-offs here. For instance, LLLaplace enables last-layer Laplace specific tricks such as in the GLM predictive #145
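
(For reference, the LLLaplace route is just the usual subset_of_weights="last_layer" call, e.g. the following sketch with model and train_loader assumed to exist:)

from laplace import Laplace

la = Laplace(
    model, "regression", subset_of_weights="last_layer", hessian_structure="kron"
)
# la.fit(train_loader)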

@elcorto
Contributor Author

elcorto commented Aug 9, 2024

Very valuable feedback again, thanks a bunch. So, to summarize:

  • subnet selection by disabling grads is more efficient than SubnetLaplace since it avoids calculating full Jacobians
  • for last layer selection, there are two other methods, namely LLLaplace and laplace.utils.subnetmask.LastLayerSubnetMask (the latter is probably only for testing purposes)
  • disabling grads on Parameter level doesn't cover all cases that SubnetLaplace offers such as Largest*SubnetMask or RandomSubnetMask
  • depending on the selection method (disable grads, SubnetLaplace, LLLaplace), sample() returns vectors of different lengths, but this is always corrected for in _nn_predictive_samples()
  • LLLaplace offers improved performance (Add fast computation of functional_variance for DiagLLLaplace and KronLLLaplace #145)

Is this about right?

I feel that this list could go into the documentation. I'm happy to add this somewhere. If you think this is useful, let me know and I'll keep this issue open until then. However, if things are being tested and in flux at the moment, such that documenting them is not worth it yet, I'll close this issue and keep it as a temporary reference. Thanks.

@wiseodd
Collaborator

wiseodd commented Aug 9, 2024

I agree that documentation would be good. What would be useful is to distinguish disable-grad, SubnetLaplace, and LLLaplace by application. For example:

  • Disable-grad: Laplace on specific types of layers/parameters, e.g. in LLMs with LoRA (see the sketch after this list).
  • SubnetLaplace, LLLaplace: If more fine-grained partitioning is desired, as you have written above.
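
A sketch of the LoRA case from the first bullet; the "lora" substring check is only an assumption about how the adapter parameters are named and depends on the LoRA implementation in use:

# Switch off grads for everything except LoRA adapter parameters,
# then run Laplace as usual over the remaining parameters.
for name, p in model.named_parameters():
    p.requires_grad = "lora" in name.lower()

# la = Laplace(model, "classification", subset_of_weights="all",
#              hessian_structure="diag")
# la.fit(train_loader)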

The current documentation website with its single-page layout is not very good UX-wise. So if you want to document this discussion, feel free to do so in the README.

I just rewrote my personal site & blog using Astro and had a very good experience. I might look into migrating Laplace's documentation to Starlight soon.
