
Failed to compute eigendecomposition #13

Open
aykamko opened this issue Nov 3, 2023 · 3 comments

@aykamko

aykamko commented Nov 3, 2023

We're seeing this error message about 5 minutes into training.

WARNING:distributed_shampoo.utils.matrix_functions:Failed to compute eigendecomposition in torch.float32 precision with exception linalg.eigh: The algorithm failed to converge because the input matrix is ill-conditioned or has too many repeated eigenvalues (error code: 1).! Retrying in double precision...

Any ideas how we can fix this / avoid this?

@hjmshi
Contributor

hjmshi commented Nov 6, 2023

Hi @aykamko, thanks for your question! This is a warning about the eigendecomposition solver failing in lower precision. Are you seeing a subsequent error after this?

This could be due to multiple reasons:

  1. If the learning rate is set too large, inf or nan values can get inserted into the preconditioner matrix. In this case, we would not expect increasing the precision (which happens after this warning) to resolve the issue - you'll likely begin to see nan values in the loss after the matrix root inverse is computed.
  2. Alternatively, we have found that the eigendecomposition solver can be unstable, especially for some low-rank matrices. One approach to avoid this is to set a larger start_preconditioning_step, which ensures that the matrix is better behaved before the eigh solver is applied (see the sketch after this list).
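As a rough illustration of option 2, here is a minimal sketch assuming the optimizer is built with the DistributedShampoo class; the import path is inferred from the warning's module prefix, and every value shown is a placeholder rather than a recommendation:

    from distributed_shampoo import DistributedShampoo  # assumed import path

    # Delay the first preconditioner root inverse so the accumulated statistics
    # are less likely to be low-rank or ill-conditioned when eigh first runs.
    optimizer = DistributedShampoo(
        model.parameters(),              # `model` is a placeholder
        lr=1e-4,                         # illustrative hyperparameters
        precondition_frequency=100,
        start_preconditioning_step=300,  # placeholder value; larger than the default, tune for your run
    )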

It would be helpful if you could provide your current Shampoo configuration as well as the optimizer configuration you were using previously for your model. We can also help with setting appropriate hyperparameters for your case.

@aykamko
Author

aykamko commented Nov 11, 2023

Thanks for the response!

Previous config used AdamW:

lr = 1e-4
betas = (0.9, 0.999)
eps = 1e-8
weight_decay = 1e-2
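For reference, those settings correspond to a standard torch.optim.AdamW call (sketch only; `model` is a placeholder):

    import torch

    # Previous baseline optimizer with the settings listed above.
    optimizer = torch.optim.AdamW(
        model.parameters(),  # `model` is a placeholder
        lr=1e-4,
        betas=(0.9, 0.999),
        eps=1e-8,
        weight_decay=1e-2,
    )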

Current Shampoo config:

        lr: 1e-4
        betas: [0.9, 0.999]
        epsilon: 1e-12
        weight_decay: 1e-02
        max_preconditioner_dim: 8192
        precondition_frequency: 100
        use_decoupled_weight_decay: True
        grafting_type: 4  # GraftingType.ADAM
        grafting_epsilon: 1e-08
        grafting_beta2: 0.999

In the meantime, I'll try to set a larger start_preconditioning_step.
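For concreteness, here is a sketch of what that change could look like with the config above. The keyword names and the GraftingType import are assumptions based on the keys listed, and the start_preconditioning_step value is just a placeholder:

    from distributed_shampoo import DistributedShampoo, GraftingType  # assumed imports

    # Current Shampoo config from above, plus the suggested start_preconditioning_step.
    optimizer = DistributedShampoo(
        model.parameters(),                # `model` is a placeholder
        lr=1e-4,
        betas=(0.9, 0.999),
        epsilon=1e-12,
        weight_decay=1e-2,
        max_preconditioner_dim=8192,
        precondition_frequency=100,
        use_decoupled_weight_decay=True,
        grafting_type=GraftingType.ADAM,   # enum value 4 in the config above
        grafting_epsilon=1e-8,
        grafting_beta2=0.999,
        start_preconditioning_step=300,    # placeholder; larger than precondition_frequency
    )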

I also saw this warning in your README:

Note: We have observed known instabilities with the torch.linalg.eigh operator on CUDA 11.6-12.1, specifically for low-rank matrices, which may appear when using a small start_preconditioning_step. Please avoid these versions of CUDA if possible. See: pytorch/pytorch#94772.

We have the CUDA 12.2 driver installed, but our PyTorch build targets CUDA 12.1 (downloaded from pip). Could that be the issue?

@hjmshi
Contributor

hjmshi commented Nov 15, 2023

@aykamko, the settings look right here. Let's see what happens with a larger start_preconditioning_step. 😊
