[RLlib] - Fix numerical overflow in gradient clipping for (many) large gradients #45055
Conversation
…ring values or summing many large values results in +/- infinity. Because we clip by multiplying with a clipping coefficient instead of overriding values inside the gradient tensors, this modification allows clipping very large gradients. Signed-off-by: Simon Zehnder <[email protected]>
@@ -94,6 +95,29 @@ def test_copy_torch_tensors(self):
                all(copied_tensor.detach().numpy() == tensor.detach().cpu().numpy())
            )

    def test_large_gradients_clipping(self):
Nice! Thanks for creating this important test case!
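The body of the new test is collapsed in this view. Purely as an illustration (not the actual RLlib test), here is a minimal sketch of what a large-gradient clipping test could check, assuming non-finite norms are replaced and the clipping coefficient is capped at 1.0 as discussed in this PR:

```python
import torch


def test_large_gradients_clipping_sketch():
    # Hypothetical stand-in for the collapsed test body: gradient values so
    # large that a naive float32 l2-norm overflows to +inf.
    grad = torch.full((1000,), 1e20, dtype=torch.float32)
    naive_norm = (grad * grad).sum().sqrt()
    assert torch.isinf(naive_norm)  # 1e20 ** 2 exceeds the float32 maximum

    # Replace the non-finite norm and keep the clipping coefficient finite
    # and at most 1.0 (details assumed for this sketch).
    grad_clip = 40.0
    total_norm = torch.nan_to_num(naive_norm, posinf=1e8, neginf=-1e8)
    clip_coef = grad_clip / torch.maximum(
        torch.tensor(grad_clip, device=total_norm.device), total_norm + 1e-6
    )

    # Clipping happens by multiplication, so the result stays finite.
    clipped = grad * clip_coef
    assert torch.isfinite(clipped).all()
```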
rllib/utils/torch_utils.py
Outdated
if torch.logical_or(total_norm.isnan(), total_norm.isinf()):
    raise RuntimeError(
        f"The total norm of order {norm_type} for gradients from "
        "`parameters` is non-finite, so it cannot be clipped. "
    )
clip_coef = grad_clip / (total_norm + 1e-6)
clip_coef = grad_clip / torch.maximum(
    torch.tensor(grad_clip), total_norm + 1e-6
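For context, a minimal, self-contained sketch of this coefficient-based clipping (function name and structure are illustrative assumptions, not the exact RLlib implementation):

```python
from typing import List

import torch


def clip_grads_by_global_norm_sketch(
    grads: List[torch.Tensor], grad_clip: float
) -> torch.Tensor:
    """Illustrative global-norm clipping with a coefficient capped at 1.0."""
    # Global l2-norm across all gradient tensors.
    total_norm = torch.norm(
        torch.stack([torch.norm(g.detach(), 2.0) for g in grads]), 2.0
    )
    # The denominator is at least grad_clip, so the coefficient stays between
    # 0.0 and 1.0: small norms leave gradients (almost) untouched, large norms
    # scale them down to roughly grad_clip.
    clip_coef = grad_clip / torch.maximum(
        torch.tensor(grad_clip, device=total_norm.device), total_norm + 1e-6
    )
    # Clip by multiplying with the coefficient instead of overriding values.
    for g in grads:
        g.mul_(clip_coef.to(g.device))
    return total_norm
```

With grad_clip = 40.0 and a total norm of 4000.0, for instance, every gradient would be scaled by roughly 0.01.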
Two questions:
- Would this torch.tensor() pose a danger when on another device? GPU?
- Can we add a comment here (or enhance the one below) explaining why we compute the coeff like this? What are the expected final values of the coeff (between 0.0 and 1.0)?
Good catch! Yes, let's put the tensor on the device we have extracted before, to be safe. I will then add a note on why we want the coefficient to be no larger than 1.0.
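As an illustration of that change (a sketch; the exact line in the PR may differ), the scalar can be created on the same device and dtype as the norm:

```python
import torch

# Stand-in values; in the real code total_norm comes from the gradients and
# may live on a GPU.
total_norm = torch.tensor(5.0e9)
grad_clip = 40.0

# Create the scalar on total_norm's device/dtype so torch.maximum() never
# mixes CPU and GPU tensors.
clip_coef = grad_clip / torch.maximum(
    torch.tensor(grad_clip, device=total_norm.device, dtype=total_norm.dtype),
    total_norm + 1e-6,
)
print(clip_coef)  # tensor(8.0000e-09): gradients get scaled way down
```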
Looks great! Thanks for fixing this very important piece of code, @simonsays1980. Just 2 nits and questions before we can merge.
clip_coef = grad_clip / (total_norm + 1e-6)
# We do want the coefficient to be in between 0.0 and 1.0, therefore
# if the global_norm is smaller than the clip value, we use the clip value
# as normalization constant.
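A small numeric illustration of the two cases described in the comment above (the concrete numbers are arbitrary):

```python
import torch

grad_clip = 40.0

# Large global norm: the norm itself is the denominator, so the coefficient
# drops well below 1.0 and the gradients shrink.
large_norm = torch.tensor(4000.0)
print(grad_clip / torch.maximum(torch.tensor(grad_clip), large_norm + 1e-6))  # ~0.01

# Small global norm: grad_clip becomes the denominator, so the coefficient is
# ~1.0 and the gradients are never scaled up.
small_norm = torch.tensor(5.0)
print(grad_clip / torch.maximum(torch.tensor(grad_clip), small_norm + 1e-6))  # ~1.0
```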
Perfect, thanks for making this comment much more clear!
Looks great, let's merge once tests pass ...
Why are these changes needed?
Large gradients, and many of them, can lead to numerical overflow when computing their l2-norm in torch_utils.clip_gradients (using the "global_norm"). This is counterproductive: a user wants to clip such gradients and instead runs into numerical overflow because of the clipping itself. This PR proposes small changes to turn inf and neginf values returned from the norms into 10e8 and -10e8, respectively. This does not harm the gradients themselves (even if they were, for example, already inf/neginf), because we clip gradients by multiplication and do not override values.

Related issue number
Checks
- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.