[train] Fix regression where large Trainer attributes get serialized along with actor class #43234

justinvyu · 2024-02-16T19:56:17Z

Why are these changes needed?

#43146 re-introduced a problem that was originally fixed by #28826.

The problem:

When there are large objects captured in the scope of a Trainer's arguments, the Trainer instance on the driver needs to be sure not to package itself (self) when registering the actor class that will be run remotely.
Otherwise, self gets packaged along with the actor class during serialization, and so do these large objects.
Because Ray Train relies on creating a Ray Tune Trainable under the hood to schedule the remote Trainer instance (which serves as the dist. training coordinator), and because Ray Tune uses the Ray GCS redis key value store to save a shared copy of the serialized actor class, a huge actor class causes a gRPC error such as the one below:

E               ray.tune.error.TuneError: The Trainable/training function is too large for grpc resource limit. Check that its definition is not implicitly capturing a large array or other object in scope. Tip: use tune.with_parameters() to put large objects in the Ray object store. 
E               Original exception: Traceback (most recent call last):
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/experiment/experiment.py", line 149, in __init__
E                   self._run_identifier = Experiment.register_if_needed(run)
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/experiment/experiment.py", line 350, in register_if_needed
E                   register_trainable(name, run_object)
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/registry.py", line 112, in register_trainable
E                   _global_registry.register(TRAINABLE_CLASS, name, trainable)
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/registry.py", line 239, in register
E                   self.flush_values()
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/tune/registry.py", line 277, in flush_values
E                   _internal_kv_put(
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/_private/client_mode_hook.py", line 103, in wrapper
E                   return func(*args, **kwargs)
E                 File "/Users/justin/Developer/justinvyu-dev/python/ray/experimental/internal_kv.py", line 94, in _internal_kv_put
E                   return global_gcs_client.internal_kv_put(key, value, overwrite, namespace) == 0
E                 File "python/ray/_raylet.pyx", line 2614, in ray._raylet._auto_reconnect.wrapper
E                 File "python/ray/_raylet.pyx", line 2591, in ray._raylet._auto_reconnect.wrapper
E                 File "python/ray/_raylet.pyx", line 2728, in ray._raylet.GcsClient.internal_kv_put
E                 File "python/ray/_raylet.pyx", line 579, in ray._raylet.check_status
E               ray.exceptions.RpcError: Sent message larger than max (800008883 vs. 536870912)

The fix is to remove the reference to self in the train coordinator function trainable. This PR also restructures the code a bit to make it harder to make such a mistake in the future.

Future considerations

This redis key value store logic in Tune is probably not necessary and can be re-worked to remove this failure case.

Related issue number

Closes #43191
Closes #43192
Closes #43193
Closes #43194
Closes #43195

Checks

I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: Justin Yu <[email protected]>

woshiyyya

LGTM! One question: What's the change that caused the regression?

woshiyyya · 2024-02-16T21:08:32Z

python/ray/train/base_trainer.py

-
-            # Evaluate datasets if they are wrapped in a factory.
-            trainer.datasets = {
-                k: d() if callable(d) else d for k, d in self.datasets.items()


Is this the self reference that caused serialization of Huge BaseTrainer object?

Yep that's it!

Signed-off-by: Justin Yu <[email protected]>

…trainer_trainable_size_regression

Signed-off-by: Justin Yu <[email protected]>

justinvyu · 2024-02-16T22:12:23Z

Running release test to verify fix: https://buildkite.com/ray-project/release/builds/8691

matthewdeng

via Studios 2016 on GIPHY

justinvyu added 3 commits February 16, 2024 11:28

FIX: don't pickle self along with train trainable

3361d4e

Signed-off-by: Justin Yu <[email protected]>

add test

ab4e828

Signed-off-by: Justin Yu <[email protected]>

fix lint

fb1a3c3

Signed-off-by: Justin Yu <[email protected]>

justinvyu requested review from matthewdeng, scottjlee and woshiyyya February 16, 2024 19:56

justinvyu assigned matthewdeng and woshiyyya Feb 16, 2024

woshiyyya reviewed Feb 16, 2024

View reviewed changes

justinvyu added 4 commits February 16, 2024 13:20

fix annotation lint error

fd75ea3

Signed-off-by: Justin Yu <[email protected]>

Merge branch 'master' of https://github.com/ray-project/ray into fix_…

e6f8f5e

…trainer_trainable_size_regression

fix lint

d73e35e

Signed-off-by: Justin Yu <[email protected]>

make sure the partial doesn't mess up the __name__

4cb6d36

Signed-off-by: Justin Yu <[email protected]>

matthewdeng approved these changes Feb 16, 2024

View reviewed changes

justinvyu merged commit e1f39ea into ray-project:master Feb 16, 2024
9 checks passed

justinvyu deleted the fix_trainer_trainable_size_regression branch February 16, 2024 23:40

ruisearch42 mentioned this pull request Jun 17, 2024

[Core] ray.exceptions.RpcError: Sent message larger than max #45884

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[train] Fix regression where large Trainer attributes get serialized along with actor class #43234

[train] Fix regression where large Trainer attributes get serialized along with actor class #43234

justinvyu commented Feb 16, 2024 •

edited

Loading

woshiyyya left a comment •

edited

Loading

woshiyyya Feb 16, 2024

justinvyu Feb 16, 2024

justinvyu commented Feb 16, 2024

matthewdeng left a comment

[train] Fix regression where large Trainer attributes get serialized along with actor class #43234

[train] Fix regression where large Trainer attributes get serialized along with actor class #43234

Conversation

justinvyu commented Feb 16, 2024 • edited Loading

Why are these changes needed?

Future considerations

Related issue number

Checks

woshiyyya left a comment • edited Loading

Choose a reason for hiding this comment

woshiyyya Feb 16, 2024

Choose a reason for hiding this comment

justinvyu Feb 16, 2024

Choose a reason for hiding this comment

justinvyu commented Feb 16, 2024

matthewdeng left a comment

Choose a reason for hiding this comment

justinvyu commented Feb 16, 2024 •

edited

Loading

woshiyyya left a comment •

edited

Loading