[Train] train_fashion_mnist_example fails with 1 worker #19506
Comments
The error information seems incomplete; here is the full output:
Hey @JanJF, thanks for letting us know about this issue! The reason for the behavior you're seeing is indeed that the [...]. Also, just a small heads-up: we've rebranded RaySGD to Ray Train, so you may notice some changes in the docs/package structure!
I can infer that the reason is that the process group is not initialized. But I'm confused about where the process group is initialized when num_workers=2.
This is handled within the Ray Train library code, in TorchBackend. This is done "implicitly", but if the end user needs additional configuration they can pass in a [...]
Note that this is only started if [...]
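For readers following along, the point above (the process group may not exist when only one worker runs) can be handled defensively in user code. The following is a minimal sketch, not the Ray Train implementation: it assumes the training function wraps its model in PyTorch's DistributedDataParallel, which this thread does not show, and the helper name maybe_wrap_ddp is hypothetical. It relies only on torch.distributed.is_available() and torch.distributed.is_initialized().

```python
def maybe_wrap_ddp(model):
    """Wrap `model` in DistributedDataParallel only if a process group exists.

    Hypothetical helper for illustration; not part of Ray Train or PyTorch.
    """
    try:
        # Imported lazily so the helper is a no-op where PyTorch is absent.
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel
    except ImportError:
        return model

    # With a single worker the process group may never have been set up,
    # and constructing DistributedDataParallel without one raises an error.
    if dist.is_available() and dist.is_initialized():
        return DistributedDataParallel(model)

    # No process group: train with the plain, unwrapped model.
    return model
```

Inside a training function this would be called as model = maybe_wrap_ddp(model) before building the optimizer, so the same script can run with one worker or several.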
Thanks! I tried several times and found that it is initialized when running the [...]
Here is my code if it helps.
Hey @JanJF, I created a new issue for the follow-up question. We can move the discussion there.
Search before asking
Ray Component
Others
What happened + What you expected to happen
I'm new to Ray and am working through the RaySGD example here:
RaySGD-examples-train_fashion_mnist_example
But with the default parameter "num-workers=1", I got an error:
Traceback (most recent call last):
  File "...ray_example/raysgd_train_mnist.py", line 173, in <module>
    train_fashion_mnist(num_workers=args.num_workers, use_gpu=args.use_gpu)
  File "...ray_example/raysgd_train_mnist.py", line 127, in train_fashion_mnist
    result = trainer.run(
  File ".../site-packages/ray/util/sgd/v2/trainer.py", line 240, in run
    for intermediate_result in iterator:
  File ".../site-packages/ray/util/sgd/v2/trainer.py", line 567, in __next__
    self._run_with_error_handling(
  File ".../site-packages/ray/util/sgd/v2/trainer.py", line 537, in _run_with_error_handling
    return func()
  File ".../site-packages/ray/util/sgd/v2/backends/backend.py", line 600, in finish_training
    results = self.get_with_failure_handling(futures)
  File ".../site-packages/ray/util/sgd/v2/backends/backend.py", line 619, in get_with_failure_handling
    success, failed_worker_indexes = check_for_failure(remote_values)
  File ".../site-packages/ray/util/sgd/v2/utils.py", line
But if I set --num-workers=2, it works fine.
Can someone help me understand how this works?
Versions / Dependencies
ray 1.7
Python 3.8
Win10 1909
Wsl 1
Reproduction script
Just the official example
Anything else
No response
Are you willing to submit a PR?