fix port conflict in parallel dygraph mode #28497

Merged: 2 commits merged on Nov 16, 2020


danleifeng
Contributor

PR types

Bug fixes

PR changes

Others

Describe

When running dygraph parallel code on a local machine, random failures occur. They may be caused by a port conflict when gloo and NCCL are initialized at the same time.
rank0 log (send id, connect failed):
image
rank3 log (receive id, bind failed):
image
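
To illustrate the "bind failed" symptom in isolation, here is a minimal sketch that uses only the Python standard library and no Paddle code. It shows that a second listener on a port that is still held fails with an address-in-use error. The helper name `try_bind` is made up for this example, and the sketch does not claim to reproduce the exact failure path inside gloo or NCCL.

```python
import errno
import socket


def try_bind(host, port):
    """Bind and listen on (host, port).

    Returns the listening socket, or None if the address is already in use.
    Any other OS error is re-raised.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


# Port 0 asks the OS for any free port, so the example does not
# depend on a particular port being available.
first = try_bind("127.0.0.1", 0)
port = first.getsockname()[1]

# A second component trying to listen on the same port while the
# first one still holds it gets "address already in use".
second = try_bind("127.0.0.1", port)
conflict = second is None

first.close()
```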

In this PR, we change the initialization order of gloo and NCCL as follows:
HttpServer start -> NCCL init -> GLOO init

HttpServer is used for gloo initialization and uses the rank 0 port. With the new order, the port conflict is resolved.
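
The sequence can be sketched as below. The three step functions are hypothetical placeholders that do nothing; they are not the Paddle functions touched by this PR. The only point is that each step finishes before the next one starts, so the two initializations never overlap.

```python
def run_in_order(steps):
    """Run (name, callable) pairs one after another.

    Each step returns before the next one starts.
    Returns the names in the order they completed.
    """
    finished = []
    for name, step in steps:
        step()
        finished.append(name)
    return finished


# Hypothetical placeholders for the real initialization routines.
def start_http_server():
    pass


def init_nccl():
    pass


def init_gloo():
    pass


order = run_in_order([
    ("HttpServer start", start_http_server),
    ("NCCL init", init_nccl),
    ("GLOO init", init_gloo),
])
print(" -> ".join(order))
```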

@paddle-bot-old

paddle-bot-old bot commented Nov 9, 2020

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@@ -140,21 +140,6 @@ def _check_var_exists(var_name):
http_server.start()
wait_server_ready([ParallelEnv().trainer_endpoints[0]])


This line can be moved to line #169 (just after the nccl initialization) to decrease the wait time.
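
A rough sketch of that idea, using only the Python standard library rather than Paddle's own `http_server` or `wait_server_ready`: start the server in a background thread, do other work, and only then block until the endpoint accepts connections. The helper `wait_until_listening` and the `time.sleep` stand-in for the NCCL step are assumptions made for illustration.

```python
import http.server
import socket
import threading
import time


def wait_until_listening(host, port, timeout=5.0, interval=0.05):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once a connection is accepted, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Start a server on an OS-chosen free port in a background thread.
server = http.server.HTTPServer(
    ("127.0.0.1", 0), http.server.BaseHTTPRequestHandler
)
host, port = server.server_address
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

# Stand-in for other initialization work (for example the NCCL step);
# the server comes up in the background while this runs.
time.sleep(0.1)

# Block only now, right before the server is actually needed.
ready = wait_until_listening(host, port)

server.shutdown()
server.server_close()
```

Placing the wait after the other work means any server start-up time overlaps with that work instead of adding to it.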

@danleifeng danleifeng closed this Nov 12, 2020
@danleifeng danleifeng reopened this Nov 12, 2020
@PaddlePaddle PaddlePaddle locked and limited conversation to collaborators Nov 12, 2020
@PaddlePaddle PaddlePaddle unlocked this conversation Nov 12, 2020
Contributor

@chenwhql chenwhql left a comment


LGTM

@danleifeng danleifeng merged commit a24d186 into PaddlePaddle:develop Nov 16, 2020
3 participants