
Updated network retry delay strategy to scale #3306

Merged
merged 1 commit on Oct 17, 2020

Conversation

aakarshg
Contributor

This allows network retries to scale with the number of machines, while retaining the existing behavior for smaller clusters (num_machines below 500).

Fixes #3301

@ghost

ghost commented Aug 14, 2020

CLA assistant check
All CLA requirements met.

@aakarshg force-pushed the scale_network_retry_delay branch 3 times, most recently from 77eb466 to e5dbedd on August 14, 2020 21:40
Before:

const int connect_fail_retry_cnt = 20;

After:

const int connect_fail_retries_factor_machine = 25;
const int connect_fail_retries_scale_factor = static_cast<int>(num_machines_ / connect_fail_retries_factor_machine);
const int connect_fail_retry_cnt = std::max(20, connect_fail_retries_scale_factor);
Collaborator

So it is 20 when num_machines_ = 500. Is this too small?

Contributor Author

It's 20 for num_machines_ less than 500, correct. It's too small if we're spinning up, say, more than 1000 machines, in which case just getting them all up and running takes a while (especially when we launch them as containers in, say, a Kubernetes cluster).

@aakarshg
Contributor Author

@StrikerRUS and @guolinke can you PTAL at the PR again when you get a chance? Thanks :)

@StrikerRUS left a comment (Collaborator)

I'm not a cpp reviewer, but the concept of this PR looks OK to me!
Just wondering, why do we need two different values, 20 and 25?

@guolinke
Collaborator

@StrikerRUS maybe @aakarshg can come up with better names for these variables.
@aakarshg to me, the 25 is strange too. The number of nodes in distributed learning is usually a power of 2.

@aakarshg
Contributor Author

@StrikerRUS I put in two different values, 20 and 25, for the following reasons:

  1. 20 is the current max number of retries, and I wanted to maintain the same up to the scale of 500 nodes so that it'll still fail early at small scale.
  2. I'm dividing the number of machines by 25, so that up to 500 nodes (500/25) the maximum of (num_machines/25, 20) will still be 20.

But if we have more than 525 machines, the number of retries will be 21 and then scale up accordingly as we increase num_machines (a small sketch follows below).

Hope that explains things.
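As a minimal standalone sketch of this scaling (the sample cluster sizes are illustrative values of my own choosing; the variable names mirror the diff):

#include <algorithm>
#include <cstdio>

int main() {
  // Same formula as in the diff: retries = max(20, num_machines / 25).
  const int sample_cluster_sizes[] = {16, 128, 500, 525, 900, 2000};
  for (int num_machines : sample_cluster_sizes) {
    const int connect_fail_retries_scale_factor = num_machines / 25;
    const int connect_fail_retry_cnt = std::max(20, connect_fail_retries_scale_factor);
    std::printf("num_machines=%4d -> retries=%d\n", num_machines, connect_fail_retry_cnt);
  }
  return 0;
}

which prints 20 retries for 16, 128, and 500 machines, 21 at 525, 36 at 900, and 80 at 2000.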

About the number of nodes usually being a power of 2, that's not true, tbh. There are cases (can't quite go into detail) where num_machines is more tightly tied to the number of files the training data is split across. I was looking at around 900 or so machines :)

@guolinke
Collaborator

@aakarshg
LightGBM uses many collective communication algorithms, which work better with a power-of-2 number of machines.

@StrikerRUS
Collaborator

StrikerRUS commented Aug 21, 2020

@aakarshg

I'm dividing number of machines by 25

Why can't the number of machines be divided by 20? I'm asking because I suppose that 1 magic number in code is better than 2.

@aakarshg
Contributor Author

@aakarshg

I'm dividing number of machines by 25

Why can't the number of machines be divided by 20? I'm asking because I suppose that 1 magic number in code is better than 2.

I can do that, but then the current behavior will only hold up to 400 machines, as 400/20 gives the current count of 20 retries. If that's okay, then I'll update the PR to just divide by 20 and make the code simpler.
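(For concreteness, my own arithmetic rather than anything in the PR: with divisor 20, max(20, num_machines/20) stays at 20 up to 400 machines, becomes 21 at 420, and reaches 45 at 900 machines; with divisor 25, the crossover happens at 525 machines instead.)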

@StrikerRUS
Collaborator

I find both 400 and 500 machines in a cluster to be very big numbers, IMHO. So I don't see any difference between 400 and 500. Is 500 some special threshold?

@aakarshg
Contributor Author

I find both 400 and 500 machines in a cluster to be very big numbers, IMHO. So I don't see any difference between 400 and 500. Is 500 some special threshold?

Nothing really, it's more of an opinion.

@aakarshg

I'm dividing number of machines by 25

Why can't the number of machines be divided by 20? I'm asking because I suppose that 1 magic number in code is better than 2.

Agreed, updated the PR. PTAL again, thanks :)

Comment on lines 190 to 191
const int connect_fail_retries_scale_factor = static_cast<int>(num_machines_ / 20);
const int connect_fail_retry_cnt = std::max(20, connect_fail_retries_scale_factor);
Collaborator

@aakarshg Thank you for addressing the comments! I'm not sure, but maybe it would be better to store 20 in a constant? If we ever need to update this value, a constant will let us change it in only one place.
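Roughly, the quoted lines would then become something like this (the constant name here is only a placeholder for illustration, not a final choice):

// Placeholder name for the base retry count suggested above.
const int kConnectFailRetryCntBase = 20;
const int connect_fail_retries_scale_factor = static_cast<int>(num_machines_ / kConnectFailRetryCntBase);
const int connect_fail_retry_cnt = std::max(kConnectFailRetryCntBase, connect_fail_retries_scale_factor);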

Collaborator

ping @aakarshg

Collaborator

gently ping @aakarshg

Contributor Author

Yes, that'll be absolutely okay, I'll update the PR :) And apologies for the late replies.

Collaborator

Thank you!

This allows network retries to scale with the number of machines, while retaining the existing behavior for smaller clusters (num_machines below 500).

Fixes microsoft#3301
@StrikerRUS merged commit c0c65f7 into microsoft:master on Oct 17, 2020
@aakarshg deleted the scale_network_retry_delay branch on October 19, 2020 13:46
@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Successfully merging this pull request may close these issues:

Make MPI ring connection retry count configurable