
[air] Add train/tune benchmark #26564

Merged: 39 commits into ray-project:master on Jul 19, 2022

Conversation

@xwjiang2010 (Contributor) commented on Jul 14, 2022

Signed-off-by: Xiaowei Jiang [email protected]

Why are these changes needed?

Making sure that tuning multiple trials in parallel is not significantly slower than training each individual trial.
Some overhead is expected.
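
For illustration, here is a minimal sketch of the comparison this benchmark makes. The two run_* functions are placeholder workloads (time.sleep stand-ins for the real Ray Train run and the parallel Tune trials), and the 20% budget mirrors the overhead threshold discussed later in this thread rather than a value defined here:

import time

def run_training_once():
    # Placeholder for a single training run of the workload.
    time.sleep(1.0)

def run_tuning_in_parallel():
    # Placeholder for running the same workload as several parallel Tune trials.
    time.sleep(1.1)

def wall_clock(fn):
    start = time.monotonic()
    fn()
    return time.monotonic() - start

train_time = wall_clock(run_training_once)
tune_time = wall_clock(run_tuning_in_parallel)

# Some overhead is expected; fail only if tuning is significantly slower.
ALLOWED_OVERHEAD = 1.2  # assumed 20% budget, per the review discussion below
assert tune_time <= train_time * ALLOWED_OVERHEAD, (
    f"tune took {tune_time:.2f}s vs. train {train_time:.2f}s"
)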

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@krfricke (Contributor) left a comment:

Looks great! Have you been able to run it?

Can we also add a single node case? 1x1 with CPU=1 vs. 1x1 with CPU=8

release/release_tests.yaml (comment thread resolved)
    train_loop_config=CONFIG,
    scaling_config={
        "num_workers": 4,
        "resources_per_worker": {"CPU": 2},
(Contributor) commented:

I think we'll have to use CPU=4 here to utilize all workers

Suggested change:
-        "resources_per_worker": {"CPU": 2},
+        "resources_per_worker": {"CPU": 4},

@xwjiang2010 (Contributor, Author) replied:

Wait, we are using m5.2xlarge (8 CPU) instances, and each trial takes one full machine (4 workers × 2 CPUs = 8 CPUs). Altogether there are 8 machines.

Signed-off-by: Xiaowei Jiang <[email protected]>
Comment on lines 122 to 130

# Assumes `time` and `functools.wraps` are imported elsewhere in the benchmark script.
def time_it(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        f(*args, **kwargs)
        time_taken = time.monotonic() - start
        return time_taken

    return wrapper
(Contributor) commented:

Maybe we can just use the default Python timeit:

https://docs.python.org/3/library/timeit.html#python-interface
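
For illustration, a minimal sketch of that stdlib alternative; train_func here is a placeholder workload, not the benchmark's actual training function:

import timeit

def train_func():
    # Placeholder workload standing in for the benchmarked training run.
    sum(range(1_000_000))

# timeit.timeit runs the callable `number` times and returns the total elapsed seconds.
time_taken = timeit.timeit(train_func, number=1)
print(f"took {time_taken:.4f}s")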

@richardliaw self-assigned this on Jul 18, 2022
5 commits, each Signed-off-by: Richard Liaw <[email protected]>
Comment on lines 470 to 478

# Some backwards compatibility
try:
    local_rank = session.get_local_rank()
except Exception:
    local_rank = train.local_rank()
# https://github.com/pytorch/pytorch/blob/35563f4fcd28e486cc5
# 8053acc15fe280123e7be/torch/distributed/launch.py#L72-L97
device_id = local_rank
logger.debug(f"setting device id {device_id} as local rank.")
(Contributor) commented:

@amogkam Unless I set this to local_rank, the benchmark script will error out.

If you print get_gpu_ids(), you get (2, 3) (len=2). However, you still need to set the device to local_rank, or else you get a runtime error: "CUDA invalid device ordinal".

I took a look at the blame; this change was made in 029517a. The logic is obviously different, but I'm not quite sure what you were trying to do with the previous logic.
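
For illustration, a minimal sketch of the ordinal mismatch described above, assuming Ray has assigned physical GPUs 2 and 3 to the worker via CUDA_VISIBLE_DEVICES; the constants are hypothetical, and the GPU calls only run on a machine with enough GPUs:

import os
import torch

# Emulate Ray's GPU assignment: the worker only sees physical GPUs 2 and 3,
# which torch renumbers as ordinals 0 and 1.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3")

if torch.cuda.is_available():
    local_rank = 1  # e.g. the second worker process on this node
    torch.cuda.set_device(local_rank)  # valid: only ordinals 0 and 1 exist
    # torch.cuda.set_device(3)         # would raise: invalid device ordinal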

(Contributor) replied:

Yeah, this is a bug; it will be fixed by #26493.

Using local_rank directly won't work for fractional or multiple GPUs per worker.

@richardliaw changed the title from "[air benchmark] add tune benchmark" to "[air] train/tune benchmark + fix train device setting" on Jul 18, 2022
@richardliaw (Contributor) commented:

Do not merge yet, this is based off of #26493

@richardliaw changed the title from "[air] train/tune benchmark + fix train device setting" to "[air] train/tune benchmark" on Jul 19, 2022
Kai Fricke added 3 commits July 19, 2022 10:33
Each Signed-off-by: Kai Fricke <[email protected]>
@xwjiang2010 (Contributor, Author) commented:

Hey folks, I am back. Picking this up now. ETA: EOD today.

Kai Fricke added 2 commits July 19, 2022 14:49
Each Signed-off-by: Kai Fricke <[email protected]>
resources_per_worker={"CPU": 2},
trainer_resources={"CPU": 0},
use_gpu=use_gpu,
placement_strategy="STRICT_PACK",
(Contributor) commented:

cc @xwjiang2010 @amogkam @richardliaw @ericl STRICT_PACK makes a huge difference. Without it, tune would consistently be ~60% worse than train (failing the threshold), most likely due to workers being scheduled on different nodes. With it, we are consistently within the allowed 20% overhead, even with different training job sizes (e.g. more epochs).

Should we raise this with core (improve the "PACK" placement strategy)? In any case we should document this.
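
For context, a minimal standalone sketch of the placement strategies under discussion, using plain Ray placement groups rather than the benchmark script itself; the four 2-CPU bundles mirror the 4 × 2-CPU workers configured earlier in the PR:

import ray
from ray.util.placement_group import placement_group

ray.init()

# One bundle per training worker: 4 workers with 2 CPUs each.
bundles = [{"CPU": 2}] * 4

# PACK co-locates bundles on a best-effort basis and may spill onto other nodes;
# STRICT_PACK only schedules if every bundle fits on a single node, which keeps
# allreduce traffic off the network.
pg = placement_group(bundles, strategy="STRICT_PACK")
ray.get(pg.ready())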

(Contributor) replied:

Actually, it does not consistently pass, but it seems to be closer. It could be due to constant overhead combined with the relatively short training time. I'll increase the number of epochs once more.

(Contributor) added:

With more epochs it seems to be closer, indicating a constant setup overhead: https://buildkite.com/ray-project/release-tests-pr/builds/10008

(Contributor) commented:

Isn't this an odd benchmark case? Most users would presumably be using multiple GPUs per worker (as big as possible), so the packing strategy is irrelevant.

(Contributor) replied:

Performance on these data-parallel systems requires each process to own one GPU, even with multiple GPUs per node.

(Contributor) added:

It's a consequence of the underlying allreduce algorithm.

@amogkam (Contributor) commented on Jul 19, 2022:

Wait, this seems like a pretty serious bug? If STRICT_PACK is feasible, then PACK should always produce the same placement as STRICT_PACK.

Btw, we recently changed the default to SPREAD, cc @matthewdeng. As @ericl pointed out before, all of our benchmarks should be using default configs, no?

(Contributor) replied:

> cc @xwjiang2010 @amogkam @richardliaw @ericl STRICT_PACK makes a huge difference. Without it, tune would consistently be ~60% worse than train (failing the threshold), most likely due to workers being scheduled on different nodes. With it, we are consistently within the allowed 20% overhead, even with different training job sizes (e.g. more epochs).
>
> Should we raise this with core (improve the "PACK" placement strategy)? In any case we should document this.

Just to double-check here: was this under the assumption that the default strategy was PACK and not SPREAD? (I am changing the default back to PACK and will re-run this test with the default strategy.)

@krfricke
Signed-off-by: Kai Fricke <[email protected]>
@krfricke merged commit 75027eb into ray-project:master on Jul 19, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
Making sure that tuning multiple trials in parallel is not significantly slower than training each individual trial.
Some overhead is expected.

Signed-off-by: Xiaowei Jiang <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>

Co-authored-by: Jimmy Yao <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Signed-off-by: Rohan138 <[email protected]>
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
Making sure that tuning multiple trials in parallel is not significantly slower than training each individual trial.
Some overhead is expected.

Signed-off-by: Xiaowei Jiang <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>

Co-authored-by: Jimmy Yao <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: Kai Fricke <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
@xwjiang2010 changed the title from "[air] train/tune benchmark" to "[air] Add train/tune benchmark" on Aug 19, 2022
@xwjiang2010 deleted the tune_benchmark branch on July 26, 2023