[air] Add train/tune benchmark #26564
Conversation
Looks great! Have you been able to run it?
Can we also add a single node case? 1x1 with CPU=1 vs. 1x1 with CPU=8
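A hedged sketch of what those extra single-node cases could look like; the constant name and the exact way they would be plumbed into the benchmark script are illustrative, not taken from this PR:

```python
# Illustrative only: the two single-node cases suggested above, expressed in
# the same dict shape as the scaling_config used elsewhere in this PR.
SINGLE_NODE_CASES = [
    {"num_workers": 1, "resources_per_worker": {"CPU": 1}},  # 1x1 with CPU=1
    {"num_workers": 1, "resources_per_worker": {"CPU": 8}},  # 1x1 with CPU=8
]
```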
```python
train_loop_config=CONFIG,
scaling_config={
    "num_workers": 4,
    "resources_per_worker": {"CPU": 2},
```
I think we'll have to use CPU=4 here to utilize all workers
"resources_per_worker": {"CPU": 2}, | |
"resources_per_worker": {"CPU": 4}, |
Wait, we are using m5.2xlarge (8 CPU) instances and each trial takes one machine. Altogether there are 8 machines.
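A back-of-the-envelope check of the reply above (illustrative numbers pulled from the discussion, not code from the benchmark script): 4 workers at 2 CPUs each need 8 CPUs per trial, which exactly fills one m5.2xlarge, so each trial occupies one machine.

```python
# Resource arithmetic behind the comment above (illustration only).
num_workers = 4
cpus_per_worker = 2
cpus_per_instance = 8                           # m5.2xlarge has 8 vCPUs
cpus_per_trial = num_workers * cpus_per_worker  # 8
assert cpus_per_trial == cpus_per_instance      # one trial fills one machine
```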
Signed-off-by: Xiaowei Jiang <[email protected]>
Force-pushed from 2aeafa4 to 8e6dc82
```python
import time
from functools import wraps


def time_it(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        # Run the wrapped function and return the elapsed wall-clock time
        # instead of the function's own return value.
        start = time.monotonic()
        f(*args, **kwargs)
        time_taken = time.monotonic() - start
        return time_taken

    return wrapper
```
Maybe we can just use the standard Python timeit module:
https://docs.python.org/3/library/timeit.html#python-interface
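A hedged sketch of that suggestion: measuring a single run with the standard library's timeit instead of the custom decorator above. The name train_func is a placeholder for the workload being measured, not an identifier from this PR.

```python
import timeit


def train_func():
    ...  # the training workload to measure


# number=1 because a training run is expensive; timeit.timeit returns the
# total elapsed seconds using a high-resolution monotonic timer.
time_taken = timeit.timeit(train_func, number=1)
```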
Signed-off-by: Richard Liaw <[email protected]>
```python
# Some backwards compatibility
try:
    local_rank = session.get_local_rank()
except Exception:
    local_rank = train.local_rank()
# https://github.com/pytorch/pytorch/blob/35563f4fcd28e486cc58053acc15fe280123e7be/torch/distributed/launch.py#L72-L97
device_id = local_rank
logger.debug(f"setting device id {device_id} as local rank.")
```
@amogkam Unless I set this to local_rank, the benchmark script errors out.
If you print get_gpu_ids(), you get (2, 3) (len=2). However, you still need to set the device to local_rank, or else you get a runtime error: CUDA invalid device ordinal.
I took a look at the blame, which points to this change: 029517a. The logic there is obviously different, but I'm not quite sure what the previous logic was trying to do.
Yeah, this is a bug; it will be fixed by #26493. Using local_rank directly won't work for fractional or multiple GPUs per worker.
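A minimal sketch of the device-selection point being made here, assuming Ray has already restricted CUDA_VISIBLE_DEVICES to the worker's assigned GPUs; the helper name and parameters are illustrative, not from this PR.

```python
import torch


def select_device(local_rank, assigned_gpu_ids):
    # assigned_gpu_ids might be e.g. (2, 3): these are *global* GPU ids, but
    # with CUDA_VISIBLE_DEVICES restricted, torch only sees the visible
    # devices re-indexed from 0, so passing 2 or 3 to set_device raises
    # "RuntimeError: ... invalid device ordinal". Indexing by local rank stays
    # within the visible range.
    # As noted in the follow-up, this only holds for one GPU per worker;
    # fractional or multiple GPUs per worker need the mapping from #26493.
    torch.cuda.set_device(local_rank)
    return torch.device(f"cuda:{local_rank}")
```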
Signed-off-by: Richard Liaw <[email protected]>
Do not merge yet, this is based off of #26493
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Hey folks, I am back. Picking this up now. ETA: EOD today.
Signed-off-by: Kai Fricke <[email protected]>
```python
resources_per_worker={"CPU": 2},
trainer_resources={"CPU": 0},
use_gpu=use_gpu,
placement_strategy="STRICT_PACK",
```
cc @xwjiang2010 @amogkam @richardliaw @ericl STRICT_PACK makes a huge difference. Without it, Tune would consistently be ~60% worse than Train (failing the threshold), most likely due to workers being scheduled on different nodes. With it, we are consistently within the allowed 20% overhead, even with different training job sizes (e.g. more epochs).
Should we raise this with core (improve the "PACK" placement strategy)? In any case we should document this.
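For reference, a hedged sketch of the configuration under discussion, written against the ray.air ScalingConfig API of this era; the diff above passes the same settings as a plain dict, and field names may differ in other Ray versions.

```python
from ray.air.config import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,
    resources_per_worker={"CPU": 2},
    trainer_resources={"CPU": 0},
    use_gpu=False,
    # STRICT_PACK forces all workers of a trial onto one node; PACK merely
    # prefers co-location, and SPREAD distributes workers across nodes.
    placement_strategy="STRICT_PACK",
)
```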
Actually, it does not consistently pass, but it does seem to be closer. It could be due to constant overhead relative to the fairly short training time. I'll increase the training time once more.
With more epochs it seems to be closer, indicating a constant setup overhead: https://buildkite.com/ray-project/release-tests-pr/builds/10008
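An illustration of that observation with made-up numbers (not measurements from the benchmark): a fixed setup cost shrinks as a fraction of total time as the training run grows.

```python
# Relative impact of a constant setup overhead for increasing training times.
setup_s = 60.0  # hypothetical constant setup overhead
for train_s in (120.0, 600.0, 1800.0):
    print(f"{train_s:>6.0f}s of training -> {setup_s / train_s:.0%} relative overhead")
```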
Isn't this an odd benchmark case? Most users would presumably be using multiple GPUs per worker (as big as possible), so the packing strategy is irrelevant.
Performance on these data-parallel systems requires each process to own one GPU, even with multiple GPUs per node.
It's from the underlying allreduce algorithm.
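A hedged sketch of the one-process-per-GPU pattern referenced above, using plain PyTorch DDP outside of Ray purely for illustration; it assumes MASTER_ADDR/MASTER_PORT are set in the environment for rendezvous, and build_model is a placeholder.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_worker(rank: int, world_size: int, build_model):
    # One process per rank, one GPU per process.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(build_model().to(rank), device_ids=[rank])
    # ... training loop; gradients are synchronized via NCCL allreduce ...
    dist.destroy_process_group()
```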
Wait, this seems like a pretty serious bug? If STRICT_PACK is feasible, then PACK should always behave the same as STRICT_PACK.
Btw, we recently changed the default to SPREAD, cc @matthewdeng. As @ericl pointed out before, all of our benchmarks should be using default configs, no?
> cc @xwjiang2010 @amogkam @richardliaw @ericl STRICT_PACK makes a huge difference. Without it, Tune would consistently be ~60% worse than Train (failing the threshold), most likely due to workers being scheduled on different nodes. With it, we are consistently within the allowed 20% overhead, even with different training job sizes (e.g. more epochs).
> Should we raise this with core (improve the "PACK" placement strategy)? In any case we should document this.

Just to double check here, was this under the assumption that the default strategy was PACK and not SPREAD? (I am changing the default back to PACK and will re-run this test with the default strategy.)
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Xiaowei Jiang [email protected]
Why are these changes needed?
Making sure that tuning multiple trials in parallel is not significantly slower than training each trial individually.
Some overhead is expected.
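A hedged sketch of the comparison this benchmark makes; function and variable names are illustrative, not taken from the actual release-test script. The 20% budget mirrors the "allowed 20% overhead" discussed above.

```python
def check_tune_overhead(train_time_s: float, tune_time_s: float,
                        allowed_overhead: float = 0.2) -> None:
    # Time a standalone training run, time the same workload as parallel Tune
    # trials, and fail if the tuned run exceeds the allowed overhead budget.
    limit = train_time_s * (1.0 + allowed_overhead)
    assert tune_time_s <= limit, (
        f"Tuning took {tune_time_s:.1f}s, exceeding the "
        f"{allowed_overhead:.0%} overhead budget over {train_time_s:.1f}s"
    )
```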
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.