[air] Add train/tune benchmark #26564
Conversation
Looks great! Have you been able to run it?
Can we also add a single node case? 1x1 with CPU=1 vs. 1x1 with CPU=8
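A hedged sketch of what those extra single-node cases could look like; the constant name and the exact way they would be plumbed into the benchmark script are illustrative, not taken from this PR:

```python
# Illustrative only: the two single-node cases suggested above, expressed in
# the same dict shape as the scaling_config used elsewhere in this PR.
SINGLE_NODE_CASES = [
    {"num_workers": 1, "resources_per_worker": {"CPU": 1}},  # 1x1 with CPU=1
    {"num_workers": 1, "resources_per_worker": {"CPU": 8}},  # 1x1 with CPU=8
]
```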
```python
train_loop_config=CONFIG,
scaling_config={
    "num_workers": 4,
    "resources_per_worker": {"CPU": 2},
```
I think we'll have to use CPU=4 here to utilize all workers
"resources_per_worker": {"CPU": 2}, | |
"resources_per_worker": {"CPU": 4}, |
Wait, we are using m5.2xlarge (8 CPU) instances and each trial takes one machine. Altogether there are 8 machines.
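A back-of-the-envelope check of the reply above (illustrative numbers pulled from the discussion, not code from the benchmark script): 4 workers at 2 CPUs each need 8 CPUs per trial, which exactly fills one m5.2xlarge, so each trial occupies one machine.

```python
# Resource arithmetic behind the comment above (illustration only).
num_workers = 4
cpus_per_worker = 2
cpus_per_instance = 8                           # m5.2xlarge has 8 vCPUs
cpus_per_trial = num_workers * cpus_per_worker  # 8
assert cpus_per_trial == cpus_per_instance      # one trial fills one machine
```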
Signed-off-by: Xiaowei Jiang <[email protected]>
Force-pushed from 2aeafa4 to 8e6dc82
```python
import time
from functools import wraps


def time_it(f):
    @wraps(f)
    def wrapper(*args, **kwargs):
        # Run the wrapped function and return the elapsed wall-clock time
        # instead of the function's own return value.
        start = time.monotonic()
        f(*args, **kwargs)
        time_taken = time.monotonic() - start
        return time_taken

    return wrapper
```
Maybe we can just use the standard Python timeit module:
https://docs.python.org/3/library/timeit.html#python-interface
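A hedged sketch of that suggestion: measuring a single run with the standard library's timeit instead of the custom decorator above. The name train_func is a placeholder for the workload being measured, not an identifier from this PR.

```python
import timeit


def train_func():
    ...  # the training workload to measure


# number=1 because a training run is expensive; timeit.timeit returns the
# total elapsed seconds using a high-resolution monotonic timer.
time_taken = timeit.timeit(train_func, number=1)
```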
Signed-off-by: Richard Liaw <[email protected]>
```python
# Some backwards compatibility
try:
    local_rank = session.get_local_rank()
except Exception:
    local_rank = train.local_rank()
# https://github.com/pytorch/pytorch/blob/35563f4fcd28e486cc58053acc15fe280123e7be/torch/distributed/launch.py#L72-L97
device_id = local_rank
logger.debug(f"setting device id {device_id} as local rank.")
```
@amogkam Unless I set this to local_rank, the benchmark script errors out.
If you print get_gpu_ids(), you get (2, 3) (len=2). However, you still need to set the device to local_rank, or else you get a runtime error: CUDA invalid device ordinal.
I took a look at the blame, which points to this change: 029517a. The logic there is obviously different, but I'm not quite sure what the previous logic was trying to do.
Yeah, this is a bug; it will be fixed by #26493. Using local_rank directly won't work for fractional or multiple GPUs per worker.
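A minimal sketch of the device-selection point being made here, assuming Ray has already restricted CUDA_VISIBLE_DEVICES to the worker's assigned GPUs; the helper name and parameters are illustrative, not from this PR.

```python
import torch


def select_device(local_rank, assigned_gpu_ids):
    # assigned_gpu_ids might be e.g. (2, 3): these are *global* GPU ids, but
    # with CUDA_VISIBLE_DEVICES restricted, torch only sees the visible
    # devices re-indexed from 0, so passing 2 or 3 to set_device raises
    # "RuntimeError: ... invalid device ordinal". Indexing by local rank stays
    # within the visible range.
    # As noted in the follow-up, this only holds for one GPU per worker;
    # fractional or multiple GPUs per worker need the mapping from #26493.
    torch.cuda.set_device(local_rank)
    return torch.device(f"cuda:{local_rank}")
```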
Signed-off-by: Richard Liaw <[email protected]>
Do not merge yet, this is based off of #26493
Signed-off-by: Richard Liaw <[email protected]>
Signed-off-by: Kai Fricke <[email protected]>
Hey folks, I am back. Picking this up now. ETA: EOD today.
Signed-off-by: Kai Fricke <[email protected]>
```python
resources_per_worker={"CPU": 2},
trainer_resources={"CPU": 0},
use_gpu=use_gpu,
placement_strategy="STRICT_PACK",
```
cc @xwjiang2010 @amogkam @richardliaw @ericl STRICT_PACK makes a huge difference. Without it, Tune would consistently be ~60% worse than Train (failing the threshold), most likely due to workers being scheduled on different nodes. With it, we are consistently within the allowed 20% overhead, even with different training job sizes (e.g. more epochs).
Should we raise this with core (improve the "PACK" placement strategy)? In any case we should document this.
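For reference, a hedged sketch of the configuration under discussion, written against the ray.air ScalingConfig API of this era; the diff above passes the same settings as a plain dict, and field names may differ in other Ray versions.

```python
from ray.air.config import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,
    resources_per_worker={"CPU": 2},
    trainer_resources={"CPU": 0},
    use_gpu=False,
    # STRICT_PACK forces all workers of a trial onto one node; PACK merely
    # prefers co-location, and SPREAD distributes workers across nodes.
    placement_strategy="STRICT_PACK",
)
```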
Actually, it does not consistently pass, but it does seem to be closer. It could be due to constant overhead relative to the fairly short training time. I'll increase the training time once more.
With more epochs it seems to be closer, indicating a constant setup overhead: https://buildkite.com/ray-project/release-tests-pr/builds/10008
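An illustration of that observation with made-up numbers (not measurements from the benchmark): a fixed setup cost shrinks as a fraction of total time as the training run grows.

```python
# Relative impact of a constant setup overhead for increasing training times.
setup_s = 60.0  # hypothetical constant setup overhead
for train_s in (120.0, 600.0, 1800.0):
    print(f"{train_s:>6.0f}s of training -> {setup_s / train_s:.0%} relative overhead")
```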
Isn't this an odd benchmark case? Most users would presumably be using multiple GPUs per worker (as big as possible), so the packing strategy is irrelevant.
Performance on these data-parallel systems requires each process to own one GPU, even with multiple GPUs per node.
It's from the underlying allreduce algorithm.
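A hedged sketch of the one-process-per-GPU pattern referenced above, using plain PyTorch DDP outside of Ray purely for illustration; it assumes MASTER_ADDR/MASTER_PORT are set in the environment for rendezvous, and build_model is a placeholder.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def ddp_worker(rank: int, world_size: int, build_model):
    # One process per rank, one GPU per process.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(build_model().to(rank), device_ids=[rank])
    # ... training loop; gradients are synchronized via NCCL allreduce ...
    dist.destroy_process_group()
```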
Wait, this seems like a pretty serious bug? If STRICT_PACK is feasible, then PACK should always behave the same as STRICT_PACK.
Btw, we recently changed the default to SPREAD, cc @matthewdeng. As @ericl pointed out before, all of our benchmarks should be using default configs, no?
> cc @xwjiang2010 @amogkam @richardliaw @ericl STRICT_PACK makes a huge difference. Without it, Tune would consistently be ~60% worse than Train (failing the threshold), most likely due to workers being scheduled on different nodes. With it, we are consistently within the allowed 20% overhead, even with different training job sizes (e.g. more epochs).
> Should we raise this with core (improve the "PACK" placement strategy)? In any case we should document this.

Just to double check here, was this under the assumption that the default strategy was PACK and not SPREAD? (I am changing the default back to PACK and will re-run this test with the default strategy.)
Signed-off-by: Kai Fricke <[email protected]>
Signed-off-by: Xiaowei Jiang [email protected]
Why are these changes needed?
Making sure that tuning multiple trials in parallel is not significantly slower than training each trial individually.
Some overhead is expected.
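A hedged sketch of the comparison this benchmark makes; function and variable names are illustrative, not taken from the actual release-test script. The 20% budget mirrors the "allowed 20% overhead" discussed above.

```python
def check_tune_overhead(train_time_s: float, tune_time_s: float,
                        allowed_overhead: float = 0.2) -> None:
    # Time a standalone training run, time the same workload as parallel Tune
    # trials, and fail if the tuned run exceeds the allowed overhead budget.
    limit = train_time_s * (1.0 + allowed_overhead)
    assert tune_time_s <= limit, (
        f"Tuning took {tune_time_s:.1f}s, exceeding the "
        f"{allowed_overhead:.0%} overhead budget over {train_time_s:.1f}s"
    )
```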
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.