
[AIR] Experiment restore stress tests #33706

Merged (32 commits) Apr 13, 2023

Conversation

@justinvyu (Contributor) commented Mar 25, 2023

Why are these changes needed?

This test is meant to be an integration stress test for Train/Tune experiment restoration.

Test setup

  • For Tuner.restore:

    • 8 trials, with a max of 2 running concurrently (--> 4 rounds of trials)
    • Each iteration takes 0.5 seconds
    • Each trial runs for 8 iterations --> 4 seconds
    • Each round of 2 trials should take 4 seconds
    • Without any interrupts/restoration:
      • Minimum runtime: 4 rounds * 4 seconds / round = 16 seconds
    • The test will stop the script with a SIGINT at a random time between
      4-8 iterations after each restore.
  • For Trainer.restore:

    • 1 trial with 4 workers
    • Each iteration takes 0.5 seconds
    • Runs for 32 iterations --> Minimum runtime = 16 seconds
    • The test will stop the script with a SIGINT at a random time between
      4-8 iterations after each restore.
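The timing arithmetic above can be sketched in a few lines (a hypothetical check, not part of the test itself):

```python
ITER_TIME_S = 0.5  # each training iteration takes 0.5 seconds

# Tuner.restore case: 8 trials, max 2 concurrent -> 4 rounds of 8 iterations each.
tune_min_runtime = (8 // 2) * 8 * ITER_TIME_S

# Trainer.restore case: 1 trial running for 32 iterations.
train_min_runtime = 32 * ITER_TIME_S

# Both cases have the same 16-second minimum runtime without interrupts.
assert tune_min_runtime == train_min_runtime == 16.0
```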

Test Passing Requirements

  • Req 1: Reasonable runtime
    • The experiment should finish within 1.5 * 16 = 24 seconds.
    • 1.5x is the passing threshold.
  • Req 2: Training progress persisted
    • The experiment should progress monotonically.
      (The training iteration shouldn't go backward at any point)
    • Trials shouldn't start from scratch.
  • Req 3: Searcher state saved/restored correctly
  • Req 4: Callback state saved/restored correctly
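Req 2 amounts to a monotonicity check over the reported training iterations; a minimal illustration (not the test's actual assertion):

```python
def progressed_monotonically(iterations):
    """True if the training iteration never goes backward across restores."""
    return all(later >= earlier for earlier, later in zip(iterations, iterations[1:]))

# Resuming from a checkpoint may repeat an iteration, but never regress.
assert progressed_monotonically([1, 2, 3, 3, 4, 5])
# Restarting from scratch after an interrupt would look like this and must fail.
assert not progressed_monotonically([1, 2, 3, 1, 2])
```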

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@justinvyu justinvyu marked this pull request as draft March 25, 2023 01:50
@justinvyu justinvyu marked this pull request as ready for review April 10, 2023 18:48
@gjoliver (Member) left a comment:
thanks man!

    os.environ.get("RUN_STARTED_MARKER", "/tmp/does-not-exist")
)
if training_started_marker.exists():
    # Multiple workers may be trying to delete the same marker
Member:

instead of try ... except, can you just missing_ok=True?

Contributor (author):

I used that originally, but it seems like missing_ok was introduced in Python 3.8.

Member:
can you please except FileNotFoundError instead then?
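For reference, Path.unlink(missing_ok=True) requires Python 3.8+, so a 3.7-compatible version of the suggested fix could look like this (the marker path is hypothetical):

```python
from pathlib import Path

marker = Path("/tmp/run-started-marker")  # hypothetical marker path

try:
    marker.unlink()
except FileNotFoundError:
    # Multiple workers may race to delete the same marker; losing the race is fine.
    pass
```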

python/ray/air/tests/test_experiment_restore.py (outdated; resolved)
run_started_marker.write_text("", encoding="utf-8")

run = subprocess.Popen(
    [sys.executable, script_path], env=env  # , stderr=subprocess.PIPE
Member:

actually why do you want to go the subprocess route?
why not just write those train() and tune() functions as test code here, and simply call the functions?

Contributor:

I think this is because we are simulating script interruption by user input here, which is closer to the user behavior if run in a subprocess

Member:

I see. 👌
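As context for the discussion above, driving a child script and interrupting it with SIGINT might be sketched like this on POSIX (the inline sleeping child is a stand-in for the real test script):

```python
import signal
import subprocess
import sys
import time

# Hypothetical stand-in for the experiment script launched by the test.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

time.sleep(1.0)                    # give the child time to start up
child.send_signal(signal.SIGINT)   # simulates the user pressing Ctrl+C
child.wait()

assert child.returncode != 0       # the child was interrupted, not completed
```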

@krfricke (Contributor) left a comment:

Generally looks good to me!

python/ray/train/base_trainer.py (resolved)
@gjoliver (Member) left a comment:

cool, cool.
a couple of nits left. thanks again.



# Pass criteria
no_interrupts_runtime = 16.0
passing_factor = 1.5
passing_runtime = no_interrupts_runtime * passing_factor
Member:

I am a little worried about this hardcoded runtime, since tests can act quite differently on CI machines.
I hope this won't be flaky.

Contributor (author):

@gjoliver I see, yeah I was a bit worried about this too. What about making the threshold much more lenient? Like 2x.

The total_runtime calculation only adds up actual training time. On every run, it's calculated as the time between when training started and when the run gets killed by the interrupt. So it's independent of any extra time CI machines might take to initialize the Trainable, handle restoration, etc.
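That accounting could be sketched as follows (timestamps are made up for illustration): per run, only the interval between training start and the interrupt is summed, so restore and setup overhead on slow CI machines doesn't inflate the total.

```python
# Hypothetical (train_start, interrupted_at) timestamps, one pair per run.
runs = [(0.0, 5.2), (7.0, 12.5), (15.0, 19.0)]

# Only actual training time counts; gaps between runs (restore, setup) are excluded.
total_runtime = sum(interrupted - started for started, interrupted in runs)

no_interrupts_runtime = 16.0
passing_factor = 1.5
passing_runtime = no_interrupts_runtime * passing_factor

assert total_runtime <= passing_runtime
```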

@krfricke (Contributor) left a comment:

Nice! @gjoliver can you merge once you're happy?

@gjoliver gjoliver merged commit 17eb052 into ray-project:master Apr 13, 2023
vitsai pushed a commit to vitsai/ray that referenced this pull request Apr 17, 2023
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
ProjectsByJackHe pushed a commit to ProjectsByJackHe/ray that referenced this pull request May 4, 2023
@justinvyu justinvyu deleted the air/experiment_restore_tests branch August 9, 2023 01:24
4 participants