[AIR] Experiment restore stress tests #33706
Conversation
Signed-off-by: Justin Yu <[email protected]>
thanks man!
    os.environ.get("RUN_STARTED_MARKER", "/tmp/does-not-exist")
)
if training_started_marker.exists():
    # Multiple workers may be trying to delete the same marker
Instead of try ... except, can you just use missing_ok=True?
I used that originally, but it seems like missing_ok was only introduced in Python 3.8.
Can you except FileNotFoundError instead, then?
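For Python < 3.8, where Path.unlink(missing_ok=True) isn't available, the worker race on the marker file can be handled like this (a minimal sketch; delete_marker is a hypothetical helper name, not from the PR):

```python
from pathlib import Path


def delete_marker(marker: Path) -> None:
    """Delete a marker file that several workers may race to remove."""
    try:
        marker.unlink()
    except FileNotFoundError:
        # Another worker already deleted the marker; nothing left to do.
        pass
```

On Python 3.8+ this collapses to `marker.unlink(missing_ok=True)`.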
run_started_marker.write_text("", encoding="utf-8")

run = subprocess.Popen(
    [sys.executable, script_path], env=env  # , stderr=subprocess.PIPE
Actually, why do you want to go the subprocess route? Why not just write those train() and tune() functions as test code here and simply call them?
I think this is because we're simulating a script interruption by user input here, which is closer to real user behavior when run in a subprocess.
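A minimal sketch of that pattern, assuming a POSIX system (the throwaway stand-in script, the 1-second startup wait, and the variable names are illustrative, not the test's actual code):

```python
import signal
import subprocess
import sys
import tempfile
import textwrap
import time

# Write a stand-in "training" script that loops until interrupted.
script = textwrap.dedent(
    """
    import time
    try:
        while True:
            time.sleep(0.1)
    except KeyboardInterrupt:
        print("interrupted", flush=True)
    """
)
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(script)
    script_path = f.name

# Launch it as a real child process, then send SIGINT to simulate
# a user pressing Ctrl-C mid-run.
run = subprocess.Popen([sys.executable, script_path], stdout=subprocess.PIPE)
time.sleep(1.0)  # give the child time to enter its training loop
run.send_signal(signal.SIGINT)
out, _ = run.communicate(timeout=10)
```

Because the interrupt arrives as a real signal from outside the process, the child's cleanup path is exercised exactly as it would be for a user's Ctrl-C, which an in-process function call can't reproduce.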
I see. 👌
Generally looks good to me!
Cool, cool. A couple of nits left. Thanks again.
# Pass criteria
no_interrupts_runtime = 16.0
passing_factor = 1.5
passing_runtime = no_interrupts_runtime * passing_factor
I am a little worried about this hardcoded runtime, since tests can act quite differently on CI machines.
I hope this won't be flaky.
@gjoliver I see, yeah, I was a bit worried about this too. What about making the threshold much more lenient, like 2x?
The total_runtime calculation only adds up actual training time. On every run, it's calculated as the time between when training started and when the run gets killed by the interrupt. So it's independent of any extra time CI machines might take to initialize the Trainable, handle restoration, etc.
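A toy sketch of that criterion, using the numbers from the diff above (the loop body and the count of interrupted runs are placeholders, not the test's real measurement code):

```python
import time

# Pass criteria from the test: total training-only time across all
# interrupted runs must stay under a multiple of the baseline runtime.
no_interrupts_runtime = 16.0  # baseline with no interrupts, in seconds
passing_factor = 1.5          # could be loosened to 2.0 for CI variance
passing_runtime = no_interrupts_runtime * passing_factor

total_runtime = 0.0
for _ in range(3):  # pretend the run was interrupted 3 times
    training_started = time.monotonic()
    time.sleep(0.01)  # placeholder for training until the interrupt lands
    total_runtime += time.monotonic() - training_started

# Only time spent actually training is accumulated; process startup and
# restoration overhead never enter total_runtime, which is what makes
# the threshold mostly insensitive to slow CI machines.
passed = total_runtime <= passing_runtime
```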
Nice! @gjoliver can you merge once you're happy?
Why are these changes needed?
This test is meant to be an integration stress test for Train/Tune experiment restoration.

Test setup
For Tuner.restore: 4-8 iterations for each restore.
For Trainer.restore: 4-8 iterations after each restore.

Test Passing Requirements
The training iteration shouldn't go backward at any point.

Related issue number

Checks
- I've signed off every commit (git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- If I've added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.