fix: catch SystemExit and error out [DET-2956] #1116
This seems like the hardest way to make this change, because we are trying to mark every single function that involves user code.
In reality, we are writing a library, and we should never call sys.exit() except around an entrypoint, e.g. when checking command-line arguments.
So maybe we can just replace all of the non-entrypoint sys.exit() calls that we have lying around (there are just a few) with AssertionErrors or something, and wrap the following 3 function calls with checks for sys.exit():
- build_and_run_training_pipeline in determined/exec/harness.py
- main() in determined/exec/worker_process.py (you'll have to move the argv check out of the main() function)
- local_experiment in determined_cli/experiment.py
I think that would be guaranteed to catch all of the sys.exit() calls that a user could possibly make, and we would only need the check in 3 places.
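A minimal sketch of the wrapping idea described above, not the code from this PR. The decorator name `catch_sys_exit`, the exception `UserCodeExitError`, and the stub function are all hypothetical; the stub only stands in for an entrypoint such as build_and_run_training_pipeline.

```python
import functools
import sys


class UserCodeExitError(RuntimeError):
    """Hypothetical error type: raised when wrapped code calls sys.exit()."""


def catch_sys_exit(fn):
    """Hypothetical decorator: convert a SystemExit raised by fn into an error."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except SystemExit as e:
            # sys.exit() raises SystemExit; re-raise it as an ordinary error
            # so the caller sees a failure instead of a silent process exit.
            raise UserCodeExitError(
                f"sys.exit({e.code!r}) was called inside {fn.__name__}"
            ) from e

    return wrapper


@catch_sys_exit
def run_pipeline_stub():
    # Stand-in for an entrypoint that ends up running user code.
    sys.exit(1)
```

With this shape, only the few entrypoints need the decorator; everything called beneath them is covered because SystemExit propagates up the call stack like any other exception.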
Force-pushed from 9484250 to 71c1601.
@shiyuann you have not responded to my high-level points from the previous review
@rb-determined-ai Sorry, I thought I had already responded. There are more than just three places. We also need to catch
Force-pushed from 71c1601 to 7025570.
@rb-determined-ai Never mind. I can put it on the harness execution layer.
Force-pushed from e862662 to de7197c.
Thank you, this is way cleaner.
Do you know, will we hit issues with our own sys.exit() in det._native._init_cluster_mode()?
There are also additional sys.exit() calls to remove in:
- _estimator_trial.py
- gpu.py
- ipc.py
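As an illustration of removing a stray sys.exit() from library code, here is a hedged sketch. The helper `require_gpus` and its arguments are hypothetical and are not taken from gpu.py or the other files listed; the point is only the pattern of raising instead of exiting.

```python
def require_gpus(requested: int, available: int) -> None:
    """Hypothetical library helper, not the real gpu.py code."""
    if requested > available:
        # A check like this might previously have printed a message and
        # called sys.exit(1); raising lets the caller decide how to fail.
        raise AssertionError(
            f"requested {requested} GPUs but only {available} are available"
        )
```

AssertionError derives from Exception, whereas SystemExit derives from BaseException, so an ordinary `except Exception` handler higher up will see the former but not the latter.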
Force-pushed from de7197c to 614a09a.
If it's inside a cluster, we catch it at the harness execution layer. If it's submitting to a cluster, it doesn't catch the
I don't really like putting it in the harness/Native execution layer just to save a few lines of code, since it really takes some mental effort to understand why the code for catching
I fixed the
lol the
This looks really good, but please do address the stray exit() calls I mentioned before merging.
Force-pushed from 614a09a to caf8f92.
@rb-determined-ai addressed
ship it! thanks for fixing these issues
Force-pushed from caf8f92 to e81b3f8.
@clabot check
@cla-bot check
@cla-bot[bot] check
The cla-bot has been summoned, and re-checked this pull request!
Description
Catch SystemExit and error out.
Test Plan
N/A
Commentary (optional)
N/A