Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: catch SystemExit and error out [DET-2956] #1116

Merged
merged 1 commit into from
Sep 1, 2020

Conversation

shiyuann
Copy link
Contributor

@shiyuann shiyuann commented Aug 18, 2020

Description

Catch SystemExit and error out.

Test Plan

N/A

Commentary (optional)

N/A

Copy link
Member

@rb-determined-ai rb-determined-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the hardest way to do this change, because we are trying to mark every single function that involves user code.

In reality, we are writing a library and we should never be calling sys.exit() except around entrypoint when we are e.g. checking command line arguments.

So maybe we can just change all of the non-entrypoint sys.exit()'s that we have laying around (it's just a few) with AssertionErrors or something, and just wrap the following 3 functions calls with checks for sys.exit():

  • build_and_run_training_pipeline in determined/exec/harness.py
  • main() in determined/exec/worker_process.py (you'll have to move the argv check out of the main() function)
  • local_experiment in determined_cli/experiment.py

I think that would be guaranteed to catch all of the sys.exit calls that a user could possibly make, and we only have to have the check in 3 different places.

harness/determined/_execution.py Outdated Show resolved Hide resolved
harness/determined/_execution.py Outdated Show resolved Hide resolved
@rb-determined-ai
Copy link
Member

@shiyuann you have not responded to my high-level points from the previous review

@shiyuann
Copy link
Contributor Author

@rb-determined-ai Sorry, I thought I had already responded.

There are more than just three places. We also need to catch SystemExit in any places that use the _local_execution_manager(). Using this way means we are catching the same error in different layers of the system, both on the harness execution layer and Native API execution layer, which sounds messy to me. It's more about making the system have a good division of layers than how many lines of code you could save.

@shiyuann
Copy link
Contributor Author

@rb-determined-ai Never mind. I can put it on the harness execution layer.

@shiyuann shiyuann force-pushed the ignore-sys-exit branch 2 times, most recently from e862662 to de7197c Compare August 27, 2020 19:15
@shiyuann shiyuann changed the title fix: warn out if catching SystemExit [DET-2956] fix: catch SystemExit and error out [DET-2956] Aug 27, 2020
Copy link
Member

@rb-determined-ai rb-determined-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is way cleaner.

Do you know, will we hit issues with our own sys.exit() in det._native._init_cluster_mode()?

There are also additional sys.exit() calls to remove in:

  • _estimator_trial.py
  • gpu.py
  • ipc.py

harness/determined/exec/harness.py Outdated Show resolved Hide resolved
harness/determined/exec/worker_process.py Outdated Show resolved Hide resolved
@shiyuann
Copy link
Contributor Author

Do you know, will we hit issues with our own sys.exit() in det._native._init_cluster_mode()?

If it's inside cluster, we catch it on the harness execution layer. If it's submitting to a cluster, it doesn't catch the SystemExit. So it would not be an issue.

I don't really like putting it in the harness/Native execution layer just to save some lines of code since it really takes some mental effort to understand why the codes of catching SystemExit are there and why _init_cluster_mode doesn't have that.

There are also additional sys.exit() calls to remove in:
_estimator_trial.py
gpu.py
ipc.py

I fixed the gpu.py. Which sys.exit do you mean in _estimator_trial.py and ipc.py? I haven't seen any code that uses it.

@rb-determined-ai
Copy link
Member

lol ipc.py has two calls to exit(1) which are not even valid python code.

the _estimator_trial.py also has a bare exit(1) which is not valid but which will never ever be hit, since it's the line after an AssertionError.

Copy link
Member

@rb-determined-ai rb-determined-ai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good, but please do address the stray exit() calls I mentioned before merging.

@shiyuann
Copy link
Contributor Author

shiyuann commented Sep 1, 2020

@rb-determined-ai addressed

@shiyuann shiyuann assigned shiyuann and unassigned rb-determined-ai Sep 1, 2020
@rb-determined-ai
Copy link
Member

ship it! thanks for fixing these issues

@rb-determined-ai
Copy link
Member

@clabot check

@rb-determined-ai
Copy link
Member

@cla-bot check

@rb-determined-ai
Copy link
Member

@cla-bot[bot] check

1 similar comment
@rb-determined-ai
Copy link
Member

@cla-bot[bot] check

@cla-bot
Copy link

cla-bot bot commented Sep 1, 2020

The cla-bot has been summoned, and re-checked this pull request!

@shiyuann shiyuann merged commit 13deed4 into determined-ai:master Sep 1, 2020
@shiyuann shiyuann deleted the ignore-sys-exit branch September 1, 2020 19:21
@dannysauer dannysauer added this to the 0.13.2 milestone Feb 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants