Kill descendant processes in core.direct schedulers plugin #6572

Open: wants to merge 4 commits into base: main


@agoscinski agoscinski commented Sep 26, 2024

Proposal to solve #6571

In the direct scheduler we use psutil to obtain a list of descendant processes so that we can kill all of them. This issue does not occur with the other schedulers, because there the job scheduler takes care of it. Here we have to manage the killing of the descendants ourselves.
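
For illustration, here is a minimal sketch of the idea rather than the code in this PR. The helper names `collect_process_ids` and `build_kill_command` are hypothetical, and the sketch assumes the job's process runs on the same machine as the Python interpreter. psutil is imported lazily so that the string-building part can be tried without it installed.

```python
from typing import Iterable, List, Union


def collect_process_ids(process_id: Union[int, str]) -> List[str]:
    """Return the PID of the job followed by the PIDs of all its descendants."""
    import psutil  # third-party dependency, imported lazily on purpose

    parent = psutil.Process(int(process_id))
    # recursive=True walks the whole process tree, not only direct children
    children = parent.children(recursive=True)
    return [str(process_id)] + [str(child.pid) for child in children]


def build_kill_command(process_ids: Iterable[Union[int, str]]) -> str:
    """Join the PIDs into a single ``kill`` invocation."""
    # str() on every element keeps ' '.join() safe for int PIDs as well
    return 'kill ' + ' '.join(str(pid) for pid in process_ids)
```

A call such as `build_kill_command(collect_process_ids(1234))` would then produce one `kill` command covering the parent and every descendant that psutil can see at that moment. Processes spawned after the listing is taken would not be included.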


codecov bot commented Sep 26, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 77.85%. Comparing base (ef60b66) to head (959574e).
Report is 114 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6572      +/-   ##
==========================================
+ Coverage   77.51%   77.85%   +0.35%     
==========================================
  Files         560      566       +6     
  Lines       41444    42044     +600     
==========================================
+ Hits        32120    32730     +610     
+ Misses       9324     9314      -10     


@agoscinski agoscinski marked this pull request as ready for review September 26, 2024 14:26
@agoscinski agoscinski requested review from sphuber and khsrali and removed request for khsrali September 26, 2024 14:28

@khsrali khsrali left a comment

Thanks @agoscinski, really fast at hunting bugs :)
I've left a minor comment.
In any case, it would be nice to add some regression tests.

process_ids.extend([str(child.pid) for child in children])
process_ids_str = ' '.join(process_ids)

submit_command = f'kill {process_ids_str}'

Just a side note:
I've encountered cases where kill PID silently returns without actually killing the job.
I would suggest handling this scenario: if the PID still exists after sending the kill PID command,
inform the user with a log message.
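
One way such a check could look is sketched below. It makes several assumptions: the process is local and on a POSIX system, and the helper names `pid_exists` and `kill_and_verify` are hypothetical rather than part of the scheduler plugin. The existence test uses signal 0, which performs error checking without delivering a signal. Note that a terminated but not yet reaped child (a zombie) still counts as existing with this test.

```python
import logging
import os
import signal
import time

LOGGER = logging.getLogger(__name__)


def pid_exists(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: error checking only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def kill_and_verify(pid: int, timeout: float = 2.0, interval: float = 0.05) -> bool:
    """Send SIGTERM, then poll until the PID is gone or the timeout expires."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return True  # the process was already gone
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not pid_exists(pid):
            return True
        time.sleep(interval)
    LOGGER.warning('Process %s still exists after sending SIGTERM', pid)
    return False
```

In the scheduler plugin the kill command is a string executed elsewhere, so an equivalent check there would have to be expressed as a shell command; the sketch only illustrates the logic.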


@sphuber sphuber left a comment

Thanks @agoscinski. The tests seem to be hanging, so those need to be fixed. I also have a few comments.

-    def _get_kill_command(self, jobid):
-        """Return the command to kill the job with specified jobid."""
-        submit_command = f'kill {jobid}'
+    def _get_kill_command(self, process_id):

By changing jobid to process_id you broke the log line on line 370. Either keep it as jobid or adapt the other lines that reference it accordingly. This would be a breaking change, but since it is an internal method, it is OK to change.

# get a list of the process id of all descendants
process = Process(int(process_id))
children = process.children(recursive=True)
process_ids = [process_id]

I think you should cast to str here explicitly to be safe. Before, it was used in an f-string, which casts automatically, but now you are passing it in the list given to ' '.join(), which will fail if the elements are not all strings.

Suggested change
- process_ids = [process_id]
+ process_ids = [str(process_id)]
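
The difference can be reproduced in isolation. The snippet below is only an illustration with made-up PIDs and a hypothetical helper `join_pids`: an f-string formats an int without complaint, while `str.join` raises `TypeError` as soon as one element is not a string.

```python
def join_pids(process_ids):
    """Join PIDs with spaces, casting each one to str first."""
    return ' '.join(str(pid) for pid in process_ids)


process_id = 1234  # hypothetical PID that arrives as an int

# An f-string casts implicitly, so this works for ints and strings alike.
from_fstring = f'kill {process_id}'

# str.join does not cast: a list containing an int raises TypeError.
try:
    ' '.join([process_id, '1235'])
    join_error = None
except TypeError as exc:
    join_error = exc

# With the explicit cast the same input is fine.
joined = join_pids([process_id, '1235'])
```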

process_ids.extend([str(child.pid) for child in children])
process_ids_str = ' '.join(process_ids)

submit_command = f'kill {process_ids_str}'

Might as well take the opportunity to fix the variable name

Suggested change
- submit_command = f'kill {process_ids_str}'
+ kill_command = f'kill {process_ids_str}'
