
Kill dangling subprocesses #632

Open · wants to merge 17 commits into base: rolling

Conversation

@ivanpauno (Member) commented Jul 25, 2022

Fixes #545.

I used psutil to figure out children of a process recursively.
It's an easy way to handle this issue platform independently.

For POSIX OSs, we could send a signal to the process group, but for that we would have to create a new process group when launching a process, which I'm not sure is the best idea.
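To make the psutil approach concrete, here is a minimal sketch (illustrative, not the PR's actual code; the helper name `kill_descendants` is made up):

```python
# Minimal sketch of the psutil approach described above (illustrative,
# not the PR's actual implementation).
import psutil


def kill_descendants(pid, sig):
    """Send `sig` to every descendant of `pid`, ignoring already-dead ones."""
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    for child in parent.children(recursive=True):
        try:
            child.send_signal(sig)
        except psutil.NoSuchProcess:
            # The child exited between enumeration and signalling.
            pass
```

Because the tree is enumerated first and signalled afterwards, a pid can exit (and in theory be reused) in between, which is exactly the race discussed later in this thread.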

@ivanpauno ivanpauno added the enhancement New feature or request label Jul 25, 2022
@ivanpauno ivanpauno self-assigned this Jul 25, 2022
Review threads on launch/launch/actions/execute_local.py (outdated, resolved)
@hidmic (Contributor) commented Jul 25, 2022

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

@ivanpauno (Member, Author) commented Jul 25, 2022

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

IIUC there's no equivalent to process groups on Windows.
We could use them in POSIX-like OSs though.

(edit) I have to double check this though

@ivanpauno (Member, Author)

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

I have a few reasons:

Maybe we shouldn't have this feature at all, and only log a warning if a "dangling" subprocess is detected.

@jacobperron (Member) left a comment

Seems like a reasonable feature to me.

Maybe we shouldn't have this feature at all, and only log a warning if a "dangling" subprocess is detected.

Is there a scenario you had in mind where we wouldn't want a subprocess to end?

@jacobperron jacobperron requested a review from wjwwood July 27, 2022 20:33
@wjwwood (Member) left a comment

Do we not need to wait for those children to exit?

Something like this example:

https://psutil.readthedocs.io/en/latest/index.html?highlight=children#kill-process-tree

Where it uses psutil.wait_procs?

It would be nice to notify the user in the launch output if the child process still did not exit after some period of time.

Furthermore, does this timer always wait until it expires and then check the subprocesses, or does it only do this if the parent process has to be sent SIGTERM/SIGKILL?

Also, what happens when all the subprocess exit cleanly and quickly? Does this timer still wait and then check them?

It feels like we're missing something here where we await these processes exiting and only send them signals if they don't exit.

@ivanpauno (Member, Author)

Where it uses psutil.wait_procs?

It would be nice to notify the user in the launch output if the child process still did not exit after some period of time.

I don't think we can block when executing an action, so I would need another timer to check if the processes were killed.
Anyway, I'm sending SIGKILL, which cannot be ignored.

Furthermore, does this timer always wait until it expires and then check the subprocesses, or does it only do this if the parent process has to be sent SIGTERM/SIGKILL?

It always sends the signals to all subprocesses, even if SIGTERM/SIGKILL was not needed.

Also, what happens when all the subprocess exit cleanly and quickly? Does this timer still wait and then check them?

Yes, and "process does not exist" errors are ignored when trying to kill the subprocesses that were previously detected in the tree for this reason.
I'm counting on the OS not reassigning the same pid shortly after.

It feels like we're missing something here where we await these processes exiting and only send them signals if they don't exit.

Do you mean we should kill subprocesses going "level by level"?
i.e. first kill the process launched with ExecuteLocal.
After waiting for that process to be killed, kill its subprocesses that are still running.
Is that your idea?
Should this be repeated recursively or only for one level?

@wjwwood (Member) commented Jul 28, 2022

I don't think we can block when executing an action, so I would need another timer to check if the processes were killed.

We can await the processes to exit, either with a thread that sets a future, or if possible using existing asyncio subprocess stuff.

Anyway, I'm sending SIGKILL, which cannot be ignored.

It cannot be caught, but it doesn't mean that it will definitely exit, or how long that will take. I've definitely had some processes fail to exit on kill -9 before, but usually the system is borked then. But the point would be to notify the user (so they don't have to go crawling to ps to see what stayed around), not to do anything else, since there's nothing else we could do after SIGKILL.

It always sends the signals to all subprocesses, even if SIGTERM/SIGKILL was not needed.

Yes, and "process does not exist" errors are ignored when trying to kill the subprocesses that were previously detected in the tree for this reason.

My point was more about the waiting. If the default timeout of sigkill_subprocesses_timeout is like 5 seconds, but all the processes exit immediately, do we wait 5 seconds, send a bunch of signals we don't need to send, then finally exit?

I'm counting on the OS not reassigning the same pid shortly after.

Is that a safe assumption?

Do you mean we should kill subprocesses going "level by level"?
i.e. first kill the process launched with ExecuteLocal.
After waiting that process was killed, kills its subprocesses that are still running.
Is that your idea?

No, that's not what I'm suggesting.

Instead I was thinking something like this:

  • send SIGINT to the "main process" launch started
  • collect all children recursively, and start monitoring their status (checking for termination)
  • wait for the main process to exit, or for SIGTERM timeout
  • if SIGTERM timeout, escalate to SIGKILL for main process and wait
  • when the main process exits, or fails to exit even after SIGKILL and some time has passed, check all child processes
  • SIGTERM each child process
  • await the processes, if they don't exit SIGKILL the ones alive
  • when all have exited, or some time has passed after SIGKILL, finish
  • log an error in launch for each process that never exited, even after SIGKILL

This process is good in my opinion because it:

  • exits as soon as all processes have finished
  • ensures all processes finish, or else logs that they don't
  • avoids jumping to SIGKILL until absolutely needed
  • avoids trying to terminate the child processes until the main process has exited or until it failed to exit after SIGKILL
  • starts tracking the child process statuses before trying to terminate them, so (maybe) you don't have to worry about the OS reassigning them, because you'll be notified (via an await or a thread waiting on them with psutil or similar) when they exit

Obviously the escalation process for the main process and the child processes are the same, so maybe we can generalize that, so it can be used in ExecuteLocal as well as with any individual process we have a pid for.

What I outlined is a bit more work to implement, but I also think it's a lot more thorough and is likely to save us (and our users) some time in the future debugging these kind of issues.
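Since the escalation for the main process and the children is the same, it could be sketched as a single reusable coroutine, something like this (a sketch assuming psutil and asyncio; the function name and default timeouts are illustrative, not launch API):

```python
# Rough sketch of the per-process escalation outlined above. Assumes
# psutil; the name `escalate` and the timeouts are illustrative.
import asyncio

import psutil


async def escalate(proc: psutil.Process, term_timeout=5.0, kill_timeout=5.0):
    """SIGTERM, then SIGKILL, then report if the process still won't exit."""
    try:
        proc.terminate()  # SIGTERM first, giving the process a chance to clean up
        await asyncio.to_thread(proc.wait, term_timeout)
    except psutil.NoSuchProcess:
        return  # already gone, nothing to do
    except psutil.TimeoutExpired:
        proc.kill()  # escalate to SIGKILL
        try:
            await asyncio.to_thread(proc.wait, kill_timeout)
        except psutil.TimeoutExpired:
            # Nothing more we can do after SIGKILL; surface it to the user.
            print(f'process {proc.pid} did not exit, even after SIGKILL')
```

Running one such task per pid would make the whole shutdown finish as soon as every process has exited, rather than waiting for fixed timers.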

@wjwwood (Member) commented Jul 28, 2022

Do you mean we should kill subprocesses going "level by level"?

Another comment about this.

I wasn't thinking that; I was still thinking of doing it as a batch operation. But I also don't think doing it layer by layer is a bad idea, though it might be really annoying if it takes a long time.

The reason I like it as an idea is that I want to give the called processes every chance to behave, and doing a shutdown escalation level by level (top to bottom) is the best way to do that.

I could see this as something of a configuration. If we decide to implement what I outlined above, then making it possible to do it step by step instead of in batch should be easy-ish to do. We just need to keep that in mind when working on it.

@ivanpauno (Member, Author) commented Jul 29, 2022

Is that a safe assumption?

I'm not sure. If that's not a safe assumption, then we cannot do anything.
The best we can do is to notify the user "it looks like a launched process left some subprocesses alive after being killed: {pids}".

starts tracking the child process statuses before trying to terminate them, so (maybe) you don't have to worry about the OS reassigning them

But you cannot really do this:

  • asyncio only allows you to interact with subprocesses that you launched directly.
  • OSs don't provide a way to do this AFAIK. It's possible for a process you launched directly, but not for recursive subprocesses (I think waitpid is limited to direct children on Linux).

@ivanpauno (Member, Author)

The alternative is to create a new process group when launching a process, and then send a signal to that group.
The problem is that we have to handle the POSIX and Windows cases separately, but maybe we can find something that works for both.
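For reference, the POSIX half of that alternative is small. A sketch (the Windows half would need `creationflags=subprocess.CREATE_NEW_PROCESS_GROUP` plus `signal.CTRL_BREAK_EVENT`, which is not a direct equivalent):

```python
# Sketch of the process-group alternative on POSIX: start the child in its
# own session (setsid) via start_new_session, then signal the whole group.
import os
import signal
import subprocess

proc = subprocess.Popen(['sleep', '60'], start_new_session=True)
# The new group's id equals the child's pid; killpg signals every member,
# including any grandchildren spawned into the same group.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
```

Note that children which call `setsid()`/`setpgid()` themselves escape the group, so this doesn't fully replace monitoring which processes actually exited.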

@wjwwood (Member) commented Jul 29, 2022

OSs don't provide a way to do this AFAIK. It's possible for a process you launched directly, but it's not possible for recursive subprocesses (I think waitpid is limited to direct children in linux).

Then how does https://psutil.readthedocs.io/en/latest/index.html?highlight=children#psutil.wait_procs work? In their example they enumerate the children and then wait for them to exit.

@wjwwood (Member) commented Jul 29, 2022

Ok, I see, so no actual blocking is realistic, and it does it one at a time, so you may "miss" the exit of one process while waiting on another.

So maybe that doesn't help with the race between process exit and pid reuse, but I still think waiting to see if they exit is a decent idea. There's nothing worse than having to go back to ps or something to figure out what was left behind, especially if launch could just tell us.

And my proposed series of steps have some other advantages, even if this one doesn't pan out.

@wjwwood (Member) commented Jul 29, 2022

The alternative is to create a new group id when launching a process, and then send a signal to the created group.

What benefit does that give us? (curious)

@ivanpauno (Member, Author)

Ok, I see, so no actual blocking is realistic, and it does it one at a time, so you may "miss" the exit of one process while waiting on another.

Just in case: polling is not only used to check the status of more than one process; it's also used to "wait for a process to exit" when the process is not a child.
Waiting for a process that's not a child is just a loop with a sleep between tries, checking whether a process with pid == x exists (see here).
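That polling loop looks roughly like this (a sketch; `wait_non_child` is an illustrative name, and note that `psutil.pid_exists` reports zombie processes as existing until they are reaped):

```python
# Sketch of waiting for a process that is not our child: all we can do is
# poll for pid existence with a sleep between tries (this mirrors what
# psutil.Process.wait does internally for non-children).
import time

import psutil


def wait_non_child(pid, timeout, interval=0.05):
    """Poll until pid `pid` disappears; return True if it did in time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not psutil.pid_exists(pid):
            return True
        time.sleep(interval)
    return not psutil.pid_exists(pid)
```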

but I still think waiting to see if they exit is a decent idea

Sounds good.
But given the API we have available, I think it's a good idea to use a launch timer for that.
Is that okay?
If another way is preferred, please specify which.

And my proposed series of steps have some other advantages, even if this one doesn't pan out.

Do you mean escalating SIGINT -> SIGTERM -> SIGKILL -> "log if subprocess(es) are still alive" for subprocesses as well?
That sounds fine to me, I can add that.

What benefit does that give us? (curious)

You only send one signal, to the group, though the part that monitors whether the processes exited doesn't change.
So it doesn't seem very beneficial honestly.

@ivanpauno (Member, Author)

And my proposed series of steps have some other advantages, even if this one doesn't pan out.

Do you mean escalating SIGINT -> SIGTERM -> SIGKILL -> "log if subprocess(es) are still alive" for subprocesses as well?
That sounds fine to me, I can add that.

@wjwwood could you confirm this?

@ivanpauno (Member, Author)

@wjwwood friendly ping

@wjwwood (Member) commented Aug 24, 2022

But given the API we have available, I think it's a good idea to use a launch timer for that.
Is that okay?
If another way is preferred, please specify which.

So, I was actually thinking we do something like what psutil does (perhaps using it), but in a thread or maybe as an async coroutine/task. Basically: create a list of the pids you're watching and start timers to escalate their signals. Then a thread/coroutine-task iterates over all the pids we're watching; for each one that has exited, cancel its timers and remove it from the list. After each pass, sleep for a fixed short period and poll again until the list is empty. If, after sending SIGKILL, some period has passed and a pid is still there, log it and remove it from the list of pids to watch.

Do you mean scalating SIGINT -> SIGTERM -> SIGKILL -> "Log if subprocess(es) are still alive" for subprocesses as well?

Correct.
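A minimal synchronous sketch of that watcher loop (illustrative names; the real thing would run as a thread or coroutine-task and also manage the signal-escalation timers):

```python
# Minimal sketch of the pid-watching loop described above; the escalation
# timers are omitted, and the name `watch_pids` is illustrative.
import time

import psutil


def watch_pids(pids, deadline, poll_interval=0.05):
    """Poll `pids` until all exit or `deadline` passes; return survivors."""
    watching = set(pids)
    while watching and time.monotonic() < deadline:
        for pid in list(watching):
            if not psutil.pid_exists(pid):
                watching.discard(pid)  # exited: cancel its timers here
        time.sleep(poll_interval)
    return watching  # anything left should be logged as never having exited
```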

@ivanpauno (Member, Author)

So, I was actually thinking we do something like what psutil does (perhaps using it), but in a thread or maybe as an async coroutine/task. Basically: create a list of the pids you're watching and start timers to escalate their signals. Then a thread/coroutine-task iterates over all the pids we're watching; for each one that has exited, cancel its timers and remove it from the list. After each pass, sleep for a fixed short period and poll again until the list is empty. If, after sending SIGKILL, some period has passed and a pid is still there, log it and remove it from the list of pids to watch.

Please take a look at e326c0d

@ivanpauno ivanpauno requested a review from wjwwood October 6, 2022 15:09
@ciandonovan

Has this not been fixed by #475? When running a ROS 2 launch in "non-interactive" mode, SIGINTs are successfully propagated to child nodes, cleanly terminating the process tree without needing Ctrl-C in a controlling terminal. SIGTERMs still aren't handled at all though, as I mentioned in #666.

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno ivanpauno force-pushed the ivanpauno/kill-dangling-subprocesses branch from ead94fc to 5b43cdc Compare December 5, 2022 15:37
@ivanpauno (Member, Author)

@wjwwood a test case was added in 5b43cdc; I also needed to rebase to resolve conflicts with rolling

@methylDragon (Contributor)

@ivanpauno just some flake8 linting issues

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@methylDragon (Contributor)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@ivanpauno (Member, Author)

mmm, I have to double check those failures, they seem related

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno (Member, Author)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno (Member, Author)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@ivanpauno (Member, Author)

  • Windows Build Status

@ivanpauno (Member, Author)

This is making Windows CI hang for some reason...
I will try to set up a Windows VM and see what's going on, but I'm not sure I will have the time to complete that.

"""Test launching a process with an environment variable."""
executable = ExecuteLocal(
    process_description=Executable(
        cmd=['python3', '-c', f'"{PYTHON_SCRIPT}"'],
Review comment (Member):

Is this sufficient to test the feature? Is it the shell=True part that makes this test useful?

@ivanpauno (Member, Author) replied:

shell=True will create a shell process, and that shell will create a subprocess.
So the launch process has to kill the shell's subprocess, because the shell will not trap the signals and resend them to its child.

This shows that the feature works.
It's not super complete though, if you have more test case ideas I can add them.
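To illustrate why shell=True exercises the feature: the command string runs under an intermediate /bin/sh, so the real workload is a grandchild of launch and won't receive signals sent only to the direct child. A standalone sketch (not the PR's test code):

```python
# Demonstrates the process tree created by shell=True: our direct child
# is /bin/sh, and the actual command is a grandchild of it.
import subprocess
import time

import psutil

proc = subprocess.Popen('sleep 30; true', shell=True)  # sh -c 'sleep 30; true'
time.sleep(0.5)  # give the shell time to fork `sleep`
shell = psutil.Process(proc.pid)
tree = [shell] + shell.children(recursive=True)
names = [p.name() for p in tree]  # shell first, then its children
# Clean up the whole tree so this snippet leaves nothing dangling.
for p in tree:
    try:
        p.kill()
    except psutil.NoSuchProcess:
        pass
proc.wait()
```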

@russkel commented Jun 5, 2023

My use case is a process that spawns some children and then exits, and I do not want launch to exit until the child processes have exited.

I have implemented this in Greenroom-Robotics/launch_ext@850a2f4#diff-7baf6e854cc3c937eaed0b127161c5f82cf86d8a8eaef0038171216a817a0c62R194-R222

My technique was to use the stdin/stdout/stderr inodes: if any process shared the same inodes as the parent process, it was considered a child to be waited on. I'm not sure this approach is ideal, but it does seem to work.

@mwcondino

Howdy @ivanpauno - any chance this could still be merged? I've encountered this issue recently, and have a workaround in place, but it would be ideal to handle the issue directly in ExecuteLocal with this fix 😄

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

launch not killing processes started with shell=True
8 participants