
Kill dangling subprocesses #632

Open · wants to merge 17 commits into base: rolling

Conversation

@ivanpauno (Member) commented Jul 25, 2022

Fixes #545.

I used psutil to figure out children of a process recursively.
It's an easy way to handle this issue platform independently.

For POSIX OSs, we could send a signal to the process group, but for that we would have to create a new process group when launching a process, which I'm not sure is the best idea.
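To make the psutil approach concrete, here is a minimal sketch (illustrative, not the PR's actual code; the helper name `kill_descendants` is made up):

```python
# Minimal sketch of the psutil approach described above (illustrative,
# not the PR's actual implementation).
import psutil


def kill_descendants(pid, sig):
    """Send `sig` to every descendant of `pid`, ignoring already-dead ones."""
    try:
        parent = psutil.Process(pid)
    except psutil.NoSuchProcess:
        return
    for child in parent.children(recursive=True):
        try:
            child.send_signal(sig)
        except psutil.NoSuchProcess:
            # The child exited between enumeration and signalling.
            pass
```

Because the tree is enumerated first and signalled afterwards, a pid can exit (and in theory be reused) in between, which is exactly the race discussed later in this thread.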

@ivanpauno ivanpauno added the enhancement New feature or request label Jul 25, 2022
@ivanpauno ivanpauno self-assigned this Jul 25, 2022
Review threads on launch/launch/actions/execute_local.py (outdated, resolved)
@hidmic (Contributor) commented Jul 25, 2022

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

@ivanpauno (Member, Author) commented Jul 25, 2022

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

IIUC there's no equivalent to process groups on Windows.
We could use them in POSIX-like OSs though.

(edit) I have to double check this though

@ivanpauno (Member, Author)

@ivanpauno before commenting anything, why explicitly list all processes instead of using process groups (and whatever Windows has as an equivalent)?

I have a few reasons:

Maybe we shouldn't have this feature at all, and only log a warning if a "dangling" subprocess is detected.

@jacobperron (Member) left a comment

Seems like a reasonable feature to me.

Maybe we shouldn't have this feature at all, and only log a warning if a "dangling" subprocess is detected.

Is there a scenario you had in mind where we wouldn't want a subprocess to end?

@jacobperron jacobperron requested a review from wjwwood July 27, 2022 20:33
@wjwwood (Member) left a comment

Do we not need to wait for those children to exit?

Something like this example:

https://psutil.readthedocs.io/en/latest/index.html?highlight=children#kill-process-tree

Where it uses psutil.wait_procs?

It would be nice to notify the user in the launch output if the child process still did not exit after some period of time.

Furthermore, does this timer always wait until it expires and then check the subprocesses, or does it only do this if the parent process has to be sent SIGTERM/SIGKILL?

Also, what happens when all the subprocess exit cleanly and quickly? Does this timer still wait and then check them?

It feels like we're missing something here where we await these processes exiting and only send them signals if they don't exit.

@ivanpauno (Member, Author)

Where it uses psutil.wait_procs?

It would be nice to notify the user in the launch output if the child process still did not exit after some period of time.

I don't think we can block when executing an action, so I would need another timer to check if the processes were killed.
Anyway, I'm sending SIGKILL, which cannot be ignored.

Furthermore, does this timer always wait until it expires and then check the subprocesses, or does it only do this if the parent process has to be sent SIGTERM/SIGKILL?

It always sends the signals to all subprocesses, even if SIGTERM/SIGKILL was not needed.

Also, what happens when all the subprocess exit cleanly and quickly? Does this timer still wait and then check them?

Yes, and "process does not exist" errors are ignored when trying to kill the subprocesses that were previously detected in the tree for this reason.
I'm counting on the OS not reassigning the same pid shortly after.

It feels like we're missing something here where we await these processes exiting and only send them signals if they don't exit.

Do you mean we should kill subprocesses going "level by level"?
i.e. first kill the process launched with ExecuteLocal.
After waiting for that process to be killed, kill its subprocesses that are still running.
Is that your idea?
Should this be repeated recursively or only for one level?

@wjwwood (Member) commented Jul 28, 2022

I don't think we can block when executing an action, so I would need another timer to check if the processes were killed.

We can await the processes to exit, either with a thread that sets a future, or if possible using existing asyncio subprocess stuff.

Anyway, I'm sending SIGKILL, which cannot be ignored.

It cannot be caught, but it doesn't mean that it will definitely exit, or how long that will take. I've definitely had some processes fail to exit on kill -9 before, but usually the system is borked then. But the point would be to notify the user (so they don't have to go crawling to ps to see what stayed around), not to do anything else, since there's nothing else we could do after SIGKILL.

It always sends the signals to all subprocesses, even if SIGTERM/SIGKILL was not needed.

Yes, and "process does not exist" errors are ignored when trying to kill the subprocesses that were previously detected in the tree for this reason.

My point was more about the waiting. If the default timeout of sigkill_subprocesses_timeout is like 5 seconds, but all the processes exit immediately, do we wait 5 seconds, send a bunch of signals we don't need to send, then finally exit?

I'm counting on the OS not reassigning the same pid shortly after.

Is that a safe assumption?

Do you mean we should kill subprocesses going "level by level"?
i.e. first kill the process launched with ExecuteLocal.
After waiting that process was killed, kills its subprocesses that are still running.
Is that your idea?

No, that's not what I'm suggesting.

Instead I was thinking something like this:

  • send SIGINT to the "main process" launch started
  • collect all children recursively, and start monitoring their status (checking for termination)
  • wait for the main process to exit, or for SIGTERM timeout
  • if SIGTERM timeout, escalate to SIGKILL for main process and wait
  • when the main process exits, or fails to exit even after SIGKILL and some time has passed, check all child processes
  • SIGTERM each child process
  • await the processes, if they don't exit SIGKILL the ones alive
  • when all have exited, or some time has passed after SIGKILL, finish
  • log an error in launch for each process that never exited, even after SIGKILL

This process is good in my opinion because it:

  • exits as soon as all processes have finished
  • ensures all processes finish, or else logs that they don't
  • avoids jumping to SIGKILL until absolutely needed
  • avoids trying to terminate the child processes until the main process has exited or until it failed to exit after SIGKILL
  • starts tracking the child process statuses before trying to terminate them, so (maybe) you don't have to worry about the OS reassigning them, because you'll be notified (via an await or a thread waiting on them with psutil or similar) when they exit

Obviously the escalation process for the main process and the child processes are the same, so maybe we can generalize that, so it can be used in ExecuteLocal as well as with any individual process we have a pid for.

What I outlined is a bit more work to implement, but I also think it's a lot more thorough and is likely to save us (and our users) some time in the future debugging these kind of issues.
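Since the escalation for the main process and the children is the same, it could be sketched as a single reusable coroutine, something like this (a sketch assuming psutil and asyncio; the function name and default timeouts are illustrative, not launch API):

```python
# Rough sketch of the per-process escalation outlined above. Assumes
# psutil; the name `escalate` and the timeouts are illustrative.
import asyncio

import psutil


async def escalate(proc: psutil.Process, term_timeout=5.0, kill_timeout=5.0):
    """SIGTERM, then SIGKILL, then report if the process still won't exit."""
    try:
        proc.terminate()  # SIGTERM first, giving the process a chance to clean up
        await asyncio.to_thread(proc.wait, term_timeout)
    except psutil.NoSuchProcess:
        return  # already gone, nothing to do
    except psutil.TimeoutExpired:
        proc.kill()  # escalate to SIGKILL
        try:
            await asyncio.to_thread(proc.wait, kill_timeout)
        except psutil.TimeoutExpired:
            # Nothing more we can do after SIGKILL; surface it to the user.
            print(f'process {proc.pid} did not exit, even after SIGKILL')
```

Running one such task per pid would make the whole shutdown finish as soon as every process has exited, rather than waiting for fixed timers.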

@wjwwood (Member) commented Jul 28, 2022

Do you mean we should kill subprocesses going "level by level"?

Another comment about this.

I wasn't thinking that; I was still thinking of doing it as a batch operation. But I also don't think doing it layer by layer is a bad idea, though it might be really annoying if it takes a long time.

The reason I like it as an idea is that I want to give the called processes every chance to behave, and doing a shutdown escalation level by level (top to bottom) is the best way to do that.

I could see this as something of a configuration. If we decide to implement what I outlined above, then making it possible to do it step by step instead of in batch should be easy-ish to do. We just need to keep that in mind when working on it.

@ivanpauno (Member, Author) commented Jul 29, 2022

Is that a safe assumption?

I'm not sure. If that's not a safe assumption, then we cannot do anything.
The best we can do is to notify the user "it looks like a launched process left some subprocesses alive after being killed: {pids}".

starts tracking the child process statuses before trying to terminate them, so (maybe) you don't have to worry about the OS reassigning them

But you cannot really do this:

  • asyncio only allows you to interact with subprocesses that you launched directly.
  • OSs don't provide a way to do this AFAIK. It's possible for a process you launched directly, but not for recursive subprocesses (I think waitpid is limited to direct children on Linux).

@ivanpauno (Member, Author)

The alternative is to create a new process group when launching a process, and then send a signal to that group.
The problem is that we have to handle the POSIX and Windows cases separately, but maybe we can find something that works for both.
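For reference, the POSIX half of that alternative is small. A sketch (the Windows half would need `creationflags=subprocess.CREATE_NEW_PROCESS_GROUP` plus `signal.CTRL_BREAK_EVENT`, which is not a direct equivalent):

```python
# Sketch of the process-group alternative on POSIX: start the child in its
# own session (setsid) via start_new_session, then signal the whole group.
import os
import signal
import subprocess

proc = subprocess.Popen(['sleep', '60'], start_new_session=True)
# The new group's id equals the child's pid; killpg signals every member,
# including any grandchildren spawned into the same group.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
```

Note that children which call `setsid()`/`setpgid()` themselves escape the group, so this doesn't fully replace monitoring which processes actually exited.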

@wjwwood (Member) commented Jul 29, 2022

OSs don't provide a way to do this AFAIK. It's possible for a process you launched directly, but it's not possible for recursive subprocesses (I think waitpid is limited to direct children in linux).

Then how does https://psutil.readthedocs.io/en/latest/index.html?highlight=children#psutil.wait_procs work? In their example they enumerate the children and then wait for them to exit.

@wjwwood (Member) commented Jul 29, 2022

Ok, I see, so no actual blocking is realistic, and it does it one at a time, so you may "miss" the exit of one process while waiting on another.

So maybe that doesn't help with the race between process exit and pid reuse, but I still think waiting to see if they exit is a decent idea. There's nothing worse than having to go back to ps or something to figure out what was left behind, especially if launch could just tell us.

And my proposed series of steps have some other advantages, even if this one doesn't pan out.

@wjwwood (Member) commented Jul 29, 2022

The alternative is to create a new group id when launching a process, and then send a signal to the created group.

What benefit does that give us? (curious)

@ivanpauno (Member, Author)

Ok, I see, so no actual blocking is realistic, and it does it one at a time, so you may "miss" the exit of one process while waiting on another.

Just in case: polling is not only used to check the status of more than one process; it's also used to "wait for a process to exit" when the process is not a child.
Waiting for a process that's not a child is just a loop with a sleep between tries, checking whether a process with pid == x exists (see here).
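That polling loop looks roughly like this (a sketch; `wait_non_child` is an illustrative name, and note that `psutil.pid_exists` reports zombie processes as existing until they are reaped):

```python
# Sketch of waiting for a process that is not our child: all we can do is
# poll for pid existence with a sleep between tries (this mirrors what
# psutil.Process.wait does internally for non-children).
import time

import psutil


def wait_non_child(pid, timeout, interval=0.05):
    """Poll until pid `pid` disappears; return True if it did in time."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not psutil.pid_exists(pid):
            return True
        time.sleep(interval)
    return not psutil.pid_exists(pid)
```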

but I still think waiting to see if they exit is a decent idea

Sounds good.
But given the API we have available, I think it's a good idea to use a launch timer for that.
Is that okay?
If another way is preferred, please specify which.

And my proposed series of steps have some other advantages, even if this one doesn't pan out.

Do you mean escalating SIGINT -> SIGTERM -> SIGKILL -> "log if subprocess(es) are still alive" for subprocesses as well?
That sounds fine to me, I can add that.

What benefit does that give us? (curious)

You only send one signal, to the group, though the part that monitors whether the processes exited doesn't change.
So it doesn't seem very beneficial honestly.

@ivanpauno (Member, Author)

And my proposed series of steps have some other advantages, even if this one doesn't pan out.

Do you mean escalating SIGINT -> SIGTERM -> SIGKILL -> "log if subprocess(es) are still alive" for subprocesses as well?
That sounds fine to me, I can add that.

@wjwwood could you confirm this?

@ivanpauno (Member, Author)

@wjwwood friendly ping

@wjwwood (Member) commented Aug 24, 2022

But given the API we have available, I think it's a good idea to use a launch timer for that.
Is that okay?
If another way is preferred, please specify which.

So, I was actually thinking we do something like what psutil does (perhaps using it), but in a thread or maybe as an async coroutine/task. Basically: create a list of the pids you're watching and start timers to escalate their signals. Then a thread/coroutine-task iterates over all the pids we're watching; for each one that has exited, cancel its timers and remove it from the list. After each pass, sleep for a fixed short period and poll again until the list is empty. If, after sending SIGKILL, some period has passed and a pid is still there, log it and remove it from the list of pids to watch.

Do you mean scalating SIGINT -> SIGTERM -> SIGKILL -> "Log if subprocess(es) are still alive" for subprocesses as well?

Correct.
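A minimal synchronous sketch of that watcher loop (illustrative names; the real thing would run as a thread or coroutine-task and also manage the signal-escalation timers):

```python
# Minimal sketch of the pid-watching loop described above; the escalation
# timers are omitted, and the name `watch_pids` is illustrative.
import time

import psutil


def watch_pids(pids, deadline, poll_interval=0.05):
    """Poll `pids` until all exit or `deadline` passes; return survivors."""
    watching = set(pids)
    while watching and time.monotonic() < deadline:
        for pid in list(watching):
            if not psutil.pid_exists(pid):
                watching.discard(pid)  # exited: cancel its timers here
        time.sleep(poll_interval)
    return watching  # anything left should be logged as never having exited
```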

@ivanpauno (Member, Author)

So, I was actually thinking we do something like what psutil does (perhaps using it), but in a thread or maybe as an async coroutine/task. Basically: create a list of the pids you're watching and start timers to escalate their signals. Then a thread/coroutine-task iterates over all the pids we're watching; for each one that has exited, cancel its timers and remove it from the list. After each pass, sleep for a fixed short period and poll again until the list is empty. If, after sending SIGKILL, some period has passed and a pid is still there, log it and remove it from the list of pids to watch.

Please take a look at e326c0d

@ivanpauno ivanpauno requested a review from wjwwood October 6, 2022 15:09
@ciandonovan

Has this not been fixed by #475? When running a ROS 2 launch in "non-interactive" mode, SIGINTs are successfully propagated to child nodes, cleanly terminating the process tree without needing Ctrl-C in a controlling terminal. SIGTERMs still aren't handled at all though, as I mentioned in #666.

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno ivanpauno force-pushed the ivanpauno/kill-dangling-subprocesses branch from ead94fc to 5b43cdc Compare December 5, 2022 15:37
@ivanpauno (Member, Author)

@wjwwood a test case was added in 5b43cdc; I also needed to rebase to resolve conflicts with rolling

@methylDragon (Contributor)

@ivanpauno just some flake8 linting issues

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@methylDragon (Contributor)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@ivanpauno (Member, Author)

mmm, I have to double check those failures, they seem related

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno (Member, Author)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

Signed-off-by: Ivan Santiago Paunovic <[email protected]>
@ivanpauno (Member, Author)

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Windows Build Status

@ivanpauno (Member, Author)

  • Windows Build Status

@ivanpauno (Member, Author)

This is making Windows CI hang for some reason...
I will try to set up a Windows VM and see what's going on, but I'm not sure I will have the time to complete that.

"""Test launching a process with an environment variable."""
executable = ExecuteLocal(
    process_description=Executable(
        cmd=['python3', '-c', f'"{PYTHON_SCRIPT}"'],
Review comment (Member):

Is this sufficient to test the feature? Is it the shell=True part that makes this test useful?

@ivanpauno (Member, Author) replied:

shell=True will create a shell process, and that shell will create a subprocess.
So the launch process has to kill the shell's subprocess, because the shell will not trap the signals and resend them to its child.

This shows that the feature works.
It's not super complete though, if you have more test case ideas I can add them.
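To illustrate why shell=True exercises the feature: the command string runs under an intermediate /bin/sh, so the real workload is a grandchild of launch and won't receive signals sent only to the direct child. A standalone sketch (not the PR's test code):

```python
# Demonstrates the process tree created by shell=True: our direct child
# is /bin/sh, and the actual command is a grandchild of it.
import subprocess
import time

import psutil

proc = subprocess.Popen('sleep 30; true', shell=True)  # sh -c 'sleep 30; true'
time.sleep(0.5)  # give the shell time to fork `sleep`
shell = psutil.Process(proc.pid)
tree = [shell] + shell.children(recursive=True)
names = [p.name() for p in tree]  # shell first, then its children
# Clean up the whole tree so this snippet leaves nothing dangling.
for p in tree:
    try:
        p.kill()
    except psutil.NoSuchProcess:
        pass
proc.wait()
```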

@russkel commented Jun 5, 2023

My use case is a process that spawns some children and then exits, and I do not want launch to exit until the child processes have exited.

I have implemented this in Greenroom-Robotics/launch_ext@850a2f4#diff-7baf6e854cc3c937eaed0b127161c5f82cf86d8a8eaef0038171216a817a0c62R194-R222

My technique was to use the stdin/stdout/stderr inodes: if any process shared the same inodes as the parent process, it was considered a child to be waited on. I'm not sure this approach is ideal, but it does seem to work.

@mwcondino

Howdy @ivanpauno - any chance this could still be merged? I've encountered this issue recently, and have a workaround in place, but it would be ideal to handle the issue directly in ExecuteLocal with this fix 😄

Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

launch not killing processes started with shell=True
8 participants