PVF worker: Mitigate process leaks #7296
base: master
Conversation
Use `PR_SET_PDEATHSIG` on Linux. To my knowledge the node doesn't do anything to clean up child workers, which can result in leaks. The child should eventually die on its own and it's hard to see how this can be abused, but this change just hardens the situation and is good practice.
node/core/pvf/common/src/worker.rs
Outdated
// Use `PR_SET_PDEATHSIG` to ensure that the child is sent a kill signal when the parent dies.
//
// NOTE: This technically has a race as the parent may have already died. In that case we will
// fail to read the socket later and just shutdown.
#[cfg(linux)]
if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) != 0 {
	return
}
Warning: the "parent" in this case is considered to be the thread that created this process. In other words, the signal will be sent when that thread terminates (via, for example, [pthread_exit(3)](https://man7.org/linux/man-pages/man3/pthread_exit.3.html)), rather than after all of the threads in the parent process terminate.
AKA, this isn't doing what you think.
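To make the man-page caveat concrete, here is a minimal stand-alone sketch (assuming Linux and the `libc` crate; this is illustrative and not code from this PR): the child is spawned from a short-lived thread, and the death signal fires when that thread exits, even though the parent process keeps running.

```rust
use std::os::unix::process::{CommandExt, ExitStatusExt};
use std::process::Command;

fn main() {
    // Spawn the child from a thread that exits right away.
    let mut child = std::thread::spawn(|| {
        let mut cmd = Command::new("sleep");
        cmd.arg("60");
        let set_pdeathsig = || -> std::io::Result<()> {
            // SAFETY: prctl is async-signal-safe, so it is fine to call between
            // fork and exec. The setting must be made in the child because it is
            // cleared on fork.
            if unsafe { libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) } != 0 {
                return Err(std::io::Error::last_os_error());
            }
            Ok(())
        };
        // SAFETY: the pre_exec hook only performs async-signal-safe work.
        unsafe { cmd.pre_exec(set_pdeathsig) };
        cmd.spawn().expect("failed to spawn child")
        // The spawning thread returns here and terminates.
    })
    .join()
    .unwrap();

    // The parent *process* is still alive, but the child is killed as soon as
    // the spawning *thread* is gone: wait() reports death by SIGKILL.
    let status = child.wait().expect("wait failed");
    println!("child terminated by signal: {:?}", status.signal());
}
```

Running this should print `Some(9)` almost immediately rather than waiting out the 60-second sleep.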
Good catch. Hmm, since threads have their own PID, I think this still addresses the race identified in this PR:
// NOTE: On non-Linux this has a race condition between getting the pid and sending the
// signal -- the parent may have died and another process been assigned the same pid. On
// Linux this is not a problem -- we use `PR_SET_PDEATHSIG` to ensure that the child is sent
// a kill signal when the parent dies.
// [...]
let ppid = libc::getppid();
if ppid > 1 {
libc::kill(ppid, libc::SIGTERM);
}
So the issue moves up a level, i.e. it becomes about avoiding leaked threads in the parent as opposed to leaked/orphaned child processes.
Not sure which race you mean now: the one in the comment above, or the fact that we don't kill workers on node exit?
I mean the race between getting the pid and sending the kill signal is addressed even with the caveat you posted from the man page.
And as you pointed out, threads will exit when the process does, so the caveat doesn't actually change anything. So I think `PR_SET_PDEATHSIG` works.
This is still somewhat equivalent: the pvf-future is spawned on the blocking pool, so it can only be terminated along with a thread from that pool.
Why do we spawn this on a blocking pool?
But yeah, it would need to be documented that this is a hard requirement now.
Sorry @slumber, I'm not really following - maybe my brain is running slow, can you explain more? I'm also not sure we should document implementation details too much; `PR_SET_PDEATHSIG` should work regardless of how the host is implemented.
I also removed some of the comments for now. Maybe documenting such rare corner cases was overkill; it just leads to noise.
Why do we spawn this on a blocking pool?
I honestly don't know.
can you explain more?
Basti quoted documentation mentioning that your prctl call will terminate the worker once the parent thread terminates, not the node itself. Suppose the pvf-future spawns a worker and then moves to a different tokio thread (tokio workers steal work). That original thread terminating would mean the worker gets killed, and, as Basti mentioned, you're not doing what you wanted to.
Then I noted that the pvf future is in fact spawned on a blocking pool, essentially `tokio::spawn_blocking(block_on(pvf_host))`. Given this, your PR works: that thread terminating is equivalent to terminating the pvf host, which is sufficient for worker termination. If we replace `spawn_blocking` with `spawn`, your PR doesn't work anymore; this should be documented.
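A rough sketch of the two strategies being contrasted here (the `pvf_host` name and the spawning code are placeholders, not the actual polkadot code):

```rust
use futures::executor::block_on;

// Placeholder for the PVF host future: somewhere inside, it forks/execs
// worker processes, so with PR_SET_PDEATHSIG each worker is tied to the OS
// thread the host happens to be running on when that worker is spawned.
async fn pvf_host() { /* ... */ }

fn main() {
    let rt = tokio::runtime::Runtime::new().unwrap();

    // The current situation as described above: the host future is pinned
    // to a single blocking-pool thread for its whole lifetime, so "that
    // thread terminates" and "the host terminates" are the same event and
    // the death signal does what we want.
    let pinned = rt.spawn_blocking(|| block_on(pvf_host()));

    // If this were a plain `spawn` instead, the future could be polled on
    // different worker threads over time; the thread that happened to fork
    // a given worker could then go away while the host future lives on, at
    // which point that worker would be killed prematurely.
    // let migrating = rt.spawn(pvf_host());

    rt.block_on(pinned).unwrap();
}
```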
I see, you are correct. I'm not sure how I feel about relying on it being a blocking pool, since that fact can change.
Does work-stealing mean that the thread that spawned work can actually die while the work is still active? That would surprise me since it seems unsound (precisely for cases like `PR_SET_PDEATHSIG`), but TBH I don't know how tokio's scheduler works under the hood.
Work stealing means each worker thread has its own queue and threads steal tasks from each other; "work stealing" is a term used by the golang scheduler and by tokio, you can read about it.
The main point is that a task can move to another thread. For example, the pvf future is on thread A and blocked, then thread A steals another future and panics.
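A tiny stand-alone demo of that point (unrelated to the PR code): on tokio's multi-threaded runtime a task can be resumed on a different worker thread after an `.await`, so "the thread that spawned something" and "the task that owns it" are not the same thing.

```rust
#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    let tasks: Vec<_> = (0..8)
        .map(|i| {
            tokio::spawn(async move {
                let before = std::thread::current().id();
                tokio::time::sleep(std::time::Duration::from_millis(10)).await;
                let after = std::thread::current().id();
                // `before` and `after` can differ: while the task was parked,
                // another worker thread picked it up.
                println!("task {i}: {before:?} -> {after:?}");
            })
        })
        .collect();
    for t in tasks {
        t.await.unwrap();
    }
}
```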
I'd take a step back and try to reproduce the behavior where the node is killed and workers persist. Did you ever come across such behavior? I don't really understand how it is possible. IIUC there are only two possible scenarios:
Agreed @s0me0ne-unkn0wn, it is hard to see how not killing the worker can lead to problems in practice. Killing it is just a standard hardening step though, and it seemed simple enough to add. A theoretical problem I noticed before this change is the race when killing the parent node (between getting the pid and sending the signal).
Afaik PIDs are allocated in a sequence that only wraps around when some maximum PID number limit is reached. The highest PID on my system right now is 1555131, so it seems the limit is rather high. For the race condition to become a problem, it would have to wrap around between two sequential syscalls (getting the parent pid and sending the signal).
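As a quick check of that bound, the kernel exposes the ceiling at `/proc/sys/kernel/pid_max`; PIDs only wrap around to low values once the counter reaches this number. A trivial sketch to read it:

```rust
// Reads the kernel's PID ceiling on Linux; PIDs only wrap around once the
// counter reaches this value.
fn main() {
    let pid_max: u64 = std::fs::read_to_string("/proc/sys/kernel/pid_max")
        .expect("reading /proc/sys/kernel/pid_max")
        .trim()
        .parse()
        .expect("pid_max should be a number");
    println!("PIDs wrap around after {pid_max}");
}
```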
I agree it's a corner case. There have been exploits that benefited from leaky processes, but I don't know the details of how they worked. I do have a vague idea of this being useful under system pressure, where processes are dying and being restarted en masse. I also have a vague idea of a malicious PVF killing the parent node somehow so it can have more time to be malicious (though right now it can just spawn a new process to avoid the death signal). But putting in PR_SET_PDEATHSIG is pretty standard and shouldn't be controversial, and it can be a simple hardening step in the interim until full sandboxing. I can remove the comments I added if they are excessive noise.
I think this is identical to #5726 from a while ago. My idea is that it would be logical to kill the workers if they don't have a function anymore, maybe for security reasons but also just to tidy up the process overview.
This is not related, and as long as the node is an active validator we should not bring down the workers. However, killing them if the node is not part of the set makes sense.
One thing that should be considered is that the parent process returned by getppid can change if the parent was stopped in between. Another race could be that the node exits between starting the worker and registering PR_SET_PDEATHSIG.
Might be clean to shut down the whole validation subsystem in that case, which should have the same effect. Can reopen that issue if we want this.
There are ways to prevent this race involving a pipe from the parent, but that requires extra code on the parent. It should be okay if we try to read from the socket later, get an error, and exit...
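For reference, a hedged sketch of that pipe-based approach (illustrative only; the names are hypothetical and this is not what the PR implements): the parent keeps the write end of a pipe open for its lifetime, the worker inherits the read end, and EOF on the read end means the parent is gone. It also shows the usual way to narrow the registration race, by checking `getppid` right after the `prctl`.

```rust
use std::os::unix::process::CommandExt;
use std::process::{Child, Command};

/// Spawn a worker with a "lifeline" pipe. Names and structure are
/// hypothetical; this is not the polkadot worker-spawning code.
fn spawn_worker_with_lifeline(program: &str) -> std::io::Result<Child> {
    // Create the lifeline. The parent keeps the write end open and never
    // writes to it; the worker inherits the read end across exec.
    let mut fds = [0i32; 2];
    if unsafe { libc::pipe(fds.as_mut_ptr()) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    let (read_fd, write_fd) = (fds[0], fds[1]);
    // Don't let the worker inherit the write end, or it would hold its own
    // lifeline open and never observe EOF.
    unsafe { libc::fcntl(write_fd, libc::F_SETFD, libc::FD_CLOEXEC) };

    let mut cmd = Command::new(program);
    // Tell the worker which fd is the lifeline, e.g. as an argument. In its
    // main loop, a read returning 0 (EOF) on this fd means the node is gone.
    cmd.arg(read_fd.to_string());

    let guard_against_orphaning = || -> std::io::Result<()> {
        // SAFETY: prctl, getppid and _exit are async-signal-safe, so they may
        // be called between fork and exec. Register the death signal, then
        // close the registration race: if the parent already died we have
        // been reparented (typically to pid 1), so bail out instead of
        // outliving the node.
        unsafe {
            if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) != 0
                || libc::getppid() == 1
            {
                libc::_exit(1);
            }
        }
        Ok(())
    };
    // SAFETY: the hook above only performs async-signal-safe work.
    unsafe { cmd.pre_exec(guard_against_orphaning) };

    // The parent must keep `write_fd` open (e.g. stored alongside the Child
    // handle) for as long as it wants the worker to stay alive.
    cmd.spawn()
}
```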
// Explicitly terminate the worker here. Not strictly necessary as it would end when it
// fails to read from the socket later.
libc::kill(worker_pid as i32, libc::SIGKILL);
Then why do it? Not saying we should not, but there should be some reason.
The idea was just to be explicit, and so we don't rely on coincidental architectural details that may change.
But, maybe we should instead have it be explicit that the worker dies when it can't read from the socket. The parent may also die for some reason other than us killing it.
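A hedged sketch of that behavior (illustrative, not the actual worker loop in `node/core/pvf`): the worker treats any failed or empty read of its next job from the socket as the signal to shut down, which also covers the parent dying for reasons other than us killing it.

```rust
use std::io::Read;
use std::os::unix::net::UnixStream;

// Hypothetical worker main loop: any failure to read the next job is
// treated as "the node is gone (or something is badly wrong), shut down".
fn worker_loop(mut socket: UnixStream) {
    let mut len_buf = [0u8; 4];
    loop {
        match socket.read(&mut len_buf) {
            // EOF (node closed the socket or died) or a read error: exit
            // instead of lingering as an orphaned process.
            Ok(0) | Err(_) => {
                eprintln!("socket closed or errored; worker shutting down");
                std::process::exit(0);
            }
            Ok(_) => {
                // ... decode the job length, read the job, execute it, and
                // write the response back on the socket ...
            }
        }
    }
}
```

This matches the fallback the comments above rely on: when the node disappears, the worker's next socket read fails and the worker exits on its own.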
I would say: if we already die when the read fails, let's leave it at that, but document that error handling accordingly.
PULL REQUEST
Overview
Use `PR_SET_PDEATHSIG` on Linux. To my knowledge the node doesn't do anything to clean up child workers, which can result in leaks. The child should eventually die on its own and it's hard to see how this can be abused, but this change just hardens the situation and is good practice.
Sources
Related
#6861 - origin of the node/worker mismatch check