
PVF worker: Mitigate process leaks #7296

Open · wants to merge 3 commits into master
Conversation

@mrcnski (Contributor) commented May 25, 2023

PULL REQUEST

Overview

Use PR_SET_PDEATHSIG on Linux.

To my knowledge the node doesn't do anything to clean up child workers, which can result in leaks. The child should eventually die on its own, and it's hard to see how this could be abused, but this change hardens the situation and is good practice.

Sources

Related

#6861 - origin of the node/worker mismatch check

@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. labels May 25, 2023
@mrcnski mrcnski self-assigned this May 25, 2023
Comment on lines 85 to 92
// Use `PR_SET_PDEATHSIG` to ensure that the child is sent a kill signal when the parent dies.
//
// NOTE: This technically has a race as the parent may have already died. In that case we will
// fail to read the socket later and just shutdown.
#[cfg(target_os = "linux")]
// SAFETY: `prctl` is a plain syscall; all arguments here are integers.
if unsafe { libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) } != 0 {
    return
}
Member

Warning: the "parent" in this case is considered to be the thread that created this process. In other words, the signal will be sent when that thread terminates (via, for example, [pthread_exit(3)](https://man7.org/linux/man-pages/man3/pthread_exit.3.html)), rather than after all of the threads in the parent process terminate.

Member

AKA, this isn't doing what you think.
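
For illustration, a small self-contained demo of the quoted caveat (an editorial sketch, not code from the PR; it assumes the `libc` crate as a dependency). The child arms PR_SET_PDEATHSIG via `pre_exec`, and per the man-page excerpt above it is killed when the spawning thread exits, even though the parent process stays alive:

use std::os::unix::process::CommandExt;
use std::process::Command;
use std::time::Duration;

fn main() {
    // Spawn the child from a short-lived thread.
    let handle = std::thread::spawn(|| {
        let mut cmd = Command::new("sleep");
        cmd.arg("1000");
        unsafe {
            // Runs in the child after fork, before exec.
            cmd.pre_exec(|| {
                if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) != 0 {
                    return Err(std::io::Error::last_os_error())
                }
                Ok(())
            });
        }
        cmd.spawn().expect("spawn failed")
        // The spawning thread exits here, which is what triggers the death signal.
    });
    let mut child = handle.join().unwrap();
    std::thread::sleep(Duration::from_millis(100));
    // Reports a SIGKILL exit status even though this (parent) process is still alive.
    println!("child status: {:?}", child.try_wait());
}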

Contributor Author

Good catch. Hmm, since threads have their own pid, I think this still addresses the race identified in this PR.

// NOTE: On non-Linux this has a race condition between getting the pid and sending the
// signal -- the parent may have died and another process been assigned the same pid. On
// Linux this is not a problem -- we use `PR_SET_PDEATHSIG` to ensure that the child is sent
// a kill signal when the parent dies.
// [...]
// SAFETY: `getppid` and `kill` are plain syscalls with no memory-safety concerns.
unsafe {
    let ppid = libc::getppid();
    if ppid > 1 {
        libc::kill(ppid, libc::SIGTERM);
    }
}

So the issue moves up a level, i.e. it becomes about avoiding leaked threads in the parent as opposed to leaked/orphaned child processes.

Member

Not sure which race you mean now - the one in the comment above, or that we don't kill workers on node exit?

Contributor Author

@mrcnski mrcnski May 26, 2023

I mean the race between getting the pid and sending the kill signal is addressed, even with the caveat you quoted from the man page.

And as you pointed out, threads will exit when the process does, so the caveat doesn't actually change anything. So I think PR_SET_PDEATHSIG works.

Member

This is still somewhat equivalent: the pvf-future is spawned on a blocking pool, so it can only be terminated along with a thread from that pool.

Why do we spawn this on a blocking pool?

But yeah it would need to be documented that this is a hard requirement now.

Contributor Author

Sorry @slumber, I'm not really following - maybe my brain is running slow, can you explain more? I'm also not sure we should document implementation details too much; PR_SET_PDEATHSIG should work regardless of how the host is implemented.

I also removed some of the comments for now. Maybe documenting such rare corner cases was overkill; it just leads to noise.

Contributor

Why do we spawn this on a blocking pool?

I honestly don't know.

can you explain more?

Basti quoted documentation mentioning that your prctl call will terminate the worker once the parent thread terminates, not when the node itself does. Suppose the pvf-future spawns a worker and then moves to a different tokio thread (tokio workers steal work). That thread terminating would mean the worker gets killed, and, as Basti mentioned, you're not doing what you wanted to.

Then I noted that the pvf future is in fact spawned on a blocking pool. Essentially, tokio::spawn_blocking(block_on(pvf_host)). Given this, your PR works: this thread terminating is equivalent to terminating the pvf host, which is sufficient for worker termination. If we replace spawn_blocking with spawn, your PR doesn't work anymore; this should be documented.
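
A minimal sketch of that distinction (editorial; `run_pvf_host` and `spawn_host` are hypothetical stand-ins, not the actual host API):

use futures::executor::block_on;

// Hypothetical stand-in for the actual PVF host future, which spawns the
// worker processes (and thus fixes their PR_SET_PDEATHSIG parent thread).
async fn run_pvf_host() { /* ... */ }

fn spawn_host(handle: &tokio::runtime::Handle) {
    // What the code does today: the future runs to completion on one
    // dedicated blocking-pool thread, so that thread dying and the host
    // dying are the same event, and the death signal fires at the right time.
    let _ = handle.spawn_blocking(|| block_on(run_pvf_host()));

    // With a plain `spawn` the future lands on the work-stealing pool and
    // can migrate between threads, so a worker's death signal would be tied
    // to whichever thread happened to poll the future when the worker was
    // spawned -- not to the host's lifetime.
    // let _ = handle.spawn(run_pvf_host());
}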

Contributor Author

I see, you are correct. I'm not sure how I feel about relying on it being a blocking pool, since that fact can change.

Does work-stealing mean that the thread that spawned work can actually die while the work is still active? That would surprise me since it seems unsound (precisely for cases like PR_SET_PDEATHSIG), but TBH I don't know how tokio's scheduler works under the hood.

Contributor

Work stealing means each worker thread has its own queue and threads steal tasks from each other; "work stealing" is a term used by the Go scheduler and tokio, you can read about it.

The main point is that a task can move to another thread. For example, the pvf future is on thread A and blocked; thread A steals another future, which panics and takes the thread down.

node/core/pvf/common/src/worker.rs
@s0me0ne-unkn0wn (Contributor)

I'd take a step back and try to reproduce the behavior where the node is killed and the workers persist. Did you ever come across such behavior? I don't really understand how it is possible. IIUC there are only two possible scenarios (a sketch of the resulting worker loop follows the list):

  1. When the node is killed, the worker is waiting in blocking recv_request().await?. When the node dies, the OS frees its resources along with UNIX sockets, await returns Err, the loop is terminated, and the worker exits;
  2. When the node is killed, the worker is doing something. In that case, it will finish what it is doing and after that will send_response().await? which will also result in loop termination as the socket doesn't exist anymore.
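
A minimal sketch of that loop shape (editorial; it assumes a hypothetical length-prefixed framing over the UNIX socket, not the worker's real protocol):

use std::io;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::UnixStream;

async fn recv_request(socket: &mut UnixStream) -> io::Result<Vec<u8>> {
    let len = socket.read_u32().await? as usize;
    let mut buf = vec![0; len];
    socket.read_exact(&mut buf).await?;
    Ok(buf)
}

async fn send_response(socket: &mut UnixStream, response: &[u8]) -> io::Result<()> {
    socket.write_u32(response.len() as u32).await?;
    socket.write_all(response).await
}

async fn worker_loop(mut socket: UnixStream) -> io::Result<()> {
    loop {
        // Scenario 1: if the node died while we wait here, the OS has closed
        // its end of the socket, the read fails, and `?` ends the loop.
        let request = recv_request(&mut socket).await?;
        let response = request; // stand-in for the actual PVF work
        // Scenario 2: if the node died while we were working, this write
        // fails instead, with the same result.
        send_response(&mut socket, &response).await?;
    }
}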

@mrcnski (Contributor Author) commented May 26, 2023

Agreed @s0me0ne-unkn0wn, it is hard to see how not killing the worker can lead to problems in practice. Killing it is just a standard hardening step though and it seemed simple enough to add.

A theoretical problem I noticed before this change is the race when killing the parent node in kill_parent_node_in_emergency. We get the parent pid and then send a kill signal to that pid, but between those steps the parent may have died and a different process been assigned that same pid. One way to fix that is by using pidfds (another Linux-only feature, sketched below), but what I did here seemed simpler and like a more general hardening step.
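
For reference, a rough sketch of the pidfd route (editorial illustration of the idea, not the PR's code; Linux 5.1+, raw syscalls via the `libc` crate):

#[cfg(target_os = "linux")]
fn kill_parent_via_pidfd() {
    unsafe {
        let ppid = libc::getppid();
        if ppid <= 1 {
            return
        }
        // A pidfd is a stable handle to the process: once opened, signals
        // sent through it cannot hit a recycled pid.
        let pidfd = libc::syscall(libc::SYS_pidfd_open, ppid, 0);
        if pidfd < 0 {
            return
        }
        // Close the remaining getppid/pidfd_open gap: if the parent had died
        // in between, we would have been reparented and getppid would no
        // longer match.
        if libc::getppid() == ppid {
            libc::syscall(
                libc::SYS_pidfd_send_signal,
                pidfd,
                libc::SIGTERM,
                std::ptr::null::<libc::siginfo_t>(),
                0,
            );
        }
        libc::close(pidfd as i32);
    }
}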

@s0me0ne-unkn0wn (Contributor)

We get the parent pid and then send a kill signal to that pid, but between those steps the parent may have died and a different process been assigned that same pid.

Afaik PIDs are allocated in a sequence that only wraps around when some maximum PID number limit is reached. The highest PID on my system right now is 1555131, so it seems the limit is rather high. For the race condition to become a problem, it has to wrap around between two sequential syscalls (getppid() and kill()), which doesn't sound like a very realistic scenario, and killing the node is a very narrow corner case by itself. I'm not sure if any effort should be put into trying to prevent that 🤔

@mrcnski (Contributor Author) commented May 26, 2023

I agree it's a corner case. There have been exploits that benefited from leaky processes but I don't know the details of how they worked. I do have a vague idea of this being useful under system pressure where processes are dying and being restarted en masse. I also have a vague idea of a malicious PVF killing the parent node somehow so it can have more time to be malicious (though right now it can just spawn a new process to avoid the death signal).

But putting in PR_SET_PDEATHSIG is pretty standard and shouldn't be controversial, and can be a simple hardening step in the interim until full sandboxing. I can remove the comments I added if they are excessive noise.

@stakeworld (Contributor)

I think this is identical to #5726 from a while ago. My idea is that it would be logical to kill the workers if they don't have a function anymore - maybe for security reasons, but also just to tidy up the process overview.

@bkchr (Member) commented May 27, 2023

I think this is identical to #5726 from a while ago. My idea is that it would be logical to kill the workers if they don't have a function anymore, maybe for security reasons but also just to tidy up the process overview

This is not related and as long as the node is an active validator we should not bring down the workers. However, killing them if the node is not part of the set makes sense.

@bkchr (Member) commented May 27, 2023

it has to wrap around between two sequential syscalls (getppid() and kill()), which doesn't sound like a very realistic scenario,

One thing that should be considered is that the returned parent process can change if the parent was stopped in between. Another race could be that the node ends between starting the worker and registering PR_SET_PDEATHSIG.

@mrcnski (Contributor Author) commented May 28, 2023

However, killing them if the node is not part of the set makes sense.

Might be clean to shut down the whole validation subsystem in that case, which should have the same effect. Can reopen that issue if we want this.

Another race could be that the node ends between starting the worker and registering PR_SET_PDEATHSIG

There are ways to prevent this race involving a pipe from the parent, but that requires extra code on the parent (another common mitigation is sketched below). It should be okay if we try to read from the socket later, get an error, and exit...
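
For the record, a common mitigation for that startup race, assuming the node passes its own pid to the worker (e.g. as a hypothetical CLI argument; the PR does not do this and relies on the later socket read failing instead):

#[cfg(target_os = "linux")]
unsafe fn arm_death_signal(expected_parent: libc::pid_t) -> bool {
    if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) != 0 {
        return false
    }
    // If the node died before the prctl call above, we were already
    // reparented (typically to pid 1) and the death signal will never
    // arrive; detect that here instead of waiting for a socket read to fail.
    libc::getppid() == expected_parent
}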

@mrcnski mrcnski requested a review from slumber May 28, 2023 11:42

// Explicitly terminate the worker here. Not strictly necessary as it would end when it
// fails to read from the socket later.
// SAFETY: `kill` is a plain syscall taking only integers.
unsafe { libc::kill(worker_pid as i32, libc::SIGKILL) };
Member

Then why do it? Not saying we should not, but there should be some reason.

Contributor Author

The idea was just to be explicit, and so we don't rely on coincidental architectural details that may change.

But, maybe we should instead have it be explicit that the worker dies when it can't read from the socket. The parent may also die for some reason other than us killing it.

Member

I would say, if we already die when the read fails, let's leave it at that, but document that error handling accordingly.
