
PVF worker: Mitigate process leaks #7296

Open · wants to merge 3 commits into master
Conversation

@mrcnski (Contributor) commented May 25, 2023

PULL REQUEST

Overview

Use PR_SET_PDEATHSIG on Linux.

To my knowledge the node doesn't do anything to clean up child workers, which can result in leaks. The child should eventually die on its own, and it's hard to see how this could be abused, but this change hardens the situation and is good practice.

Sources

Related

#6861 - origin of the node/worker mismatch check

@mrcnski mrcnski added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit. labels May 25, 2023
@mrcnski mrcnski self-assigned this May 25, 2023
Comment on lines 85 to 92
// Use `PR_SET_PDEATHSIG` to ensure that the child is sent a kill signal when the parent dies.
//
// NOTE: This technically has a race as the parent may have already died. In that case we will
// fail to read the socket later and just shutdown.
#[cfg(target_os = "linux")]
// SAFETY: `prctl` is a plain syscall; all arguments here are integers.
if unsafe { libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) } != 0 {
    return
}
Member

Warning: the "parent" in this case is considered to be the thread that created this process. In other words, the signal will be sent when that thread terminates (via, for example, [pthread_exit(3)](https://man7.org/linux/man-pages/man3/pthread_exit.3.html)), rather than after all of the threads in the parent process terminate.

Member

AKA, this isn't doing what you think.
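
For illustration, a small self-contained demo of the quoted caveat (an editorial sketch, not code from the PR; it assumes the `libc` crate as a dependency). The child arms PR_SET_PDEATHSIG via `pre_exec`, and per the man-page excerpt above it is killed when the spawning thread exits, even though the parent process stays alive:

use std::os::unix::process::CommandExt;
use std::process::Command;
use std::time::Duration;

fn main() {
    // Spawn the child from a short-lived thread.
    let handle = std::thread::spawn(|| {
        let mut cmd = Command::new("sleep");
        cmd.arg("1000");
        unsafe {
            // Runs in the child after fork, before exec.
            cmd.pre_exec(|| {
                if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) != 0 {
                    return Err(std::io::Error::last_os_error())
                }
                Ok(())
            });
        }
        cmd.spawn().expect("spawn failed")
        // The spawning thread exits here, which is what triggers the death signal.
    });
    let mut child = handle.join().unwrap();
    std::thread::sleep(Duration::from_millis(100));
    // Reports a SIGKILL exit status even though this (parent) process is still alive.
    println!("child status: {:?}", child.try_wait());
}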

Contributor Author

Good catch. Hmm, since threads have their own pid, I think this still addresses the race identified in this PR.

// NOTE: On non-Linux this has a race condition between getting the pid and sending the
// signal -- the parent may have died and another process been assigned the same pid. On
// Linux this is not a problem -- we use `PR_SET_PDEATHSIG` to ensure that the child is sent
// a kill signal when the parent dies.
// [...]
// SAFETY: `getppid` and `kill` are plain syscalls with no memory-safety concerns.
unsafe {
    let ppid = libc::getppid();
    if ppid > 1 {
        libc::kill(ppid, libc::SIGTERM);
    }
}

So the issue moves up a level, i.e. it becomes about avoiding leaked threads in the parent as opposed to leaked/orphaned child processes.

Member

Not sure which race you mean now - the one in the comment above, or that we don't kill workers on node exit?

Contributor Author

@mrcnski mrcnski May 26, 2023

I mean the race between getting the pid and sending the kill signal is addressed, even with the caveat you quoted from the man page.

And as you pointed out, threads will exit when the process does, so the caveat doesn't actually change anything. So I think PR_SET_PDEATHSIG works.

Member

This is still somewhat equivalent: the pvf-future is spawned on a blocking pool, so it can only be terminated along with a thread from that pool.

Why do we spawn this on a blocking pool?

But yeah it would need to be documented that this is a hard requirement now.

Contributor Author

Sorry @slumber, I'm not really following - maybe my brain is running slow, can you explain more? I'm also not sure we should document implementation details too much; PR_SET_PDEATHSIG should work regardless of how the host is implemented.

I also removed some of the comments for now. Maybe documenting such rare corner cases was overkill; it just leads to noise.

Contributor

Why do we spawn this on a blocking pool?

I honestly don't know.

can you explain more?

Basti quoted documentation mentioning that your prctl call will terminate the worker once the parent thread terminates, not when the node itself does. Suppose the pvf-future spawns a worker and then moves to a different tokio thread (tokio workers steal work). That thread terminating would mean the worker gets killed, and, as Basti mentioned, you're not doing what you wanted to.

Then I noted that the pvf future is in fact spawned on a blocking pool. Essentially, tokio::spawn_blocking(block_on(pvf_host)). Given this, your PR works: this thread terminating is equivalent to terminating the pvf host, which is sufficient for worker termination. If we replace spawn_blocking with spawn, your PR doesn't work anymore; this should be documented.
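
A minimal sketch of that distinction (editorial; `run_pvf_host` and `spawn_host` are hypothetical stand-ins, not the actual host API):

use futures::executor::block_on;

// Hypothetical stand-in for the actual PVF host future, which spawns the
// worker processes (and thus fixes their PR_SET_PDEATHSIG parent thread).
async fn run_pvf_host() { /* ... */ }

fn spawn_host(handle: &tokio::runtime::Handle) {
    // What the code does today: the future runs to completion on one
    // dedicated blocking-pool thread, so that thread dying and the host
    // dying are the same event, and the death signal fires at the right time.
    let _ = handle.spawn_blocking(|| block_on(run_pvf_host()));

    // With a plain `spawn` the future lands on the work-stealing pool and
    // can migrate between threads, so a worker's death signal would be tied
    // to whichever thread happened to poll the future when the worker was
    // spawned -- not to the host's lifetime.
    // let _ = handle.spawn(run_pvf_host());
}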

Contributor Author

I see, you are correct. I'm not sure how I feel about relying on it being a blocking pool, since that fact can change.

Does work-stealing mean that the thread that spawned work can actually die while the work is still active? That would surprise me since it seems unsound (precisely for cases like PR_SET_PDEATHSIG), but TBH I don't know how tokio's scheduler works under the hood.

Contributor

Work stealing means each worker thread has its own queue and threads steal tasks from each other; "work stealing" is a term used by the Go scheduler and tokio, you can read about it.

The main point is that a task can move to another thread. For example, the pvf future is on thread A and blocked; thread A steals another future, which panics and takes the thread down.

node/core/pvf/common/src/worker.rs
@s0me0ne-unkn0wn (Contributor)

I'd take a step back and try to reproduce the behavior where the node is killed and the workers persist. Did you ever come across such behavior? I don't really understand how it is possible. IIUC there are only two possible scenarios (a sketch of the resulting worker loop follows the list):

  1. When the node is killed, the worker is waiting in blocking recv_request().await?. When the node dies, the OS frees its resources along with UNIX sockets, await returns Err, the loop is terminated, and the worker exits;
  2. When the node is killed, the worker is doing something. In that case, it will finish what it is doing and after that will send_response().await? which will also result in loop termination as the socket doesn't exist anymore.
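
A minimal sketch of that loop shape (editorial; it assumes a hypothetical length-prefixed framing over the UNIX socket, not the worker's real protocol):

use std::io;
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::UnixStream;

async fn recv_request(socket: &mut UnixStream) -> io::Result<Vec<u8>> {
    let len = socket.read_u32().await? as usize;
    let mut buf = vec![0; len];
    socket.read_exact(&mut buf).await?;
    Ok(buf)
}

async fn send_response(socket: &mut UnixStream, response: &[u8]) -> io::Result<()> {
    socket.write_u32(response.len() as u32).await?;
    socket.write_all(response).await
}

async fn worker_loop(mut socket: UnixStream) -> io::Result<()> {
    loop {
        // Scenario 1: if the node died while we wait here, the OS has closed
        // its end of the socket, the read fails, and `?` ends the loop.
        let request = recv_request(&mut socket).await?;
        let response = request; // stand-in for the actual PVF work
        // Scenario 2: if the node died while we were working, this write
        // fails instead, with the same result.
        send_response(&mut socket, &response).await?;
    }
}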

@mrcnski (Contributor Author) commented May 26, 2023

Agreed @s0me0ne-unkn0wn, it is hard to see how not killing the worker can lead to problems in practice. Killing it is just a standard hardening step though and it seemed simple enough to add.

A theoretical problem I noticed before this change is the race when killing the parent node in kill_parent_node_in_emergency. We get the parent pid and then send a kill signal to that pid, but between those steps the parent may have died and a different process been assigned that same pid. One way to fix that is by using pidfds (another Linux-only feature, sketched below), but what I did here seemed simpler and like a more general hardening step.
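
For reference, a rough sketch of the pidfd route (editorial illustration of the idea, not the PR's code; Linux 5.1+, raw syscalls via the `libc` crate):

#[cfg(target_os = "linux")]
fn kill_parent_via_pidfd() {
    unsafe {
        let ppid = libc::getppid();
        if ppid <= 1 {
            return
        }
        // A pidfd is a stable handle to the process: once opened, signals
        // sent through it cannot hit a recycled pid.
        let pidfd = libc::syscall(libc::SYS_pidfd_open, ppid, 0);
        if pidfd < 0 {
            return
        }
        // Close the remaining getppid/pidfd_open gap: if the parent had died
        // in between, we would have been reparented and getppid would no
        // longer match.
        if libc::getppid() == ppid {
            libc::syscall(
                libc::SYS_pidfd_send_signal,
                pidfd,
                libc::SIGTERM,
                std::ptr::null::<libc::siginfo_t>(),
                0,
            );
        }
        libc::close(pidfd as i32);
    }
}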

@s0me0ne-unkn0wn (Contributor)

We get the parent pid and then send a kill signal to that pid, but between those steps the parent may have died and a different process been assigned that same pid.

Afaik PIDs are allocated in a sequence that only wraps around when some maximum PID number limit is reached. The highest PID on my system right now is 1555131, so it seems the limit is rather high. For the race condition to become a problem, it has to wrap around between two sequential syscalls (getppid() and kill()), which doesn't sound like a very realistic scenario, and killing the node is a very narrow corner case by itself. I'm not sure if any effort should be put into trying to prevent that 🤔

@mrcnski (Contributor Author) commented May 26, 2023

I agree it's a corner case. There have been exploits that benefited from leaky processes but I don't know the details of how they worked. I do have a vague idea of this being useful under system pressure where processes are dying and being restarted en masse. I also have a vague idea of a malicious PVF killing the parent node somehow so it can have more time to be malicious (though right now it can just spawn a new process to avoid the death signal).

But putting in PR_SET_PDEATHSIG is pretty standard and shouldn't be controversial, and can be a simple hardening step in the interim until full sandboxing. I can remove the comments I added if they are excessive noise.

@stakeworld (Contributor)

I think this is identical to #5726 from a while ago. My idea is that it would be logical to kill the workers if they don't have a function anymore - maybe for security reasons, but also just to tidy up the process overview.

@bkchr (Member) commented May 27, 2023

I think this is identical to #5726 from a while ago. My idea is that it would be logical to kill the workers if they don't have a function anymore, maybe for security reasons but also just to tidy up the process overview

This is not related and as long as the node is an active validator we should not bring down the workers. However, killing them if the node is not part of the set makes sense.

@bkchr (Member) commented May 27, 2023

it has to wrap around between two sequential syscalls (getppid() and kill()), which doesn't sound like a very realistic scenario,

One thing that should be considered is that the returned parent process can change if the parent was stopped in between. Another race could be that the node ends between starting the worker and registering PR_SET_PDEATHSIG.

@mrcnski (Contributor Author) commented May 28, 2023

However, killing them if the node is not part of the set makes sense.

Might be clean to shut down the whole validation subsystem in that case, which should have the same effect. Can reopen that issue if we want this.

Another race could be that the node ends between starting the worker and registering PR_SET_PDEATHSIG

There are ways to prevent this race involving a pipe from the parent, but that requires extra code on the parent (another common mitigation is sketched below). It should be okay if we try to read from the socket later, get an error, and exit...
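
For the record, a common mitigation for that startup race, assuming the node passes its own pid to the worker (e.g. as a hypothetical CLI argument; the PR does not do this and relies on the later socket read failing instead):

#[cfg(target_os = "linux")]
unsafe fn arm_death_signal(expected_parent: libc::pid_t) -> bool {
    if libc::prctl(libc::PR_SET_PDEATHSIG, libc::SIGKILL, 0, 0, 0) != 0 {
        return false
    }
    // If the node died before the prctl call above, we were already
    // reparented (typically to pid 1) and the death signal will never
    // arrive; detect that here instead of waiting for a socket read to fail.
    libc::getppid() == expected_parent
}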

@mrcnski mrcnski requested a review from slumber May 28, 2023 11:42

// Explicitly terminate the worker here. Not strictly necessary as it would end when it
// fails to read from the socket later.
// SAFETY: `kill` is a plain syscall taking only integers.
unsafe { libc::kill(worker_pid as i32, libc::SIGKILL) };
Member

Then why do it? Not saying we should not, but there should be some reason.

Contributor Author

The idea was just to be explicit, and so we don't rely on coincidental architectural details that may change.

But, maybe we should instead have it be explicit that the worker dies when it can't read from the socket. The parent may also die for some reason other than us killing it.

Member

I would say, if we already die when the read fails, let's leave it at that, but document that error handling accordingly.
