-
-
Notifications
You must be signed in to change notification settings - Fork 638
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Port nailgun client to rust #10865
Port nailgun client to rust #10865
Conversation
398d719
to
3a77186
Compare
self_process = psutil.Process() | ||
children = self_process.children() | ||
logger.debug(f"Sending SIGINT to child processes: {children}") | ||
logger.warning(f"Sending SIGINT to child processes: {children}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to replace SIGINT with the _received_signal or signal.strsignal(_received_signal) or signal.getsignal(_received_signal).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
signal.strsignal
apparently only exists after python 3.8
self_process = psutil.Process() | ||
children = self_process.children() | ||
logger.debug(f"Sending SIGINT to child processes: {children}") | ||
logger.warning(f"Sending SIGINT to child processes: {children}") | ||
for child_process in children: | ||
child_process.send_signal(signal.SIGINT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/signal.SIGINT/_received_signal/
def __init__(self, pailgun_client: NailgunClient, pid: int, timeout: float = 1): | ||
self._pailgun_client = pailgun_client | ||
self._timeout = timeout | ||
def __init__(self, pid: int): | ||
self.pid = pid | ||
super().__init__(pantsd_instance=False) | ||
|
||
def _forward_signal_with_timeout(self, signum, signame): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to rename this helper to _forward_signal
since it no longer deals with timeouts.
def _backoff(attempt): | ||
"""Minimal backoff strategy for daemon restarts.""" | ||
time.sleep(attempt + (attempt - 1)) | ||
|
||
def run(self) -> ExitCode: | ||
"""Runs pants remotely with retry and recovery for nascent executions.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe lop off the second half of the doc sentence.
pantsd_signal_handler = PailgunClientSignalHandler(pid=pid) | ||
|
||
with ExceptionSink.trapped_signals(pantsd_signal_handler), STTYSettings.preserved(): | ||
return cast(int, rust_nailgun_client.execute(signal_fn, command, args, modified_env)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to stick to the alias and cast to ExitCode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @gshuflin !
src/rust/engine/nailgun/src/lib.rs
Outdated
@@ -27,6 +27,8 @@ | |||
// Arc<Mutex> can be more clear than needing to grok Orderings: | |||
#![allow(clippy::mutex_atomic)] | |||
#![type_length_limit = "2058438"] | |||
#![allow(dead_code)] | |||
#![allow(unused_variables)] | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is already pretty dense: it would be good to split it into client.rs
and server.rs
probably, loaded by a mostly empty lib.rs
.
src/rust/engine/nailgun/src/lib.rs
Outdated
#![allow(dead_code)] | ||
#![allow(unused_variables)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should be removed in a few different files where they were added.
for child_process in children: | ||
child_process.send_signal(signal.SIGINT) | ||
|
||
def handle_sigint(self, signum: int, _frame): | ||
logger.warning("Calling handle_sigint in pantsd") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any process that loads the ExceptionSink will end up with this handler installed, so this message might not always be accurate.
reason=KeyboardInterrupt("Sending user interrupt to pantsd"), | ||
) | ||
self._pailgun_client.maybe_send_signal(signum) | ||
ExceptionSink._signal_sent = signum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than reaching out to do this explicitly, should this handler be calling super()
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think it's a mistake for PailgunClientSignalHandler
and SignalHandler
to have a subclass-superclass relationship actually. If we were going to keep this code long term I would probably want to rewrite SignalHandler
as an "interface"-like superclass and have the remote and non-remote cases subclass it separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. You'll probably want to expose some other API to set the signal then, rather than accessing the private field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also recommend adding a comment docstring to ExceptionSink._signal_sent
, as we appear to be depending on it here, and it appears to be the only class-level field that does not have a comment to document it.
deprecation_start_version="", | ||
help='["DEPRECATED: the pailgun client has been rewritten to no longer use this"].' | ||
"The length of time (in seconds) to wait for further output after sending a " | ||
"signal to the remote pantsd process before killing it.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deprecation_start_version="", | |
help='["DEPRECATED: the pailgun client has been rewritten to no longer use this"].' | |
"The length of time (in seconds) to wait for further output after sending a " | |
"signal to the remote pantsd process before killing it.", | |
removal_version="2.1.0.dev0", | |
removal_hint="the pailgun client has been rewritten to no longer use this", | |
help="The length of time (in seconds) to wait for further output after sending a " | |
"signal to the remote pantsd process before killing it.", |
src/rust/engine/nailgun/src/lib.rs
Outdated
Either::Right((exited, execution_result_future)) => { | ||
Err("Exiting nailgun client future via explicit quit message".to_string()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, this is handling the case where ExceptionSink._signal_sent
ends up set?
This is an interesting situation, and rolls back to what we had discussed about replacing the weird python nailgun "timeout" thing with "send SIGINT to the server the first time someone ctrl+c's and SIGTERM the second time". I think that the exact mechanism that might be good here is:
- when the client receives the first SIGINT, it should send SIGINT to the server, but not kill itself... this will allow it to continue to receive data as the server shuts down (and in a heartbeat future, for it to drain the socket after the server shuts down cleanly when heartbeats stop arriving)
- when the client receives the second SIGINT, it should SIGTERM to the server, and then set this exit condition (and in a heartbeat future, only kill itself without waiting for the server to notice, and rely on the client going away causing the server to tear down).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this makes sense, but I'd rather implement it as part of the commit that implements heartbeats, in the interest of getting this (now rather lengthy) PR merged.
@@ -33,6 +33,8 @@ | |||
clippy::transmute_ptr_to_ptr, | |||
clippy::zero_ptr | |||
)] | |||
#![allow(dead_code)] | |||
#![allow(unused_variables)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, here and elsewhere.
data executor: PyExecutor; | ||
data port: u16; | ||
|
||
def execute(&self, py_signal_fn: PyObject, command: PyString, args: PyList, env: PyDict) -> CPyResult<PyInt> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cpython
crate can do more complicated argument transforms (via FromPyObject: note the implementations), so I think that this signature can be at least:
def execute(&self, py_signal_fn: PyObject, command: PyString, args: PyList, env: PyDict) -> CPyResult<PyInt> { | |
def execute(&self, py_signal_fn: PyObject, command: String, args: Vec<String>, env: PyDict) -> CPyResult<PyInt> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok: then regarding my comment here: #10865 (comment), it sounds like the massaging into rust types is already done, and you can probably ignore that comment safely.
let output = loop { | ||
let event = receiver.recv_timeout(timeout); | ||
if let Some(_termination) = maybe_break_execution_loop(&python_signal_fn) { | ||
break Err("Quitting becuase of explicit interrupt".to_string()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/becuase/because/
let _spawned_fut = executor.spawn(async move { | ||
let exit_code = nailgun_fut.await; | ||
let _ = sender.send(exit_code); | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should not need to spawn a task. If you got an error like "not running on the tokio
runtime", you should "enter" the runtime (which sets some thread-local state that tokio uses to find the runtime):
let result = executor.enter(|| {
// ... do some stuff on the runtime
})
But also, using executor.block_on
should enter the runtime... so I think that you might be able to replace most of this function body with:
let exit_code = executor.block_on(nailgun_fut);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a few comments asking about how we know that some futures won't be leaked, and a few asking for a TODO to wrap up the tty/no tty complexity into an enum
. Feel free to answer or not, and to add TODOs or not, at your own discretion! ^_^
I love this work, it looks great, and the above comments were mostly trying to make it easier for me or others to work with and extend it once we have our great rust client! None of them should be considered a blocking change.
@@ -59,14 +59,16 @@ def _toggle_ignoring_sigint(self, toggle: bool) -> None: | |||
with self._ignore_sigint_lock: | |||
self._ignoring_sigint = toggle | |||
|
|||
def handle_sigint(self, signum: int, _frame): | |||
def _send_signal_to_children(self, received_signal: int) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we have a tiny docstring here? Something like:
def _send_signal_to_children(self, received_signal: int) -> None:
"""Send a signal to any children of this process in order.
Pants may have spawned 0, 1, or multiple subprocesses via Python or Rust.
Upon receiving a signal, this method is invoked to propagate the signal to all children,
regardless of how they were spawned.
"""
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _send_signal_to_children(self, received_signal: int) -> None: | |
def _send_signal_to_children(self, signum: int, signame: str) -> None: |
This would allow us to include the direct signal name in the debug message, instead of having to look it up.
@@ -635,6 +635,8 @@ def register_bootstrap_options(cls, register): | |||
advanced=True, | |||
type=float, | |||
default=5.0, | |||
removal_version="2.1.0.dev0", | |||
removal_hint="The pailgun client has been rewritten to no longer use this", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removal_hint="The pailgun client has been rewritten to no longer use this", | |
removal_hint="The pailgun client now uses a more robust API which avoids the need for any timeout logic.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we could also remove the TODO(#7514)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'll have to reword this again since cf. stu's above comment we do need some kind of timeout logic after all.
self_process = psutil.Process() | ||
children = self_process.children() | ||
logger.debug(f"Sending SIGINT to child processes: {children}") | ||
logger.debug(f"Sending signal number {received_signal} to child processes: {children}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.debug(f"Sending signal number {received_signal} to child processes: {children}") | |
logger.debug(f"Sending signal {signame} ({signum}) to child processes: {children}") |
ExceptionSink._signal_sent = signum | ||
self._send_signal_to_children(signum) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._send_signal_to_children(signum) | |
self._send_signal_to_children(signum, 'SIGINT') |
For example.
reason=KeyboardInterrupt("Sending user interrupt to pantsd"), | ||
) | ||
self._pailgun_client.maybe_send_signal(signum) | ||
ExceptionSink._signal_sent = signum |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also recommend adding a comment docstring to ExceptionSink._signal_sent
, as we appear to be depending on it here, and it appears to be the only class-level field that does not have a comment to document it.
} | ||
|
||
/// | ||
/// Corresponds to `ttynames_to_env` in `nailgun_protocol.py`. See this struct's rustdocs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "corresponds" mean that it reads the env vars which are set in that method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No idea, I'm just moving this code around. I'd prefer to keep any proposed changes to server.rs
out of scope for this PR, since this is just moving code that used to exist in lib.rs
into a separate file.
} | ||
}; | ||
debug!("Sending message to nailgun client task to exit."); | ||
match exit_sender.send(()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be useful to add a TODO which wraps up the exit sending here. From looking at the code, it's not immediately clear to me what exit_sender.send())
does compared to seeing sender.send(exit_code)
above. Maybe just comments or renaming these variables might be helpful, but it's confusing to see the exit_sender
not sending the exit_code
, for example.
@@ -715,7 +810,7 @@ fn nailgun_server_await_shutdown( | |||
executor_ptr: PyExecutor, | |||
nailgun_server_ptr: PyNailgunServer, | |||
) -> PyUnitResult { | |||
with_executor(py, executor_ptr, |executor| { | |||
with_executor(py, &executor_ptr, |executor| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Thank you for looking into this!
@@ -29,7 +29,7 @@ def test_pantsd_file_logging(self) -> None: | |||
["--backend-packages=pants.backend.python", "list", "3rdparty::"] | |||
) | |||
ctx.checker.assert_started() | |||
assert "[DEBUG] connecting to pantsd on port" in daemon_run.stderr_data | |||
assert "[DEBUG] Connecting to pantsd on port" in daemon_run.stderr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good change! Thanks!
data executor: PyExecutor; | ||
data port: u16; | ||
|
||
def execute(&self, py_signal_fn: PyObject, command: String, args: Vec<String>, env: PyDict) -> CPyResult<PyInt> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider moving the body of this method into a private function just above, e.g.
fn _execute_nailgun_client(client: &PyNailgunClient, py_signal_fn: Value, command: String, args: Vec<String>, env: Vec<String, String>) -> i64 { ... }
The benefit of that is that it would decouple the "massage cpython objects into rust" part from the nailgun execution. Since this is within a macro definition, it seems like there is benefit to reducing the amount of logic defined inline. Totally subjective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is probably fine, see Stu's comment on argument transforms from cpython here: #10865 (comment).
5a639ca
to
d777079
Compare
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
[ci skip-build-wheels]
[ci skip-build-wheels]
# Building wheels and fs_util will be skipped. Delete if not intended. [ci skip-build-wheels]
This reverts commit ee50791. [ci skip-build-wheels]
Internal only changes left off from the changelog: * Use cpython types in Rust functions that manipulate python objects (#10942) `PR #10942 <https://github.com/pantsbuild/pants/pull/10942>`_ * update libz-sys version to fix macOS compile error (#10941) `PR #10941 <https://github.com/pantsbuild/pants/pull/10941>`_ * Upgrade to Rust stable 1.47.0. (#10933) `PR #10933 <https://github.com/pantsbuild/pants/pull/10933>`_ * Finish CreateDigest Directory cleanup. (#10935) `PR #10935 <https://github.com/pantsbuild/pants/pull/10935>`_ * Hotfix broken import from merge conflict (#10934) `PR #10934 <https://github.com/pantsbuild/pants/pull/10934>`_ * Revert "Port nailgun client to rust (#10865)" (#10929) `PR #10929 <https://github.com/pantsbuild/pants/pull/10929>`_ * An ExternalTool for downloading the grpc_python_plugin. (#10927) `PR #10927 <https://github.com/pantsbuild/pants/pull/10927>`_ * Port nailgun client to rust (#10865) `PR #10865 <https://github.com/pantsbuild/pants/pull/10865>`_ * print stacktraces during import errors (#10906) `PR #10906 <https://github.com/pantsbuild/pants/pull/10906>`_ * fs.Digest is declared in Rust (#10905) `PR #10905 <https://github.com/pantsbuild/pants/pull/10905>`_ * add requests_ca_bundle to settable_env_vars (#10909) `PR #10909 <https://github.com/pantsbuild/pants/pull/10909>`_ [ci skip-rust]
…sbuild#10929)" This reverts commit dbf744f. [ci skip-build-wheels]
### Problem #10865 previously landed to port Pants' nailgun client to Rust. It was reverted in #10929 due to an issue with TTY access, where (in particular), the `repl` goal was mostly unresponsive. ### Solution The unresponsive `repl` was due to a bug in the `nails` library, where `stdin` was being consumed eagerly regardless of whether the server signaled that it would like to receive `stdin`. Pants sends an environment variable to the server that indicates which TTY the client is connected to, and the server will directly connect to that TTY if it can. When the server directly connects to the client's TTY, it does not accept `stdin`, but since `stdin` was read eagerly by the `nails` client (and ending up stuck in a buffer, since the server would not request it), the result was two different processes reading `stdin` from the TTY: the client, and the server. Bump to `nails` `0.7.0`, which [makes `stdin` initialization lazy](stuhood/nails@6b8c19a).
This commit ports the nailgun client used in
remote_pants_runner.py
from Python to Rust, using thenails
crate. This is in preparation for using the Rust nailgun library's heartbeat functionality to replace the current UNIX signal-based mechanism for the client to inform pantsd that a given pants run has been stopped, although this commit does not yet remove that signal-based code.Removing the Python version of the client allows a lot of Python code around the client side of the nailgun protocol to be removed, and also results in the removal of the logic for retrying connecting to the server on failure. Since a lot of this code is old, and predates the current design of pantsd, I think it makes sense to remove this code now, and add back any retry logic we see as still necessary in future commits.