Counters can have dangling pointers to workunits #11548
Comments
Interesting. So as an example, to increment counters based on a result of some operation, the closure could
Also, how should this handle counter increments outside of the direct call to Could or should there be a helper to retrieve the currently-active workunit from a task-local? (similar to what is done for workunit state)
Not quite... the closure itself cannot await anything. The future it returns can though. The lifetimes should work out because
This would not allow the compiler to check the lifetime, I don't think. A
(Stu's answers are more accurate.)
Yeah, pass it down to whatever helper functions are necessary downstream. Ack that this is less ergonomic, but similar to the arguments of Rust vs. C, that's worth it imo so that we don't need to reason about the correctness of the workunit lifetime.
I guess I treat an async closure as the same as a closure which can return an async block and do its work in that async block. "Same difference."
Right, for example, pants/src/rust/engine/fs/store/src/local.rs Lines 303 to 308 in 9dce4b4
Would
To clarify, for cases where it is not "easy" to pass in the workunit directly. The
No, it suffers from the same dangling pointer issue. You would be able to get a
Solving this is fairly challenging due to how lifetimes work with async closures, unfortunately. I need to put it down for now (and @Eric-Arellano has a workaround). A few potential approaches:
As found in #11548, the remote cache write counters were missing because they referred to a workunit that was already complete. This works around that by adding a new workunit in the async block. We also add a new workunit for local cache reads. Technically, we don't need this, but it will be necessary if/when we land the proposed fix in #11548 because we won't have a way of passing the parent workunit to the `CommandRunner.run()` method, given its trait signature that we can't change.
Add an in_workunit! macro to allow for mutable access to the created workunit while it runs. Fixes #11548. [ci skip-build-wheels] Co-authored-by: Eric Arellano
Problem
#11479 made cache writes async. The counters work when cache writes start, but they are missing when cache writes finish and when cache writes error:
pants/src/rust/engine/process_execution/src/remote_cache.rs
Lines 478 to 510 in 730417e
We know the cache writes are actually finishing. When adding log statements, 28 of 29 cache writes finished before the end of the Pants run (no pantsd) in a test with local remote caching. Further, in production, when we rerun a CI shard, the remote cache gets used.
We also know that the global counters hashmap is being incremented.
pants/src/rust/engine/workunit_store/src/lib.rs
Lines 872 to 874 in 730417e
Adding this diff to `WorkunitStore::increment_counter()`:
Results in logs like:
Interestingly, the `RemoteCacheWriteFinished` entries are persisting in the hashmap, whereas the other counters are being removed. This log comes near the end of the run, where we have already encountered `LocalExecutionRequests` and `RemoteCacheWriteStarted` >15 times. Those are being removed because of this code, which moves the counters from the global hashmap to the specific workunit:
pants/src/rust/engine/workunit_store/src/lib.rs
Lines 605 to 615 in 730417e
In contrast, the cache write finish metrics are never consumed. This is because the workunit associated with the async task has already completed, so nothing tries to consume the counter. Indeed, wrapping the spawned task in a new workunit fixes this:
While we can fix this by applying that diff, this speaks to a gotcha. It is possible to increment counters that refer to a dangling workunit, so they will never be consumed. Currently, the developer must manually reason about the lifecycle of the associated workunit, rather than leveraging Rust's lifetimes to guarantee the code will work.
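To make the gotcha concrete, here is a minimal, self-contained sketch of the failure mode. It is not the actual `WorkunitStore` code: `Store`, `SpanId`, `complete_workunit`, and `dangling_counter_demo` are hypothetical stand-ins, and the model assumes only what is described above, namely that counters sit in a shared hashmap until the owning workunit completes and drains them.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a workunit identifier.
type SpanId = u64;

/// Hypothetical, simplified model of a store with a shared counters hashmap.
#[derive(Default)]
struct Store {
    /// Counters waiting to be claimed, keyed by the workunit they were recorded against.
    pending: HashMap<SpanId, HashMap<&'static str, u64>>,
    /// Counters that were moved onto a workunit when it completed.
    completed: HashMap<SpanId, HashMap<&'static str, u64>>,
}

impl Store {
    fn increment_counter(&mut self, span: SpanId, name: &'static str, delta: u64) {
        *self.pending.entry(span).or_default().entry(name).or_insert(0) += delta;
    }

    /// Completing a workunit drains its pending counters, and does so only once.
    fn complete_workunit(&mut self, span: SpanId) {
        let counters = self.pending.remove(&span).unwrap_or_default();
        self.completed.insert(span, counters);
    }
}

/// Returns (counters attached to the workunit, counters left stranded in the shared map).
fn dangling_counter_demo() -> (HashMap<&'static str, u64>, HashMap<&'static str, u64>) {
    let mut store = Store::default();
    let span: SpanId = 1;

    // The cache write starts while the workunit is still running.
    store.increment_counter(span, "RemoteCacheWriteStarted", 1);

    // The workunit completes before the spawned write task finishes.
    store.complete_workunit(span);

    // The spawned task finishes later and increments against the completed workunit.
    store.increment_counter(span, "RemoteCacheWriteFinished", 1);

    let attached = store.completed.get(&span).cloned().unwrap_or_default();
    let stranded = store.pending.get(&span).cloned().unwrap_or_default();
    (attached, stranded)
}
```

In this model, the late increment compiles and runs without complaint, but it lands in `pending` after the only drain has already happened, which mirrors the persisting `RemoteCacheWriteFinished` entries described above.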
Rejected solution: global counters
We could fix this by not associating counters with a particular workunit; then, it wouldn't matter if the workunit has already finished.
We reject this because we expect it to be useful to associate counters with specific workunits.
Proposed solution
Leverage Rust's safety by using the type system to express that counters have the same lifecycle as their corresponding workunit. Rather than using a global hashmap, then linking it to the relevant workunit upon workunit completion, we would directly increment the counters on the workunit itself.
Workunits already store counters:
pants/src/rust/engine/workunit_store/src/lib.rs
Lines 72 to 79 in 730417e
Currently, call sites increment counters like this:
pants/src/rust/engine/process_execution/src/remote_cache.rs
Lines 498 to 500 in 730417e
This call then looks up the current workunit for that task and uses the global hashmap to try to automatically associate the counter with the correct workunit.
Instead, we would change `with_workunit` so that instead of taking a `Future` as an arg, it takes a function that takes a mutable reference to a `Workunit` and gives back a `Future`. In this `Future`, the caller can directly increment counters on the workunit. For example:
This guarantees that the lifecycle of the workunit and its corresponding metrics is the same, thanks to the Rust compiler's rules.