Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup and fix cycle handling #285

Merged
merged 86 commits into from
Jan 21, 2022
Merged

Conversation

nikomatsakis
Copy link
Member

@nikomatsakis nikomatsakis commented Oct 30, 2021

This PR is...kind of long. It reshapes the core logic of salsa to fix various bugs in cycle handling and generally simplify how we handle cross-thread coordination.

Best read commit by commit: every commit passes all tests, afaik.

The core bug I was taking aim at was the fact that, when you invoke maybe_changed_since, you can sometimes wind up detecting a cycle without having pushed some of the relevant queries onto the stack. This is now fixed.

From a user's POV, nothing changes from this PR, there are only minimal changes to the public interface. The biggest one is that recover functions now get a &salsa::Cycle which has methods for accessing the participants; the other is that queries that are participating in cycle fallback will use unwinding to avoid executing past the point where the cycle is discovered. Otherwise, things work the same as before:

  • If you encounter a cycle and all participant queries are marked with #[salsa::recover], then they will take on the recovery value. (At the moment, they continue executing after the cycle is observed, but their final result is ignored; I plan to change this in a follow-up PR, or maybe some future commit to this PR.)
  • If you encounter a cycle and some or all participants are NOT marked with #[salsa::recover], then the code panics. This is treated like any other panic, cancelling all other work.

Along the way, I made... a few... other changes:

  • Cross-thread handling is simplified. When we block on another thread, it no longer sends us a final result. Instead, it just gets re-awoken and then it retries the original request. This is helpful when you encounter cycles in maybe_changed_since vs read, but it's also more compatible with some of the directions I have in mind.
  • Cycle detection is simplified and more centrally coordinated. Previously, when a cycle was detected, we would mark all the participants on the current thread, but then we would mark other threads 'lazilly'. Now, threads move ownership of their stack into the shared dep graph when they block, so that we can mark all the stack frames at once. This also means less cloning on blocking, so it should be mildly more efficient.
  • The code is DRY-er, since maybe_changed_since has been re-implemented in terms of the same core building blocks as read (probe and friends). I originally tried to unify them, but I realized that they behave somewhat differently from one another and both of them make sense. (In particular, we want to be able to free values with the LRU cache while still checking if they are up to date.)

Ah, I realize now that I had planned to write a bunch of docs in the salsa book before I landed this. Well, I'm going to open the PR anyway, as I've let this branch go far too long.

r? @matklad

flodiebold and others added 30 commits October 30, 2021 11:07
If I am not in an error, all current tests uses "static" cycles -- a
cycle always present. Let's spice that up by adding conditional cycles:
cycles that appear only for specific impls
This allows us to figure out whether a query can recover
from a cycle (and how) without invoking the `recover`
function.
Find the cycle recovery strategy for a given DatabaseKey.
They now use signals to guarantee we are testing the code paths
we want to be testing.
Rather than checking return value of from `Q::cycle_fallback`, we
now consult the computed recovery strategy to decide whether to
panic or to recover. We can thus assume that we will successfully
recover and don't need to check for `None` results anymore.
Before we could not observe the case where:

* thread A is blocked on B
* cycle detected in thread B
* some participants are on thread A and have to be marked

(In particular, I commented out some code that seemed necessary and
didn't see any tests fail)
Instead of sending the result back, just have the waiting threads retry
reading the cache.
Being generic over the keys made code harder to read.
Make `add_edge` infallible
This will allow me to add condvar logic to it
We are going to make it so that the runtime
coordinates delivery of the WaitResults.
Thi sis an intermediate step towards having the runtime
coordinate wakeups.
Instead of creating a future for each edge
in the graph, we now have all dependent queries
block in the runtime and wake up whenever results
are published to see if their results are ready.

We could certainly allocate a CondVar for each dependent
query if we found that spurious wakeups were a problem.
I consider this highly unlikely in practice.
Currently, when one thread blocks on another, we clone the
stack from that task. This results in a lot of clones, but it also
means that we can't mark all the frames involved in a cycle atomically.
Instead, when we propagate information between threads, we also
propagate the participants of the cycle and so forth.

This branch *moves* the stack into the runtime while a thread
is blocked, and then moves it back out when the thread resumes.
This permits the runtime to mark all the cycle participants at once.
It also avoids cloning.
I was finding the parallel test setup hard to read,
so everything relating to one test is now in a single
file, with shorter names.
The previous version had a lot of spurious wakeups, but it
also resisted the refactoring I'm about to do. =)
Including the corner case where the active thread does not have
recovery.
@nikomatsakis
Copy link
Member Author

@matklad ok, the code now supports cycle fallback so long as any one query has it. I also added tests for all the wicked situations I could come up with. The main thing that's not done is updating the RFC and Salsa book to describe how the algorithm works.

@nikomatsakis
Copy link
Member Author

Although in the process of writing those comments I already found one bug (the fix for which is already part of these commits), so I wouldn't be surprised if doing more writing helped to flush out more bugs.

When a query Q invokes a cycle Q1...Q1 but Q is not a
participant in that cycle, Q should not recover! Test that.
@nikomatsakis
Copy link
Member Author

OK, the RFC is more up to date now.

src/runtime.rs Outdated Show resolved Hide resolved
@nikomatsakis
Copy link
Member Author

@matklad I believe I have addressed all of your review comments.

@nikomatsakis
Copy link
Member Author

I discussed this with @matklad and we agreed to merge this PR.

bors r+

@bors
Copy link
Contributor

bors bot commented Jan 21, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants