Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prove the correctness of a work-stealing deque #26

Open
wants to merge 34 commits into
base: master
Choose a base branch
from

Conversation

jeehoonkang
Copy link
Contributor

@jeehoonkang jeehoonkang commented Jan 8, 2018

Rendered
Implementation

This is a linearization proof of the Chase-Lev work-stealing deque. To the best of my knowledge, it is the first publicly available linearization proof (attempt) of the Chase-Lev deque :)

@ghost
Copy link

ghost commented Jan 8, 2018

Hey, this looks amazing! :) I haven't read the proof in detail yet, but here go a few quick comments/questions:

  1. Are you going to eventually turn this RFC into a research paper?
  2. Is anyone outside the Crossbeam project aware of this proof? If so, have they read/verified it?
  3. Do you have any benchmark numbers comparing the current and the new deque implementation?
  4. Have you tried verifying the new implementation using CDS Checker? I believe it'd be easy to modify this C++ code to use the new memory orderings and run the checker on it.

@jeehoonkang
Copy link
Contributor Author

jeehoonkang commented Jan 9, 2018

I'd like to write a paper out of https://github.com/jeehoonkang/crossbeam-rfcs/blob/deque-proof/text/2017-07-23-relaxed-memory.md and this. Some of my colleagues in academia (including my supervisor) are also reading this proof.

I don't have any benchmark numbers yet.. I'll prepare soon before merging crossbeam-rs/crossbeam-deque#2.

I couldn't properly run CDSChecker.. I followed the instruction in https://github.com/computersforpeace/model-checker, but it always say OUT OF BOOTSTRAP MEMORY when running ./run.sh. Do you have any idea?

pub fn steal(&self) -> Option<T> {
'L401: let mut t = self.top.load(Relaxed);

'L402: let guard = epoch::pin_fence(); // epoch::pin(), but forces fence(SeqCst)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between epoch::pin_fence(); // epoch::pin(), but forces fence(SeqCst) and epoch::pin? pin already uses fence(SeqCst), but on x86 it uses lock cmpxchg instead, is that the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

epoch::pin() is reentrant, and if it is re-entering, it doesn't issue fence(SeqCst). Here, I needed to issue regardless of whether it is re-entering or not. Maybe it's not clear from the text. I'll revise it.


In the C/C++11 standards, if two threads race on a non-atomic object, i.e. they concurrently access
it and at least one of them writes to it, then the program's behavior is undefined. Unfortunately,
in fact, `fn push(): 'L107` and `fn steal(): 'L408` may race on the contents of the `buffer`. For
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it definitely push: 107 and steal: 408 that race? They are both reads.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! it's L109 and L409. Thanks!

Copy link
Member

@Vtec234 Vtec234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for laying formal foundations for Crossbeam!

@jeehoonkang
Copy link
Contributor Author

jeehoonkang commented Jan 9, 2018

I just fixed a few bugs, but I don't think all the bugs are squashed right now. I'll reread the proof 2-3 more times.

@jeehoonkang jeehoonkang force-pushed the deque-proof branch 2 times, most recently from d54a0b4 to 8cf91e5 Compare January 13, 2018 07:13
@jeehoonkang
Copy link
Contributor Author

I think the RFC itself is now ready to be merged. I checked the proof several times, and updated it. @stjepang please have a look!

Though the implementation needs some more work:

  • Benchmark: Currently crossbeam-deque doesn't have any benchmark. It's better to make one, and see how this RFC improves (or degrades!) the performance.
  • Model checking: it's better to run CDSChecker or other model checkers against the proposed implementation.

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like I understood the overall strategy of the proof, but don't feel competent to fully grasp it and vouch for its correctness. This new orderings seem like a pure win over the current implementation, especially so on weakly-ordered architectures. The only blocker is some kind of affirmation that the proof is correct - either through formal verification or through peer review.

@jeehoonkang, how confident are you about this proof? Do you think we should wait for further review/verification? Keep in mind crossbeam-deque is supposed to eventually go into Firefox as a Rayon dependency, so it's kind of important that we get it right. In any case, I'll trust your judgment. :)

I really like the pin_fence function - seems like a much nicer solution that my if epoch::is_pinned() { fence(SeqCst) } hack. If you agree, I'd like to delete is_pinned and replace it with pin_fence.

is necessary because `fn push(): 'L109` and `fn steal(): 'L409` may concurrently access the contents
of the `buffer`, while the former is writing to it. For example, the scheduler may stop a `steal()`
invocation right after `'L402` so that `t` read in `'L401` may be arbitrarily stale. Now, suppose
that in a concurrent `push()` invocation, `b` equals to `t + buffer.get_capacity()` and it is
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be t + buffer.get_capacity() - 1? Because if it is equal to t + buffer.get_capacity() that means the deque is full and the push operation cannot proceed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

push() first writes to buffer[b % sizeof(buffer)] and then increases bottom, so the case I'd like to consider here is b = t + sizeof(buffer).

It is possible in the presence of resizing, because the top variable can be advanced while the stealer is stalled by the scheduler, so that t (the old value) is much less than top (the current value).

the contents inside a buffer is always accessed modulo the buffer's capacity (`'L109`, `'L211`,
`'L409`) and the buffer's size is always nonzero, there are no buffer overruns.

Thus it remains to prove that the buffer is not used after freed. Thanks to Crossbeam, we don't need
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/used after freed/used after it has been freed/

We will insert the invocations in `G_i` between `O_i` and `O_(i+1)`. Inside a group `G_i`, we give
the linearization order as follows:

- Let `STEAL^x` be the set of steal invocations tat stole an element at the index `x`, and
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/tat/that/

succeeds or fails, `O_i` reads or writes `top >= x`.

It is worth nothing that for this lemma to hold, it is necessary for the CAS at `'L213` to be
strong, i.e. it the CAS does not spuriously fail.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/it the CAS/the CAS/

the companion implementation. This C11 requirement may be fail-safe for most use cases, but can
actually be slightly inefficient in this case.

It is worth nothing that the CAS at `'L213` should be strong. Otherwise, a similar execution to the
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/nothing/noting/

@jeehoonkang
Copy link
Contributor Author

jeehoonkang commented Jan 18, 2018

@stjepang Thanks! I revised the document as you mentioned.

  • I'm confident with this proof. But since it will be deployed in production (Firefox), it's better to peer-review this proof, at least. Before merging it, I'd like to wait for my supervisor @gilhur to finish reading this proof.

  • This new orderings seem like a pure win over the current implementation, especially so on weakly-ordered architectures.

    I thought the same thing, but it turns out that the original and the new version makes almost identical x86 and ARM binary. In particular, I thought that a CAS w/ release/acquire orderings for success/failure cases is more efficient than CAS w/ seqcst/relaxed, but they are actually compiled to the same instruction in x86 and ARMv8. (Also in ARMv7 for GCC, while theoretically the former can be more efficiently compiled).

    As you suggested in IRC, I'll benchmark this branch w/ rayon.breadth_first(). I don't expect a huge win in even weaker architectures, though.

    Still I believe it is good to merge this branch, because it'll give me a better chance for paper acceptance it more clearly reveals the synchronizations conducted in this data structure. In particular, in my opinion, it's hard to understand the meaning of SeqCst load/store/rmw. What do you think?

  • If you agree, I'd like to delete is_pinned and replace it with pin_fence.

    I think we can do it without an RFC, since it doesn't change the existing API. But maybe pin_fence is not the best name for it. Here I'll trust your judgment :)

Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, sounds good! Let us know when your supervisor approves it.

I've tried fiddling with CDSChecker and didn't find any bugs with your memory orderings. Can you try running the checker as well? Here's what you need to do:

  1. git clone git://demsky.eecs.uci.edu/model-checker.git
  2. cd model-checker
  3. git clone git://demsky.eecs.uci.edu/model-checker-benchmarks.git benchmarks
  4. make
  5. make benchmarks
  6. ./benchmarks/chase-lev-deque-bugfix/main
  7. Edit ./chase-lev-deque-bugfix/deque.c and relax the orderings
  8. make benchmarks
  9. ./benchmarks/chase-lev-deque-bugfix/main

Also try playing with these flags:

  • ./benchmarks/chase-lev-deque-bugfix/main -m 2 -y
  • ./benchmarks/chase-lev-deque-bugfix/main -m 2 -f 10
  • ./benchmarks/chase-lev-deque-bugfix/main -m 50 -f 50

@ghost
Copy link

ghost commented Jun 16, 2018

@jeehoonkang Any news on this issue? How's the review of this proof going?

@jeehoonkang
Copy link
Contributor Author

Sorry for inactivity these days. My colleage and I are working on a different paper, so revising this proof had been postponed probably until mid July. I guess the merge of this PR is not urgent, right? I'll definitely come back and merge this PR :)

@ghost
Copy link

ghost commented Jun 17, 2018

Just checking, take your time. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants