Arc::strong_count memory ordering is a potential footgun #117485
Comments
This can definitely be added to the docs, but it was intentional. The argument is that you shouldn't be synchronizing on the strong_count (you can't write a correct …). I know your code is probably convoluted for an example, but I'd argue you should use …
My use case was something like … Regarding busy waiting, the code would still be wrong even if we used some kind of synchronized waiting inside the loop (example), since the final relaxed load and the non-atomic access after it can be reordered. To fix the code, we need a barrier between the final relaxed load and that access.
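As an illustration of that last point, here is a minimal sketch with hypothetical names (not the actual code from this thread): the Acquire fence sits between the final strong_count check and the non-atomic access, and pairs with the Release decrement in the other thread's drop.

```rust
use std::sync::atomic::{fence, Ordering};
use std::sync::Arc;
use std::thread;

// Hypothetical shared state, standing in for the real data structure.
static mut DATA: i32 = 0;

fn main() {
    let a = Arc::new(());
    let b = Arc::clone(&a);

    let t = thread::spawn(move || {
        unsafe { DATA = 10 }; // non-atomic write, protected only by Arc ownership
        drop(b); // Release decrement of the strong count
    });

    // Wait (spinning only for brevity) until the other clone is gone. These
    // loads are Relaxed since #115546, so on their own they do not order this
    // thread after the write above.
    while Arc::strong_count(&a) != 1 {
        thread::yield_now();
    }

    // The barrier between the final Relaxed load and the non-atomic access:
    // an Acquire fence that pairs with the Release decrement in `drop`.
    fence(Ordering::Acquire);
    unsafe { DATA += 1 };

    t.join().unwrap();
    assert_eq!(unsafe { DATA }, 11);
}
```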
Hmmm, this makes me wonder if there should be an Arc without weak support. We'd need to check what other methods have their performance hindered by weak references. Overall, I think the use case makes sense, but having counts be relaxed is more flexible. I'd maybe argue using a fence is clearer code too. My main concern would be performance, but as far as I understand a good microarchitecture won't have any trouble with an acquire fence.
This change has actually broken one of our older services that was getting a long overdue update and was relying on …
Just to add another data point: the change to … I'm not using that function myself, and I don't know if anyone is, so it might not be a problem at all. I was planning to replace …
Would docs suggesting an acquire fence if the count is equal to 1 help? (Assuming unsynchronized reads at that point.) The is_abandoned() function seems like a good example of where this change can lead to better performance (since you only need a fence when the other thread is dead, which will happen once, as opposed to a bunch of times when it's not dead).
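Something along those lines, sketched as a hypothetical free function (this is not rtrb's actual is_abandoned): the fence is only issued on the rare path where the count has already dropped to 1, so the common path stays a single Relaxed load.

```rust
use std::sync::atomic::{fence, Ordering};
use std::sync::Arc;

/// Hypothetical sketch: returns `true` once the other side has dropped its
/// clone of `shared`. The Acquire fence pairs with the Release decrement
/// performed by that drop, so everything the other side did beforehand is
/// visible to the caller.
fn is_abandoned<T>(shared: &Arc<T>) -> bool {
    if Arc::strong_count(shared) == 1 {
        fence(Ordering::Acquire);
        true
    } else {
        false
    }
}

fn main() {
    let a = Arc::new(0u8);
    assert!(is_abandoned(&a)); // only one owner, so the count is already 1
    let b = Arc::clone(&a);
    assert!(!is_abandoned(&a));
    drop(b);
    assert!(is_abandoned(&a));
}
```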
I'm not very familiar with the use of fences, so I'm not sure ... Can you please clarify what you mean? In my case, the count is only ever 2 or 1. If it's 1, it means that the other thread has been abandoned. I tried putting an … Admittedly, I'm using an …
I just said: …
But then I stumbled upon some notes I made earlier (mgeier/rtrb#75 (comment)) which talk about ThreadSanitizer creating false positives when using fences, see also #65097. But this would make it hard for me to know if I'm using the fence correctly ... and is there a fallback implementation for when using ThreadSanitizer?
I trust miri:

```diff
diff --git a/tests/lib.rs b/tests/lib.rs
index 9afd61d..7abf893 100644
--- a/tests/lib.rs
+++ b/tests/lib.rs
@@ -146,7 +146,9 @@ fn no_race_with_is_abandoned() {
         unsafe { V = 10 };
         drop(p);
     });
+    while !c.is_abandoned() {}
     if c.is_abandoned() {
+        std::sync::atomic::fence(std::sync::atomic::Ordering::Acquire);
         unsafe { V = 20 };
     } else {
         // This is not synchronized, both Miri and ThreadSanitizer should detect a data race:
```

If you want, you can move the fence inside … Also, the code above is still "wrong" in that it relies on this line from rust/library/alloc/src/sync.rs (line 2427 in bf71dae): …

To see this, move the drop before the unsafe and you'll get UB again (which can't be fixed with fences, because now there's no condition variable to spin on and guarantee the fence gets executed).
Thanks for the clarification!
That's good to know. However, I would be interested in not spinning. I would also like to generally be able to put something into the …
But miri didn't detect the data race in the first place!
In case of the … But anyway, that's something I can implement without …
The fence will work fine without spinning, but you'll never get true for the if statement using miri without it. Try my patch with miri and it will fail if you comment out the fence.
For push and pop, absolutely. But agree to disagree on the flag. Now of course if you document the flag as explicitly providing some synchronization, then sure.
That's good to know! I'm trusting you, but there is currently no tool that would allow checking that, right? Miri seems to not detect the lack of the fence, and ThreadSanitizer seems to produce false positives when it is there.
OK, but is this a shortcoming of Miri (which might be fixed some day) or is this intentional?
Yes, thanks, that behaves as expected, but I'd be interested in testing the non-spinning use case.
You shouldn't. 😉
Don't think so, no. But you can replace the dependency spinning with this: … and use the … I think this is a pretty reasonable model of other people's code: I know there's a for loop in there which suggests spinning, but we're not touching rtrb code and it's a constant-time loop.
Yeah, Miri (unlike tools like loom) is an interpreter, so it only runs through a single execution path.
I filed rust-lang/miri#3538. We should be able to get rid of the loop and have a single …
Ok we already have the solution from the other issue: slap a yield_now in there and run with …
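For what it's worth, a minimal self-contained sketch of the idea (the exact flag mentioned above isn't reproduced here, and the loop stands in for the one from the patch): the yield_now call is a scheduling point, giving Miri's scheduler a chance to switch to the other thread inside the loop.

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    let a = Arc::new(());
    let b = Arc::clone(&a);
    let t = thread::spawn(move || drop(b));

    // Spin until the other clone is dropped, yielding so the scheduler
    // (Miri's included) gets a chance to run the spawned thread.
    while Arc::strong_count(&a) != 1 {
        thread::yield_now();
    }

    t.join().unwrap();
}
```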
@lukas-code did Miri originally find this issue? We're always looking for things to add to the "issues found by Miri" list :)
Note that you should test with a bunch of seeds to be sure of the fix, since evidently not every seed reproduces the issue.
Yes it was found by miri! Unfortunately I don't have anything good to put on the "issues found by Miri" list. The code originally came from a university assignment where we had to implement an SPSC, and after finding out about this problem, I rewrote my code entirely to not rely on … Interestingly, mgeier/rtrb#114 looks like it's the exact same issue as I had, but in a real SPSC: unsynchronized access to the shared data by the reader after the writer has been dropped.
Thanks for all your help @SUPERCILEX! However, when I try it locally on my computer, Miri doesn't detect the data race with … I tried all values from 0 to 255 as suggested in the Miri docs, and those values do detect the data race: …
So what should I do in my tests? And should I use …
When concurrency is involved, bugs can often only be found probabilistically, and that applies to Miri as well. There's no general answer for which flags have the highest chance of finding any particular bug; it depends on your code. When there's a 50% chance of finding a bug, around half the seeds will hit it; which exact seeds can find any particular bug changes with each nightly version and even when you update completely unrelated dependencies. One common thing to do is run the tests in a loop inside the code. Ideally you should use new atomic variables for each loop iteration (i.e., not just the same …). One day …
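A sketch of that advice (a hypothetical test, not taken from rtrb): repeat the scenario in a loop and build fresh shared state, including fresh atomics, on every iteration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

#[test]
fn repeated_concurrency_test() {
    for _ in 0..32 {
        // Fresh atomic (and other shared state) for every iteration, rather
        // than reusing one static across iterations.
        let counter = Arc::new(AtomicUsize::new(0));
        let c2 = Arc::clone(&counter);

        let t = thread::spawn(move || {
            c2.fetch_add(1, Ordering::Relaxed);
        });
        counter.fetch_add(1, Ordering::Relaxed);
        t.join().unwrap();

        assert_eq!(counter.load(Ordering::Relaxed), 2);
    }
}
```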
Thanks @RalfJung for the explanations, that's very helpful!
I wasn't aware of that! In case anyone is interested, I have updated my tests according to the recommendations: mgeier/rtrb#125
I would in general recommend not setting this flag for concurrency tests -- it makes tests more reproducible, but it also makes some issues entirely impossible to find. So for some bugs the probability they can be found goes up without preemption, but for other bugs the probability then goes to 0.
I think this should be closed in favor of #126239. We should document what synchronization guarantees each construct has, but it's unreasonable to put warnings in all the documentation for every construct where users might have assumed (probably by peeking into the implementation and seeing Acquire/Release/SeqCst/etc. orderings being used) something that isn't guaranteed.
In #115546 the memory ordering of {Arc,Weak}::{strong,weak}_count got changed to Relaxed, which can cause some hard-to-debug race bugs.

I tried this code: …

I expected to see this happen: I expected the check strong_count == 0 to be strong enough to avoid a data race in this program. My intuition was that this check implies that no other thread can access the data concurrently.

Instead, this happened: With the fence commented out, the above program has undefined behavior. Miri outputs: …

I think we should either put the Acquire ordering back or add a warning to the docs that such code requires a manual fence.

Meta
1.74 = current beta

cc @SUPERCILEX author of #115546
@rustbot label T-libs A-atomic
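The reproducer itself isn't included in this copy of the issue. Purely as a rough illustration of the pattern described (hypothetical code, not the original): a non-atomic write guarded only by an Arc's lifetime, and a reader that treats strong_count == 0 on a Weak as proof that the writer is done; removing the fence makes the final read a data race.

```rust
use std::sync::atomic::{fence, Ordering};
use std::sync::Arc;
use std::thread;

static mut V: i32 = 0;

fn main() {
    let arc = Arc::new(());
    let weak = Arc::downgrade(&arc);

    let t = thread::spawn(move || {
        unsafe { V = 10 }; // non-atomic write
        drop(arc); // Release decrement; the last strong reference goes away
    });

    // strong_count() loads with Relaxed ordering, so observing 0 here does not
    // by itself order this thread after the write above.
    while weak.strong_count() != 0 {
        thread::yield_now();
    }

    // Removing this fence makes the read below a data race (undefined
    // behavior), which Miri reports.
    fence(Ordering::Acquire);
    assert_eq!(unsafe { V }, 10);

    t.join().unwrap();
}
```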