Various race conditions reported by TSAN #199

Closed
Civil opened this issue Jul 5, 2016 · 16 comments

Comments


Civil commented Jul 5, 2016

https://gist.github.com/Civil/e4aaffb76be7e216cc4f259e162b70b9 - TSAN report. Around 40 unique places where races happen.

Also please note that everything TSAN reports here is a real bug, e.g. writes performed while holding only a read mutex, atomic and non-atomic writes to the same value from different threads at the same time, etc.

So to be sure that everything is correct, it's an absolute must to fix ALL of them.

grobian (Owner) commented Jul 5, 2016

Right, so the gcc atomic instructions are races?
Most of them look like false positives, which the empirical evidence also suggests.

dgryski (Contributor) commented Jul 5, 2016

Any value accessed from multiple threads needs to be protected with a mutex, or it must always be accessed with atomic intrinsics.

TSAN does not have any false positives. If it detects a race, then there really was a race: two threads tried to access the same piece of memory, and at least one access was a write.

https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
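
To make that concrete, here is a minimal sketch of the two options; the counter and the helper names are illustrative, not taken from carbon-c-relay:

#include <pthread.h>
#include <stdatomic.h>

/* Option 1: every access to the shared value is atomic (C11 atomics). */
static atomic_long hits;

void count_hit_atomic(void) {
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

long read_hits_atomic(void) {
    return atomic_load_explicit(&hits, memory_order_relaxed);
}

/* Option 2: every access goes through the same mutex. */
static long hits_locked;
static pthread_mutex_t hits_lock = PTHREAD_MUTEX_INITIALIZER;

void count_hit_locked(void) {
    pthread_mutex_lock(&hits_lock);
    hits_locked++;
    pthread_mutex_unlock(&hits_lock);
}

long read_hits_locked(void) {
    pthread_mutex_lock(&hits_lock);
    long v = hits_locked;
    pthread_mutex_unlock(&hits_lock);
    return v;
}

Mixing the two styles on the same variable, or leaving some accesses plain, is exactly what TSAN flags.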

Civil (Author) commented Jul 5, 2016

Also please note that it takes some time to actually understand the traces. Everything that is about "gcc atomics access the value" is there mostly because there are places where the same value is accessed in a non-atomic way.

grobian (Owner) commented Jul 5, 2016

Yes, but unless I misunderstand what it warns about, in this case it doesn't matter that it races, because the code uses optimistic concurrency.
If a counter increment fails, meh. If an availability check fails while it could have succeeded, meh, better luck next time.

dgryski (Contributor) commented Jul 5, 2016

The compiler generates code that is correct only under the assumption that there is no concurrent access to the values. Higher optimization levels or newer compilers can easily spill random values to memory, as long as everything is back where it should be when the basic block is finished.

For example,

int done;

void func() {
    while (!done) {
        /* ... work that never touches done ... */
    }
}

Assuming there is no access to done inside the while loop, the compiler is justified in replacing while(!done) with while(1), since that is the behaviour with no concurrent access.
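
One way to make that flag well-defined, as a sketch using C11 <stdatomic.h> (purely illustrative; the relay itself uses the gcc builtins):

#include <stdatomic.h>

atomic_int done;

void func(void) {
    /* the atomic load tells the compiler (and TSAN) that done may be
       written concurrently, so the read cannot be hoisted out of the loop */
    while (!atomic_load_explicit(&done, memory_order_acquire)) {
        /* ... work that never touches done ... */
    }
}

void finish(void) {
    /* the writer side must use an atomic store as well */
    atomic_store_explicit(&done, 1, memory_order_release);
}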

Out-of-order execution, memory flushes, and silent memory corruption can lead to even more obscure issues.

TSAN warnings need to be fixed.

Civil (Author) commented Jul 5, 2016

Also, most of the warnings are about setting a variable with CAS in one thread while a second thread sets it without CAS, e.g.:
https://github.com/grobian/carbon-c-relay/blob/master/dispatcher.c#L579
https://github.com/grobian/carbon-c-relay/blob/master/dispatcher.c#L474

This results in undefined behavior: the code can work correctly with one version of the compiler, but break in unpredictable ways even after a minor version bump.
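
A sketch of the pattern being described; the struct, the takenby field, and the helper names are assumptions for illustration, not code copied from dispatcher.c:

struct connection {
    int takenby;  /* 0 = free, otherwise the id of the owning thread */
};

/* one thread claims the connection with an atomic compare-and-swap */
int try_claim(struct connection *c, int self_id) {
    return __sync_bool_compare_and_swap(&c->takenby, 0, self_id);
}

/* racy release: a plain store that runs concurrently with the CAS above;
   this mix of atomic and non-atomic access is what TSAN reports */
void release_racy(struct connection *c) {
    c->takenby = 0;
}

/* consistent release: make the store atomic too */
void release_atomic(struct connection *c) {
    __atomic_store_n(&c->takenby, 0, __ATOMIC_RELEASE);
}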

grobian (Owner) commented Jul 6, 2016

Explain to me how that can cause an issue, and I'll fix it. The only thing that needs to be prevented there is multiple threads assigning the same connection to themselves, hence the atomic compare-and-swap, because I didn't want to pthread-lock it. So please enlighten me how, in this scenario, the releasing of the connection by the thread itself can cause a race condition.

Civil (Author) commented Jul 6, 2016

  1. It's undefined behavior from the compiler's point of view. You can't be sure that the compiler won't optimize out some of your code.
  2. You can't be sure that the compiler will produce proper cache invalidation. Imagine one thread thinking the value is 0 while another thinks it's 1.
  3. When one thread does a CAS and another does a plain assignment (as in my examples; this code is proven to run in different threads), the result is undefined.

Without fixing this you can never be sure that the aggregation code produces correct output and that your data is mathematically correct, which is kind of good to have.

If you don't trust Damian or me, I'd suggest opening a discussion on the gcc mailing list, for example, about how correct your assumptions are at this moment.

grobian (Owner) commented Jul 7, 2016

As far as I understand, we're talking about a selection mechanism for whether a thread handles a certain connection or not. The aggregation code doesn't use these, but uses full pthread locks instead, so I don't yet understand why you think the aggregator produces mathematically invalid results.

Civil (Author) commented Jul 10, 2016

You have various undefined behaviors during writes and reads all over the aggregator pipeline, from parsing up to expiry and dispatching. Because of the nature of these issues, you can't guarantee that the data will reach the aggregator at all. That's why, given some input, you can't guarantee that your output will be the same every time, or that it'll be correct at all. Behavior may also change when you migrate from one compiler version to another.

That's why there is a very high chance that the current aggregations are mathematically incorrect, at least in some cases.

I understand your point about optimistic concurrency, but the TSAN output says you are doing it wrong in some places (if you access a variable with atomic operations you should do so everywhere; if you do atomic access in one function and non-atomic access in another, the result is unpredictable).
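
As a sketch of "do it everywhere": the counter below is hypothetical, but it shows how a single plain load next to otherwise-atomic updates is still a data race that TSAN will report:

#include <stddef.h>

static size_t metrics_received;  /* hypothetical shared counter */

/* writer threads: atomic increment via the gcc builtin */
void account_metric(void) {
    __sync_add_and_fetch(&metrics_received, 1);
}

/* racy reader: a plain load of the same counter is a data race,
   and the compiler is free to cache or reorder it */
size_t stats_racy(void) {
    return metrics_received;
}

/* consistent reader: load atomically as well */
size_t stats_atomic(void) {
    return __atomic_load_n(&metrics_received, __ATOMIC_RELAXED);
}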

And again, if you don't trust me or Damian, I'd suggest opening a thread on the gcc mailing list, for example, providing them code examples and asking whether it's correct.

grobian (Owner) commented Jul 10, 2016

It's not about trusting people, it's about showing accurate examples of how it can go wrong. So far, both you and Damian have been talking about how much can go wrong, but without showing any relevant example (the while loop isn't relevant in this case!) proving or demonstrating that it does.

You talk about repeatability of aggregations, but you seem to forget that by the nature of the streaming scenario, the aggregations will never be repeatable. Yet if you picked clock times carefully, you might be able to produce a unit-like test that results in the same aggregation values over and over again. That would be a nice start on which to further build your argument that the mathematics behind the aggregations are incorrect. On a side note, back when I was still working on this I found that the results of the aggregator unfortunately suffer severely from incomplete inputs, something that's inherent to the whole streaming thing. For small/controlled scenarios the count field can give a confidence level here, but in larger/uncontrolled aggregations this is no longer feasible to check.

grobian (Owner) commented Aug 6, 2016

I think we've come a great deal of the way now, but it's not fully done.

grobian (Owner) commented Sep 11, 2016

I get clean runs now, but I'm not guaranteeing there are none left.

dgryski (Contributor) commented Sep 11, 2016

Thanks. Will try this out in our environment Monday or Tuesday and let you know.

grobian (Owner) commented Sep 11, 2016

I can guarantee that the aggregator got a lot slower because I had to apply a larger lock.

grobian (Owner) commented Dec 25, 2016

I think I addressed all TSAN issues.

grobian closed this as completed Dec 25, 2016