Various race conditions reported by TSAN #199

Closed
Civil opened this issue Jul 5, 2016 · 16 comments

Comments


Civil commented Jul 5, 2016

https://gist.github.com/Civil/e4aaffb76be7e216cc4f259e162b70b9 - TSAN report. Around 40 unique places where races happen.

Also please note that everything TSAN reports here is a real bug, e.g. writes performed while holding only a read mutex, atomic and non-atomic writes to the same value from different threads at the same time, etc.

So to be sure that everything is correct, it's an absolute must to fix ALL of them.

grobian (Owner) commented Jul 5, 2016

Right, so the gcc atomic instructions are races?
Most of them look like false positives, which the empirical evidence also suggests.

dgryski (Contributor) commented Jul 5, 2016

Any value accessed from multiple threads needs to be protected with a mutex, or it must always be accessed with atomic intrinsics.

TSAN does not have any false positives. If it detects a race, then there really was a race: two threads tried to access the same piece of memory, and at least one access was a write.

https://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
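
To make that concrete, here is a minimal sketch of the two options; the counter and the helper names are illustrative, not taken from carbon-c-relay:

#include <pthread.h>
#include <stdatomic.h>

/* Option 1: every access to the shared value is atomic (C11 atomics). */
static atomic_long hits;

void count_hit_atomic(void) {
    atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}

long read_hits_atomic(void) {
    return atomic_load_explicit(&hits, memory_order_relaxed);
}

/* Option 2: every access goes through the same mutex. */
static long hits_locked;
static pthread_mutex_t hits_lock = PTHREAD_MUTEX_INITIALIZER;

void count_hit_locked(void) {
    pthread_mutex_lock(&hits_lock);
    hits_locked++;
    pthread_mutex_unlock(&hits_lock);
}

long read_hits_locked(void) {
    pthread_mutex_lock(&hits_lock);
    long v = hits_locked;
    pthread_mutex_unlock(&hits_lock);
    return v;
}

Mixing the two styles on the same variable, or leaving some accesses plain, is exactly what TSAN flags.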

Civil (Author) commented Jul 5, 2016

Also please note that it takes some time to actually understand the traces. Everything that is about "gcc atomics access the value" is there mostly because there are places where the same value is accessed in a non-atomic way.

grobian (Owner) commented Jul 5, 2016

Yes, but unless I misunderstand what it warns about, in this case it doesn't matter that it races, because the code uses optimistic concurrency.
If a counter increment fails, meh. If an availability check fails while it could have succeeded, meh, better luck next time.

dgryski (Contributor) commented Jul 5, 2016

The compiler generates code that is correct only under the assumption that there is no concurrent access to the values. Higher optimization levels or newer compilers can easily spill random values to memory, as long as everything is back where it should be when the basic block is finished.

For example,

int done;

void func() {
    while (!done) {
        /* ... work that never touches done ... */
    }
}

Assuming there is no access to done inside the while loop, the compiler is justified in replacing while(!done) with while(1), since that is the behaviour with no concurrent access.
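
One way to make that flag well-defined, as a sketch using C11 <stdatomic.h> (purely illustrative; the relay itself uses the gcc builtins):

#include <stdatomic.h>

atomic_int done;

void func(void) {
    /* the atomic load tells the compiler (and TSAN) that done may be
       written concurrently, so the read cannot be hoisted out of the loop */
    while (!atomic_load_explicit(&done, memory_order_acquire)) {
        /* ... work that never touches done ... */
    }
}

void finish(void) {
    /* the writer side must use an atomic store as well */
    atomic_store_explicit(&done, 1, memory_order_release);
}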

Out-of-order execution, memory flushes, and silent memory corruption can lead to even more obscure issues.

TSAN warnings need to be fixed.

Civil (Author) commented Jul 5, 2016

Also, most of the warnings are about setting a variable with CAS in one thread while a second thread sets it without CAS, e.g.:
https://github.com/grobian/carbon-c-relay/blob/master/dispatcher.c#L579
https://github.com/grobian/carbon-c-relay/blob/master/dispatcher.c#L474

This results in undefined behavior: the code can work correctly with one version of the compiler, but break in unpredictable ways even after a minor version bump.
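
A sketch of the pattern being described; the struct, the takenby field, and the helper names are assumptions for illustration, not code copied from dispatcher.c:

struct connection {
    int takenby;  /* 0 = free, otherwise the id of the owning thread */
};

/* one thread claims the connection with an atomic compare-and-swap */
int try_claim(struct connection *c, int self_id) {
    return __sync_bool_compare_and_swap(&c->takenby, 0, self_id);
}

/* racy release: a plain store that runs concurrently with the CAS above;
   this mix of atomic and non-atomic access is what TSAN reports */
void release_racy(struct connection *c) {
    c->takenby = 0;
}

/* consistent release: make the store atomic too */
void release_atomic(struct connection *c) {
    __atomic_store_n(&c->takenby, 0, __ATOMIC_RELEASE);
}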

grobian (Owner) commented Jul 6, 2016

Explain to me how that can cause an issue, and I'll fix it. The only thing that needs to be prevented there is multiple threads assigning the same connection to themselves, hence the atomic compare-and-swap, because I didn't want to pthread-lock it. So please enlighten me how, in this scenario, the releasing of the connection by the thread itself can cause a race condition.

Civil (Author) commented Jul 6, 2016

  1. It's undefined behavior from the compiler's point of view. You can't be sure that the compiler won't optimize out some of your code.
  2. You can't be sure that the compiler will produce proper cache invalidation. Imagine one thread thinking the value is 0 while another thinks it's 1.
  3. When one thread does a CAS and another does a plain assignment (as in my examples; this code is proven to run in different threads), the result is undefined.

Without fixing this you can never be sure that the aggregation code produces correct output and that your data is mathematically correct, which is kind of good to have.

If you don't trust Damian or me, I'd suggest opening a discussion on the gcc mailing list, for example, about how correct your assumptions are at this moment.

grobian (Owner) commented Jul 7, 2016

As far as I understand, we're talking about a selection mechanism for whether a thread handles a certain connection or not. The aggregation code doesn't use these, but uses full pthread locks instead, so I don't yet understand why you think the aggregator produces mathematically invalid results.

Civil (Author) commented Jul 10, 2016

You have various undefined behaviors during writes and reads all over the aggregator pipeline, from parsing up to expiry and dispatching. Because of the nature of these issues, you can't guarantee that the data will reach the aggregator at all. That's why, given some input, you can't guarantee that your output will be the same every time, or that it'll be correct at all. Behavior may also change when you migrate from one compiler version to another.

That's why there is a very high chance that the current aggregations are mathematically incorrect, at least in some cases.

I understand your point about optimistic concurrency, but the TSAN output says you are doing it wrong in some places (if you access a variable with atomic operations you should do so everywhere; if you do atomic access in one function and non-atomic access in another, the result is unpredictable).
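
As a sketch of "do it everywhere": the counter below is hypothetical, but it shows how a single plain load next to otherwise-atomic updates is still a data race that TSAN will report:

#include <stddef.h>

static size_t metrics_received;  /* hypothetical shared counter */

/* writer threads: atomic increment via the gcc builtin */
void account_metric(void) {
    __sync_add_and_fetch(&metrics_received, 1);
}

/* racy reader: a plain load of the same counter is a data race,
   and the compiler is free to cache or reorder it */
size_t stats_racy(void) {
    return metrics_received;
}

/* consistent reader: load atomically as well */
size_t stats_atomic(void) {
    return __atomic_load_n(&metrics_received, __ATOMIC_RELAXED);
}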

And again, if you don't trust me or Damian, I'd suggest opening a thread on the gcc mailing list, for example, providing them code examples and asking whether it's correct.

grobian (Owner) commented Jul 10, 2016

It's not about trusting people, it's about showing accurate examples of how it can go wrong. So far, both you and Damian have been talking about how much can go wrong, but without showing any relevant example (the while loop isn't relevant in this case!) proving or demonstrating that it does.

You talk about repeatability of aggregations, but you seem to forget that by the nature of the streaming scenario, the aggregations will never be repeatable. Yet if you picked clock times carefully, you might be able to produce a unit-like test that results in the same aggregation values over and over again. That would be a nice start on which to further build your argument that the mathematics behind the aggregations are incorrect. On a side note, back when I was still working on this I found that the results of the aggregator unfortunately suffer severely from incomplete inputs, something that's inherent to the whole streaming thing. For small/controlled scenarios the count field can give a confidence level here, but in larger/uncontrolled aggregations this is no longer feasible to check.

grobian (Owner) commented Aug 6, 2016

I think we've come a great deal of the way now, but it's not fully done.

grobian (Owner) commented Sep 11, 2016

I get clean runs now, but I'm not guaranteeing there are none left.

dgryski (Contributor) commented Sep 11, 2016

Thanks. Will try this out in our environment Monday or Tuesday and let you know.

grobian (Owner) commented Sep 11, 2016

I can guarantee that the aggregator got a lot slower because I had to apply a larger lock.

grobian (Owner) commented Dec 25, 2016

I think I addressed all TSAN issues.

grobian closed this as completed Dec 25, 2016