Potential data race #1092

Closed
typeless opened this issue Dec 19, 2018 · 2 comments

@typeless

#1091 contains sample code that can produce a data race.
I have made several attempts to fix it but could not reach a satisfying result.
Please have a look.

@mschoch
Contributor

mschoch commented Feb 1, 2019

Thanks for reporting this; I have taken a look.

  1. First, although the test you provided demonstrated the problem in upsidedown, I suspected it might affect both upsidedown and scorch (or at least that we should test both to see). I adapted your test to run one level higher, at the top-level bleve package.

  2. The results were a bit surprising, because we had already encountered a similar bug in the past (Data race in Batch.Reset() #260). Further, we added a test for it at the time, and that test passes with the race detector (for both upsidedown and scorch).

  3. So what is different about your test? Notably, you're indexing empty batches. In fact, if you add a line to your test to index a document each time, the race detector passes. So it was the combination of indexing empty batches and reusing the batch that triggered the race detector (see the sketch below).
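For reference, here is a minimal sketch of the kind of loop that exercises this, using the top-level bleve API (not the exact test from #1091; the mapping and in-memory index setup are my own assumptions):

```go
package main

import "github.com/blevesearch/bleve"

func main() {
	// Build a small in-memory index; a disk-backed upsidedown or scorch
	// index shows the same behavior.
	mapping := bleve.NewIndexMapping()
	idx, err := bleve.NewMemOnly(mapping)
	if err != nil {
		panic(err)
	}
	defer idx.Close()

	batch := idx.NewBatch()
	for i := 0; i < 100; i++ {
		// The batch contains no documents, so (before the fix) the
		// internal goroutine that walks it never synchronizes with us...
		if err := idx.Batch(batch); err != nil {
			panic(err)
		}
		// ...and reusing the batch here is what the race detector flags.
		batch.Reset()
	}
}
```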

What I've determined is that when there are documents in the batch, we internally synchronize by going over a channel, so the race detector can prove the access is safe. However, when there are no documents in the batch, the loop runs zero times, we never go through that channel synchronization, and the race detector correctly points out the bug.
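To illustrate the shape of the problem, here is a simplified sketch (not the actual upsidedown code; the types and function are made up for illustration):

```go
package sketch

// Batch mirrors the relevant shape: a map of pending operations.
type Batch struct {
	IndexOps map[string]interface{} // updates and deletes keyed by doc ID
}

// Reset clears the batch for reuse; it runs on the caller's goroutine.
func (b *Batch) Reset() { b.IndexOps = map[string]interface{}{} }

// indexBatch mimics the racy pattern: a worker goroutine walks the
// batch while the caller waits for one result per entry on a channel.
func indexBatch(b *Batch) {
	results := make(chan struct{})

	go func() {
		for range b.IndexOps { // reads b.IndexOps on another goroutine
			results <- struct{}{}
		}
	}()

	for range b.IndexOps { // zero receives when the batch is empty
		<-results
	}
	// With an empty batch there was no channel operation at all, so
	// nothing orders the goroutine's read of b.IndexOps before a later
	// b.Reset() on the caller's side -- exactly what the detector flags.
}
```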

It turns out the fix is straightforward: we can eliminate the unsynchronized access by the other goroutine in the case where there are no documents, thus removing the race.
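Continuing the sketch above, the fix amounts to never handing an empty batch to another goroutine at all (again a simplification of the approach described here, not the actual PR):

```go
// indexBatchFixed avoids touching the batch from another goroutine when
// there is nothing to analyze.
func indexBatchFixed(b *Batch) {
	if len(b.IndexOps) == 0 {
		return // nothing to do; no other goroutine ever sees the batch
	}

	results := make(chan struct{})
	go func() {
		for range b.IndexOps {
			results <- struct{}{}
		}
	}()
	for range b.IndexOps {
		<-results // at least one receive, so the goroutines synchronize
	}
}
```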

I'll be submitting a PR to address this issue shortly.

mschoch added the bug label Feb 1, 2019
@mschoch
Contributor

mschoch commented Feb 1, 2019

Fixed by #1121

mschoch closed this as completed Feb 1, 2019
mschoch added a commit that referenced this issue Mar 3, 2019
this is another variation of the race found/fixed in #1092.
in that case the batch was empty, which meant we would skip
the code that properly synchronized access.  our fix only
handled that exact case (no data operations). however,
there is another variation: if the batch contains only deletes
(which are data ops), we still spawned the goroutine, and
since there were no real updates, the synchronization
code was again skipped, so the data race could still happen.

the fix is to check the number of updates (computed earlier on
the caller's goroutine, so it's safe) instead of the length
of the IndexOps (which includes both updates and deletes).

the key is that we should only spawn the goroutine that will
range over the batch in cases where we will synchronize by
waiting for the analysis to complete (at least one update).

fixes #1149
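As a rough illustration of that guard, continuing the earlier sketch (variable names follow the commit message; the convention that a nil value marks a delete, and the rest of the structure, are assumptions):

```go
// indexBatchRefined spawns the worker only when the caller will actually
// wait on it, i.e. when there is at least one update. A delete-only
// batch is never handed to another goroutine.
func indexBatchRefined(b *Batch) {
	// numUpdates is computed here, on the caller's goroutine, before
	// anything is handed off, so reading b.IndexOps is still safe.
	numUpdates := 0
	for _, op := range b.IndexOps {
		if op != nil { // assuming a nil value marks a delete
			numUpdates++
		}
	}

	results := make(chan struct{})
	if numUpdates > 0 { // was: len(b.IndexOps) > 0, also true for delete-only batches
		go func() {
			for _, op := range b.IndexOps {
				if op != nil {
					results <- struct{}{} // one signal per analyzed update
				}
			}
		}()
	}

	for i := 0; i < numUpdates; i++ {
		<-results
	}
	// deletes are applied here on the caller's goroutine (omitted)
}
```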