Race Condition in Multi-threaded word2vec #571
Comments
I don't see how this could be non-atomic. Could you post (gist) your verification tests?
OK, it looks like in-place increment is not atomic in Python, so this may indeed lead to a race condition. Still interested to see your tests though. I've never seen the "training often never finishes" in any runs or tests, so I'm curious what numbers you got.
As far as I know, only single opcodes are guaranteed to be thread-safe, and "+=" consists of reading, adding to, and writing the value of the variable (loosely speaking; I didn't verify the exact opcodes). I can whip up some code showing this tomorrow.
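A minimal sketch of such a demonstration (the `Worker` class here is illustrative, not gensim's code): disassembling an in-place decrement shows the separate read, subtract, and write opcodes, and CPython may switch threads between any two of them.

```python
import dis

class Worker:
    def __init__(self):
        self.jobs_left = 0

    def decrement(self):
        self.jobs_left -= 1

# Print the bytecode for the decrement; exact opcodes vary by version.
dis.dis(Worker.decrement)
# On CPython 3.x this is roughly:
#   LOAD_FAST         self
#   DUP_TOP
#   LOAD_ATTR         jobs_left
#   LOAD_CONST        1
#   INPLACE_SUBTRACT
#   ROT_TWO
#   STORE_ATTR        jobs_left
# A thread switch between LOAD_ATTR and STORE_ATTR loses one update.
```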
I ran some computations on huge data (>1 billion words) with 12 threads; the process never finished (2 of 2 times) and was hanging at the end. I also experienced it on smaller input (~10 million words), but it's less likely to occur there (maybe 1 in 4 times), which makes sense.
Sure, thanks for reporting! What version of gensim was that on? Develop or some release?
I used the current develop branch (from two days ago). Batch-based multi-threading scales a lot better in my experiments! (Not sure if it's in a release version already.)
Acquiring a lock before the increment and releasing it after would probably be an easy fix, or do you see any problems there?
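A minimal sketch of that locking approach, with hypothetical names modeled on the report (not gensim's actual code): every read-modify-write of the shared counter goes through a single `threading.Lock`, so the producer's increment, the workers' decrement, and the termination check can no longer interleave.

```python
import threading

class JobCounter:
    """Shared jobs-left counter guarded by a single lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._jobs_left = 0

    def increment(self):
        # Called by the job producer for each queued job.
        with self._lock:
            self._jobs_left += 1

    def decrement(self):
        # Called by a worker thread when it finishes a job.
        with self._lock:
            self._jobs_left -= 1

    def any_left(self):
        # Termination condition: training loops while this is True.
        with self._lock:
            return self._jobs_left > 0
```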
Yes, either that, or using a C counter (like in the page I linked to), making the operation atomic. If using a lock, can you time the difference before/after the change? Say on the text9 corpus. Just as a sanity check, to make sure nothing unexpectedly horrible happened to performance. Cheers!
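A rough sketch of that sanity check (the path and worker count are my placeholders; I'm assuming text9 is in the same whitespace-separated format as text8, so gensim's `Text8Corpus` reader applies):

```python
import time
from gensim.models.word2vec import Word2Vec, Text8Corpus

corpus = Text8Corpus('text9')        # placeholder path to the corpus file
start = time.time()
model = Word2Vec(corpus, workers=4)  # train with default parameters
print('training took %.1f s' % (time.time() - start))
```

Run this once on each branch and compare the reported times.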
Still curious to see some stats from @elmar-haussmann too. I find it weird that he sees a problem (word2vec hangs) 25% of the time, while we've run thousands of tests and runs by this point and never saw it. I would like to get to the bottom of this disparity, to make sure there's not another issue being masked.
Can you be a bit more specific on what kind of stats you think are valuable, and I'll be happy to try to produce them. I am using Anaconda Python 3.5.1 (and the Intel MKL BLAS libraries). I'd assume a different Python interpreter can play a role in this.
I've seen occasional hangs in unit test runs, both locally (OS X) and on the TravisCI (Linux) and AppVeyor (Windows) builds, which are likely due to this same issue. There were also some intermittent new logged warnings of the form …
#581 should fix this; it only uses local vars in each thread for running counts, and leverages the existing … It also updates the …
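The per-thread-counter idea could be sketched as follows (the names and `process_job` are illustrative placeholders, not gensim's internals): each worker tallies progress in a local variable and touches shared state only once, inside a short locked section.

```python
import threading

progress_lock = threading.Lock()
total_examples = 0          # shared running total across all workers

def process_job(job):
    return 1                # stub: pretend each job trains one example

def worker(jobs):
    global total_examples
    local_examples = 0      # thread-local tally, no synchronization needed
    for job in jobs:
        local_examples += process_job(job)
    with progress_lock:     # merge into shared state exactly once
        total_examples += local_examples
```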
I am currently running a large test (50 billion words). Will report as soon as it's done (~2 days).
Can confirm this is fixed. Wasn't able to reproduce any hangs.
Fixed.
Original report
The current development branch uses batch-based multi-threading. Some parts of the code are not synchronized properly and are susceptible to race conditions. In particular, workers execute `self.jobs_left -= 1` and the job producer executes `self.jobs_left += 1`. Neither operation is atomic. Depending on timing and on when the GIL is released, this leads to a wrong count of jobs left. Since the termination condition for training checks whether any jobs are left, training often never finishes properly. I verified this by outputting the number of jobs left at the end of training: the longer the training run, the higher the chance of a corrupted value and of training not terminating.
Access should be properly synchronized (via locking), or another thread-safe data structure should be used instead.
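To make the reported effect concrete, here is a standalone sketch (mine, not the reporter's verification code) of the lost-update race; whether and how often it triggers depends on the interpreter version and machine load:

```python
import threading

class State:
    jobs_left = 0           # shared, unsynchronized counter

state = State()
N = 1000000                 # enough iterations to make a race likely

def producer():
    for _ in range(N):
        state.jobs_left += 1    # read, add, write: not atomic

def consumer():
    for _ in range(N):
        state.jobs_left -= 1    # interleaves with the producer's writes

threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Balanced +1/-1 operations should leave 0; a nonzero value here is a
# lost update, the same corruption that keeps training from terminating.
print('jobs_left at end:', state.jobs_left)
```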