Word2Vec original C is faster #1291
I should note that I am using ...
output log attached, too
It seems the key is the compiler optimizations on the C program. Without them, gensim is much faster than the C implementation. Does that mean gensim is only faster than C when the C code is built without compiler optimizations?
It's a known issue that gensim's Cython routines don't get a nearly-linear speedup with the number of cores. A major factor is that some portions of the implementation are still in pure Python, or otherwise still hold the "GIL" – notably the corpus iteration/tokenization, the parcelling of job-sized chunks to threads, and the lookup of word-tokens to array-indexes. All such sections are limited to a single thread, no matter how many workers are configured.

Another likely factor is that the C version only reads data in one way, from a single file and format, whereas gensim takes any Python iterator. The C version can thus simply tell each thread to do its own IO starting at different seek-points in the file. (I believe a side effect, usually too minor to be a consideration, is that some regions of the file could be trained slightly more or less than the exact chosen iteration count.)

You might see better gensim performance by removing IO/tokenization as a factor: convert the corpus iterator to an in-memory list-of-lists-of-strings before ever passing it to Word2Vec, so all operations happen on already-in-memory, already-tokenized texts.

It's also possible (not as sure about this) that a factor in your particular results is gensim's default job-chunking size, given the small size of the text8 corpus – it might only create enough chunks for fewer-than-the-full-number-of-threads, or face more idleness around the start and finish (gensim assigns exactly the requested training to its threads, while the original word2vec.c just has all threads go full speed until the expected count of training examples is finished). So you might see closer performance, or shift the plateau out somewhat, with a larger corpus. (Forcing more iterations might achieve a similar effect.)

(Text8 is weird in another way – its lack of normal, varied sentence-breaks. I'm unsure whether this makes gensim slightly faster or slower than with more typical corpora, but it might have a slight effect.)
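(For illustration, a minimal sketch of the in-memory conversion suggested above – the corpus path, `size`, and `workers` values are placeholder assumptions, not values from this benchmark:)

```python
from gensim.models.word2vec import Word2Vec, Text8Corpus

# Materialize the iterator once, so training never repeats the
# single-threaded IO/tokenization work on every pass over the corpus.
sentences = list(Text8Corpus('text8'))

# Train from already-in-memory, already-tokenized lists of strings.
model = Word2Vec(sentences, size=200, workers=12)
```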
Also: running in a virtualenv shouldn't affect performance. The only thing 'virtual' about such an environment is its paths for discovering executables/modules/libraries – there's no virtualization-of-other-services overhead.
Is this iterator not optimized?
I'm not familiar with how to do the iterator optimization you describe. Could it be that the compiler optimization simply makes the C code run faster than the gensim Python implementation?
Here's the Text8Corpus code: What kind of 'optimized' do you mean? It's still just doing pure-Python (single-threaded) IO and string operations. And for the default...
...to only do IO/string-stuff once, then train Word2Vec from already-in-memory, already-tokenized lists-of-strings.

I'm sure the C-compiler optimizations help, but I didn't see your test numbers for the no-optimizations code. And the cythonized portions of the gensim code were compiled by the same system C-compiler, probably with similar optimization options, so I doubt that's the largest factor. You can just look at CPU core utilization during training with large numbers of threads, and see that the C code nearly saturates all assigned cores, whereas the gensim/Python code does not – so the threading limitations are a major factor (and perhaps the largest).
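(One way to watch per-core utilization from Python while a benchmark runs – using `psutil` here is my assumption; any system monitor such as `htop` shows the same thing:)

```python
import psutil

# Print per-core utilization percentages once per second.
# Near-100% on all assigned cores suggests the training threads are
# saturating them; large gaps suggest GIL/producer-thread bottlenecks.
while True:
    print(psutil.cpu_percent(interval=1.0, percpu=True))
```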
Yes, I'm sure the threading limitations are the largest factor beyond 12 cores, but even for 12 and under – even for just 1 core – the C implementation is faster. I will try turning the sentences into a list first and re-run the test.
Updated my comment above to add the point that the Cython code is compiled by the same C-compiler, likely with the same optimization level. Things to try include:
The interthread handoffs in gensim – a single producer thread reading the iterator, batching examples together, and feeding worker threads – are also a likely factor, compared to the C code, where every thread just opens its own handle into a different starting place in the same file.
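(To make the described handoff concrete, a minimal generic sketch of one producer feeding worker threads through a queue – illustrative only, not gensim's actual code; `train_batch` is a hypothetical stand-in for the compute-heavy step:)

```python
import queue
import threading

job_queue = queue.Queue(maxsize=8)  # bounded: the producer blocks when workers lag

def train_batch(batch):
    pass  # stand-in for the compute-heavy step (gensim does this in nogil Cython)

def producer(corpus_iterator, num_workers, batch_size=100):
    # A single thread does ALL iteration and batching - a serial bottleneck.
    batch = []
    for sentence in corpus_iterator:
        batch.append(sentence)
        if len(batch) >= batch_size:
            job_queue.put(batch)
            batch = []
    if batch:
        job_queue.put(batch)
    for _ in range(num_workers):
        job_queue.put(None)  # one poison pill per worker, telling it to exit

def worker():
    while True:
        batch = job_queue.get()
        if batch is None:
            break
        train_batch(batch)

# Usage: one producer thread's work, fanned out to N worker threads.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
producer([['some', 'tokenized', 'sentence']] * 1000, num_workers=len(threads))
for t in threads:
    t.join()
```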
Current status: Awaiting a benchmark on a large in-memory corpus. |
Changing the input to a list first helps a little, but it's still falling short of the C implementation. I'm using fewer cores now for a better comparison – I'd rather not distract everyone with scalability issues. All 12 of these cores are on a single chip, with a shared L3 cache and private L1 and L2 caches.

Is there really just a single producer thread? That seems like the most likely bottleneck here. I'm OK with C being faster, but I want to confirm that I'm not misunderstanding this. There's a lengthy blog post claiming that gensim achieves a huge speedup over the original C code, but that doesn't seem to be the reality here. Is C faster? Or is there something wrong with my code?
raw: https://gist.github.com/tmsimont/0079d8923be35a8d4653effecd604b34
There is a single producer thread, and only some of the (most-compute-intensive) code is eligible for full multithreading – with the pure-Python parts still subject to the Python GIL. These two factors seem sufficient to me to explain the shortfall, and core-utilization readouts may help confirm this... but I'm repeating myself. Trying other variations of parameters or corpus, as suggested above, may create more insight into where gensim may be most competitive. (Not sure exactly why those 2013 Mac benchmarks showed a gensim advantage, but both word2vec.c & gensim have changed since then, and 2017 tests on a Linux system are likely to have relevant differences in compiler, libraries, and more.)
The biggest difference (in favour of gensim) is the BLAS library used, which I do not see mentioned in this thread. @tmsimont which BLAS library is your scipy linked against (…)?
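(A common way to check – `show_config` is a real numpy/scipy API, though its output format varies by version:)

```python
import numpy
import scipy

# Print the BLAS/LAPACK libraries numpy and scipy were built against
# (e.g. OpenBLAS, ATLAS, or MKL).
numpy.show_config()
scipy.show_config()
```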
@tmsimont ping |
@piskvorky Sorry to go so long without a response. New job... major life changes, etc...
I believe the main bottlenecks here are the single-distributor-thread implementation, and general Python GIL contention. Conversation & potential improvements on those issues should continue on #336. Closing this issue in favor of that earliest report of such issues. |
Cython is installed, gensim is version 0.12.1
says 1
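(The check being quoted isn't shown; presumably it's gensim's optimized-routine flag. A hedged sketch of that check – `FAST_VERSION` is a real attribute of `gensim.models.word2vec`, but whether this was the exact command used here is an assumption:)

```python
from gensim.models.word2vec import FAST_VERSION

# -1 means the slow pure-Python fallback is in use;
# 0 or 1 means the compiled Cython training routines are active.
print(FAST_VERSION)
```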
To generate the gensim results, I have run this:
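(The actual script isn't shown above. Purely as an illustration, a timing harness of this general shape could produce per-thread-count numbers – the corpus path, default parameters, and the 1–48 `workers` sweep are all assumptions:)

```python
import time
from gensim.models.word2vec import Word2Vec, Text8Corpus

for workers in range(1, 49):  # mirror the 1..48 thread sweep
    sentences = Text8Corpus('text8')  # assumed corpus location
    start = time.time()
    Word2Vec(sentences, workers=workers)
    print(workers, time.time() - start)
```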
I ran the C implementation like this:
(I looped this script and passed in `seq 48`.) The C version was built with these flags:
The machine is a single node with 2 Intel Xeon E5-2650v4 Broadwell-EP CPUs, 24 cores total (12 cores per processor). The CPUs support hyperthreading, which is why my experiments go up to 48 threads (this thing is a beast).
Results:
Raw:
https://gist.github.com/tmsimont/451f3fa17ef28ae57cb87d55ca04245a
Gensim is slower at every number of threads, and seems unable to scale beyond the 12 cores of a single processor.
Any idea why the original C version is so much faster at all thread counts?