
Word2Vec original C is faster #1291

Closed
tmsimont opened this issue Apr 25, 2017 · 16 comments
Labels
difficulty medium (Medium issue: requires good gensim understanding & Python skills), feature (Issue describes a new feature), performance (Issue related to performance, in HW meaning)

Comments

@tmsimont

tmsimont commented Apr 25, 2017

Cython is installed, gensim is version 0.12.1

print gensim.models.word2vec.FAST_VERSION

says 1

To generate the gensim results, I have run this:

import gensim, logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, filename='ns.log')
sentences = word2vec.Text8Corpus('../code/c-implementation/text8')

#print gensim.models.word2vec.FAST_VERSION

for i in range(1,49):
  model = gensim.models.Word2Vec(sentences, size=100,workers=i,window=8,hs=0,negative=5,sample=1e-4)
  model.save_word2vec_format('text8-ns.model.bin', binary=False)

I ran the c-implementation like this:

./word2vec -train /home/trevor/code/c-implementation/text8 -output vectors-c.bin -cbow 0 -size 100 -window 8 -negative 5 -hs 0 -sample 1e-4 -threads $1 -binary 0 -iter 10

(I looped this script and passed in seq 48)

The c version was built with these flags:

CFLAGS = -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -g

The machine is a single node with 2 Intel Xeon E5-2650v4 Broadwell-EP CPUs and 24 total cores (12 cores per processor). The CPUs support hyperthreading, which is why my experiments go up to 48 threads (this thing is a beast).

Results:
[chart "c-gensim": training time of the C implementation vs gensim across thread counts]

Raw:
https://gist.github.com/tmsimont/451f3fa17ef28ae57cb87d55ca04245a

Gensim is slower at every thread count, and does not seem to scale beyond the 12 cores on a single processor.

Any idea why the original C version is so much faster at all numbers of threads?

@tmsimont
Author

I should note that I am using virtualenv... I'm not sure if that affects the speed?

@tmsimont
Author

Output log attached, too:
ns.log.orig.txt

@tmsimont
Author

It seems the key is the compiler optimizations on the C program. Without them, gensim is much faster than the C implementation. So is gensim only faster than C when the C code is built without compiler optimizations?

$ gcc -v
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/4.8.5/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-bootstrap --enable-shared --enable-threads=posix --enable-checking=release --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-linker-hash-style=gnu --enable-languages=c,c++,objc,obj-c++,java,fortran,ada,go,lto --enable-plugin --enable-initfini-array --disable-libgcj --with-isl=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/isl-install --with-cloog=/builddir/build/BUILD/gcc-4.8.5-20150702/obj-x86_64-redhat-linux/cloog-install --enable-gnu-indirect-function --with-tune=generic --with-arch_32=x86-64 --build=x86_64-redhat-linux
Thread model: posix
gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC) 

@gojomo
Collaborator

gojomo commented Apr 25, 2017

It's a known issue that gensim's Cython routines don't get the same nearly-linear speedup with the number of cores.

A major factor is that some portions of the implementation are still in pure Python, or otherwise still hold the "GIL" – notably the corpus iteration/tokenization, the parcelling of job-sized chunks to threads, and the lookup of word-tokens to array-indexes. All such sections are limited to a single thread, no matter how many workers are configured. Many threads can still be in no-GIL operations at that time – so one way to get relatively higher thread utilization is to choose options that make the no-GIL operations take more time, such as a larger vector size, more negative samples, or a larger window.

Another likely factor is that the C version only reads data in one way, from a single file and format, whereas gensim takes any Python iterator. The C version can thus simply tell each thread to do its own IO, starting at different seek-points in the file. (I believe a side effect, usually too minor to be a consideration, is that some regions of the file could be trained slightly more or less than the exact chosen iteration count.) You might see better gensim performance by removing IO/tokenization as a factor: convert the corpus iterator to an in-memory list-of-lists-of-strings before ever passing it to Word2Vec, so all operations happen on already-in-memory, already-tokenized texts.

Another possible factor (I'm not as sure about this) in your particular results is gensim's default job-chunking size, given the small size of the text8 corpus – it might only create enough chunks for fewer than the full number of threads, or face more idleness around the start and finish (where gensim assigns exactly the requested training to its threads, while the original word2vec.c just has all threads go full speed until the expected count of training examples is finished). So you might see closer performance, or shift the plateau out somewhat, when using a larger corpus. (Forcing more iterations might achieve a similar effect.)

(Text8 is weird in another way – its lack of normal, varied sentence breaks. I'm unsure whether this could make gensim slightly faster or slower than with more typical corpora, but it might have a slight effect.)

@gojomo
Collaborator

gojomo commented Apr 25, 2017

Also: running in a virtualenv shouldn't affect performance. The only thing 'virtual' about such an environment is its paths for discovering executables/modules/libraries – there's no virtualization-of-other-services overhead.

@tmsimont
Author

Is this iterator not optimized?

sentences = word2vec.Text8Corpus('../code/c-implementation/text8')

I'm not familiar with how to do what you're suggesting to optimize this iterator.

Could it be that the compiler optimizations simply make the C code run faster than the gensim Python implementation?

@gojomo
Collaborator

gojomo commented Apr 25, 2017

Here's the Text8Corpus code:

https://github.com/RaRe-Technologies/gensim/blob/14357c182a61c319f591de2cd03b440105144d3a/gensim/models/word2vec.py#L1477

What kind of 'optimized' do you mean? It's still just doing pure-Python (single-threaded) IO and string operations. And for the default iter=5 it will have to do those things once for the vocabulary-scan, then another 5 times during training. If you have the memory, try...

sentences = list(word2vec.Text8Corpus('../code/c-implementation/text8'))

...to only do IO/string-stuff once, then train Word2Vec from already in-memory, already-tokenized lists-of-strings.

I'm sure the C-compiler optimizations help, but I didn't see your test numbers for the no-optimizations code. And the Cythonized portions of the gensim code were compiled by the same system C compiler, probably with similar optimization options. So I doubt it's the largest factor.

You can just look at CPU core utilization for large numbers of threads during training and see that the C code nearly saturates all assigned cores, whereas the gensim/Python code does not – so the threading limitations are a major factor (and perhaps the largest factor).
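
For example, one quick way to take such a readout (only a sketch; it assumes the third-party psutil package is installed, and watching top or htop in another terminal works just as well) is to sample per-core utilization from a separate Python process while training runs:

import psutil  # third-party package, assumed installed (pip install psutil)

# Sample per-core utilization once per second for 30 seconds while training
# runs in another process; values near 100% on all cores mean good saturation.
for _ in range(30):
    per_core = psutil.cpu_percent(interval=1.0, percpu=True)
    print(' '.join('%5.1f' % p for p in per_core))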

@tmsimont
Author

Yes, I'm sure that the threading limitations are the largest factor for more than 12 cores, but even for 12 and under, and even just for 1 core, the C implementation is faster. I will try turning the sentences into a list first and re-run the test.

@gojomo
Collaborator

gojomo commented Apr 25, 2017

Updated my comment above to add the point that the Cython code is compiled by the same C compiler, likely with the same optimization level.

Things to try include:

  • in-memory corpus
  • larger corpus (perhaps just by using a larger iter value)
  • more compute-intensive parameters (to extend the time spent in nogil blocks): size++, negative++, window++

The interthread handoffs in gensim – a single producer-thread reading the iterator, batching examples together, feeding worker threads – are also a likely factor, compared to the C-code where every thread just opens its own handle into a different starting-place in the same file.
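
For example, a rough sketch of those variations, reusing the script structure from earlier in this thread (the parameter values below are only illustrative, not benchmarked; iter matches the -iter 10 used in the C run):

import gensim
from gensim.models import word2vec

# In-memory corpus: read and tokenize text8 once, so the training threads
# never wait on the pure-Python iterator.
sentences = list(word2vec.Text8Corpus('../code/c-implementation/text8'))

# More compute-intensive parameters keep workers inside nogil Cython code
# for longer, and a larger iter value enlarges the training workload.
model = gensim.models.Word2Vec(
    sentences,
    size=200,      # larger vectors than the original size=100
    window=10,     # wider context than window=8
    negative=10,   # more negative samples than negative=5
    hs=0,
    sample=1e-4,
    iter=10,       # same number of passes as the C run's -iter 10
    workers=12,
)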

@tmylk
Contributor

tmylk commented May 2, 2017

Current status: Awaiting a benchmark on a large in-memory corpus.

@tmsimont
Author

tmsimont commented May 3, 2017

Changing the input to a list first helps a little bit, but it's still falling short of the C implementation.

I'm using fewer cores now for a better comparison. I'd rather not distract everyone with scalability issues. All 12 of these cores are on a single chip with a shared L3 cache and private L1 and L2 caches.

Is there really just a single producer thread? That seems like the most likely bottleneck here. I'm OK with C being faster, but I want to confirm that I'm not misunderstanding this. There's a lengthy blog post that claims a huge speedup over the original C code in gensim, but that doesn't seem to be the reality here. Is C faster? Or is there something wrong with my code?

import gensim, logging
from gensim.models import word2vec

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO, filename='ns.log')
sentences = list(word2vec.Text8Corpus('../code/c-implementation/text8'))

for i in range(1,13):
  model = gensim.models.Word2Vec(sentences, size=100,workers=i,window=8,hs=0,negative=5,sample=1e-4)
  model.save_word2vec_format('text8-ns.model.bin', binary=False)

[chart "c-compare": C implementation vs gensim training time up to 12 threads]

raw: https://gist.github.com/tmsimont/0079d8923be35a8d4653effecd604b34

@gojomo
Collaborator

gojomo commented May 3, 2017

There is a single producer thread, and only some of the (most compute-intensive) code is eligible for full multithreading – the pure-Python parts are still subject to the Python GIL. These two factors seem sufficient to me to explain the shortfall, and core-utilization readouts may help confirm this... but I'm repeating myself. Trying other variations of parameters or corpus, as suggested above, may give more insight into where gensim may be most competitive.

(Not sure exactly why those 2013 Mac benchmarks showed a gensim advantage, but both word2vec.c & gensim have changed since then, and 2017 tests on a Linux system are likely to have relevant differences in compiler, libraries, and more.)

@piskvorky
Owner

piskvorky commented May 18, 2017

The biggest difference (in favour of gensim) is the BLAS library used, which I do not see mentioned in this thread.

@tmsimont which BLAS library is your scipy linked against (scipy.show_config())?
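
(For anyone following along, that check is just a couple of lines in a Python shell:)

import scipy

# Prints the BLAS/LAPACK libraries scipy was built against
# (e.g. OpenBLAS, ATLAS, or MKL); numpy.show_config() gives similar output.
scipy.show_config()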

@piskvorky
Owner

@tmsimont ping

@tmsimont
Author

tmsimont commented Aug 3, 2017

@piskvorky Sorry to go so long without a response. New job... major life changes, etc...

lapack_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/local/lib']
    language = f77


@menshikh-iv added the "feature", "difficulty medium", and "performance" labels on Oct 2, 2017
@gojomo
Collaborator

gojomo commented Oct 10, 2017

I believe the main bottlenecks here are the single-distributor-thread implementation and general Python GIL contention. Conversation and potential improvements on those issues should continue in #336. Closing this issue in favor of that earliest report of such issues.
