Fix indexing error in word2vec_inner.pyx #3136
Conversation
Thank you for your contribution. I understand that it improves correctness, but it decreases readability and maintainability. Would it be possible to refactor the code to use an additional function call without a performance hit? E.g.:

```cython
def _mul(a, b):
    """Informative docstring here."""
    return <long long>a * <long long>b

row2 = _mul(word_point[b], size)
```

That way, readability and maintainability will not decrease as a result of your contribution.
I think using an additional function would not impact performance (at least not by much).
However, since the surrounding code runs under nogil, the function should be declared as cdef (maybe with inline), e.g.:

```cython
cdef long long _mul(const np.uint32_t a, const int b) nogil:
    """Safe multiplication of ints with explicit typecasting."""
    return <long long>a * <long long>b
```
Thanks for the fix! I don't find the extra casts-for-correctness'-sake to hurt readability. (It might be sufficient to only cast one of the operands for the result to be correct, or to make the changes more compact with some other type-def of …)
I have updated the code accordingly (using an additional cdef function).
To go further than my previous comment: I think the indirection of a 1-line …
@gojomo I think there is not much performance loss, because modern compilers handle this well. Anyway, what is your suggestion? Roll back to the inline typecast?
If there is no performance cost, then I think the code is fine as is. The remaining point of contention (is the code clearer or not) is debatable, but there is little benefit in engaging in that debate. @bluekura Thank you for your work. Congrats on your first contribution to the gensim repo! 🥇
The fix is great, but does the claim that the typecasting has no performance impact follow from an actual benchmark? The word2vec model (and its derivations) is crucial to Gensim, so I'd like to see more rigour here before merging. @bluekura do you have any numbers? Can you share the benchmark setup? Thanks.
OK, should we roll back the merge while we work this thing out?
No need, if the perf numbers support the claim. But if we find a serious regression (~more than a few percent), we'll have to revisit this, yes.
Okay. Here is my test: 50K vocab, 100K sentences, 50 words per sentence, 5 epochs, 100 vector dimensions, on a Threadripper 3990X with 8 workers + ATLAS.

Original Gensim 4.0.0 (distributed version): 24.8 s ± 403 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Modified Gensim 4.1.0-dev (this version): 23.7 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Considering the measurement error, I think they are roughly equal in performance. Indeed, typecasting is a "super cheap" operation, especially compared with the BLAS calls. Yeah!
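For reference, a minimal sketch of a comparable timing setup, assuming Gensim 4.x (the parameter values come from the comment above; the corpus-generation code is illustrative, not from the PR):

```python
import random
import time
from gensim.models import Word2Vec

vocab = [str(i) for i in range(50_000)]                   # 50K vocab
sentences = [[random.choice(vocab) for _ in range(50)]    # 50 words/sentence
             for _ in range(100_000)]                     # 100K sentences

start = time.perf_counter()
model = Word2Vec(sentences, vector_size=100, epochs=5, workers=8, min_count=1)
print(f"trained in {time.perf_counter() - start:.1f} s")
```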
It's nice to see no obvious performance hit in a tiny trial (though I'm not sure this is enough of a test to be sure it's being inlined on all relevant architectures). It's not the casting that's a concern: that's both cheap and, in any case, required for correct behavior. It's the extra indirection of a function call, which could add overhead and is not required for correct behavior. So it strikes me as absurd to hide casting-for-correctness inside the indirection of a function call, for something as simple as multiplication. A person reading the code shouldn't have to think, "is this …"
Yes, I also agree. That is why my original code just cast inline (straightforward and simple). This is a question of philosophy. I demonstrated the problem and its cause, along with possible working solutions. I think it is now your turn (that is, the more engaged collaborators of gensim).
Thank you @bluekura. My preference is also for direct, easy-to-read code (although I don't feel as strongly about it as @gojomo does in this instance). I wonder whether Cython used to give us a compilation warning here (which we ignored), and whether more problems of a similar nature still lurk in the Gensim code base.
It seems odd to me that the original author & 2-of-3 commenters preferred an inline fix at each multiplication, without the extra indirection of a tiny …
Fix #3100
Problem description
The main tests and discussion are in #3100.
If vocab_size * vector_dim > 2^32, there were indexing errors: in such cases, vectors that train() should never touch were also updated unexpectedly.
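The root cause is C integer-promotion semantics; a minimal sketch (the variable names mirror word2vec_inner.pyx, the concrete values are illustrative):

```cython
cimport numpy as np

cdef np.uint32_t word_index = 49000000  # a word's row in the vectors array
cdef int size = 100                     # vector dimensionality
cdef long long row2

# uint32 * int is evaluated in 32-bit arithmetic, so the product wraps
# modulo 2^32 *before* it is widened for the assignment:
row2 = word_index * size             # 4,900,000,000 wraps to 605,032,704

# Casting an operand first forces a 64-bit multiplication:
row2 = <long long>word_index * size  # 4,900,000,000, as intended
```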
Solution
In the Cython code (word2vec_inner.pyx), the array index is calculated as vector_dimension (a C int) * vocab_index (an np.uint32_t) and then assigned to a long long variable without explicit typecasting. The multiplication itself is still performed in 32 bits, so Cython needs an explicit cast to avoid this problem. This typecasting should not impact performance much.
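Per the review discussion above, the change was merged with the cast wrapped in a small helper; a sketch of the resulting call-site change (exact diff context abridged):

```cython
cdef inline long long _mul(const np.uint32_t a, const int b) nogil:
    """Safe multiplication of ints with explicit typecasting."""
    return <long long>a * <long long>b

# A call site such as
#     row2 = word_point[b] * size
# then becomes overflow-safe as
row2 = _mul(word_point[b], size)
```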
(Add-on) To test the original issue with vectors_lockf, I set vectors_lockf for the words in [0:500000). It works fine if and only if vectors_lockf is an array with dtype np.float32_t. Because the default dtype of a NumPy ndarray is np.float64_t (on 64-bit systems), the solution sketched below can help. However, as @gojomo told me, the lock functionality is experimental, so I decided not to update this part.
Line 479, word2vec_inner.pyx →
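A minimal sketch of the workaround described above, assuming a trained Gensim 4.x Word2Vec object named model (the name and the locked range are illustrative):

```python
import numpy as np

# vectors_lockf must have dtype np.float32: the Cython code reads it as
# np.float32_t, and NumPy's default float64 silently breaks the locking.
lockf = np.ones(len(model.wv.index_to_key), dtype=np.float32)
lockf[0:500000] = 0.0          # 0.0 freezes a row; 1.0 leaves it trainable
model.wv.vectors_lockf = lockf
```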
Test:
Train ~50,000,000 randomly generated words (string numbers), then train further with randomly generated numbers from [0:551000).
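One way to obtain the "vectors changed" counts reported below (a sketch; model and more_sentences are illustrative names, and snapshotting 50M rows of float32 vectors requires tens of GB of RAM):

```python
import numpy as np

before = model.wv.vectors.copy()   # snapshot taken after the initial training
model.train(more_sentences, total_examples=len(more_sentences), epochs=5)

changed = np.any(model.wv.vectors != before, axis=1)
print("[0:551000):  ", int(changed[:551000].sum()), "vectors changed")
print("[551000:end):", int(changed[551000:].sum()), "vectors changed")
```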
Original Gensim code
Dims: 100, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 100, [0: 551000): 473373 vectors changed
Dims: 100, [551000:vocab_max_number): 152651 vectors changed: Should not be updated
This code
Dims: 100, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 100, [0: 550000): 0 vectors changed
Dims: 100, [550000: 551000): 1000 vectors changed
Dims: 100, [551000:vocab_max_number): 0 vectors changed
Dims: 150, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 150, [0: 550000): 0 vectors changed
Dims: 150, [550000: 551000): 1000 vectors changed
Dims: 150, [551000:vocab_max_number): 0 vectors changed
Dims: 200, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 200, [0: 550000): 0 vectors changed
Dims: 200, [550000: 551000): 1000 vectors changed
Dims: 200, [551000:vocab_max_number): 0 vectors changed