
Fix indexing error in word2vec_inner.pyx #3136

Merged
merged 7 commits into piskvorky:develop on May 13, 2021

Conversation

bluekura
Contributor

@bluekura bluekura commented May 10, 2021

Fix #3100

Problem description
Main tests and discussions are in #3100
If vocab_size * vector_dim > 2^32, the indexing overflowed. In that case, vectors outside the scope of train() were also updated unexpectedly.

Solution
In the Cython code (word2vec_inner.pyx), the array index is calculated as vector_dimension (a Python int) multiplied by vocab_index (an np.uint32_t) and then assigned to a long long variable without an explicit typecast, so the product is computed in 32-bit arithmetic and can wrap around before the widening assignment. Cython needs an explicit cast on the operands to avoid this problem. The typecast should not noticeably impact performance.
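The wraparound is easy to reproduce in plain C, whose integer-promotion rules match what the generated Cython code does here. A minimal sketch (the helper names are illustrative, not from the patch):

```c
#include <stdint.h>

/* Buggy pattern: uint32_t * int is computed in 32-bit arithmetic and
   wraps mod 2^32 *before* the widening assignment to long long. */
long long buggy_index(uint32_t vocab_index, int vector_dim) {
    return vocab_index * vector_dim;
}

/* Fixed pattern (as in this PR): cast the operands first, so the
   product is computed in 64-bit arithmetic. */
long long safe_index(uint32_t vocab_index, int vector_dim) {
    return (long long)vocab_index * (long long)vector_dim;
}
```

For vocab_index = 50,000,000 and vector_dim = 100, the true product 5,000,000,000 exceeds 2^32, so the buggy version returns 705,032,704 (5e9 mod 2^32) and indexes the wrong rows of the vector array.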

(Add-on) To test the original issue with vectors_lockf, I set vectors_lockf for words [0:500000). It works correctly if and only if vectors_lockf is an array with dtype np.float32. Because the default dtype of a NumPy ndarray is np.float64 (on 64-bit systems), the following change would help. However, as @gojomo told me, the lock feature is experimental, so I decided not to update this part.

Line 479, word2vec_inner.pyx

c[0].words_lockf = <REAL_t *>(np.PyArray_DATA(model.wv.vectors_lockf))

->

c[0].words_lockf = <REAL_t *>(np.PyArray_DATA(model.wv.vectors_lockf.astype(np.float32)))
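Why the dtype matters can be shown with plain NumPy, independent of gensim (a minimal sketch): the C-level code reads the buffer through a REAL_t (float32) pointer, so handing it a float64 buffer reinterprets the raw bytes rather than converting the values.

```python
import numpy as np

# NumPy defaults to float64, but the C struct expects REAL_t (float32).
lockf64 = np.ones(4)
assert lockf64.dtype == np.float64

# Reinterpreting the float64 bytes as float32 (what a raw pointer cast
# effectively does) yields wrong values instead of 1.0.
reinterpreted = lockf64.view(np.float32)
assert not np.allclose(reinterpreted[:4], 1.0)

# Converting with astype() first preserves the values at 32-bit width.
lockf32 = lockf64.astype(np.float32)
assert lockf32.dtype == np.float32
assert np.allclose(lockf32, 1.0)
```

One caveat with the proposed one-liner: astype() returns a new array, so the pointer would reference a temporary copy and any lock updates would not land in the model's original vectors_lockf array.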

Test:
Train ~50,000,000 randomly generated words (string numbers), then train more with randomly generated numbers from [0:551000).

Original Gensim code
Dims: 100, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 100, [0: 551000): 473373 vectors changed
Dims: 100, [551000:vocab_max_number): 152651 vectors changed: Should not be updated

This code
Dims: 100, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 100, [0: 550000): 0 vectors changed
Dims: 100, [550000: 551000): 1000 vectors changed
Dims: 100, [551000:vocab_max_number): 0 vectors changed

Dims: 150, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 150, [0: 550000): 0 vectors changed
Dims: 150, [550000: 551000): 1000 vectors changed
Dims: 150, [551000:vocab_max_number): 0 vectors changed

Dims: 200, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 200, [0: 550000): 0 vectors changed
Dims: 200, [550000: 551000): 1000 vectors changed
Dims: 200, [551000:vocab_max_number): 0 vectors changed

@mpenkov
Collaborator

mpenkov commented May 10, 2021

Thank you for your contribution. I understand that it improves correctness, but it decreases readability and maintainability.

Would it be possible to refactor the code to use an additional function call without a performance hit?

e.g.

def _mul(a, b):
    """Informative docstring here."""
    return <long long>a * <long long>b

row2 = _mul(word_point[b], size)

That way, readability and maintainability will not decrease as a result of your contribution.

@bluekura
Contributor Author

I think using an additional function would not impact performance (at least not much).

@bluekura
Contributor Author

bluekura commented May 10, 2021

However, since the function runs inside a nogil block, it should be declared as cdef (maybe with inline).

e.g.

cdef long long _mul(const np.uint32_t a, const int b) nogil:
    """Safe multiplication of ints with explicit typecasting."""
    return <long long>a * <long long>b

@gojomo
Collaborator

gojomo commented May 10, 2021

Thanks for the fix! I don't find the extra casts-for-correctness'-sake to hurt readability. (It might be sufficient to cast only one of the operands for the result to be correct, or to make the changes more compact with some other typedef of long long to a shorter name, but I'd also find those less clear.)

@bluekura
Contributor Author

bluekura commented May 10, 2021

I have updated the code accordingly (using an additional cdef function).
I found that the CircleCI build fails for fasttext, which I haven't modified; I can't tell why.
It does not impact performance (at least not much).

@gojomo
Collaborator

gojomo commented May 12, 2021

To go further than my previous comment: I think the indirection of a 1-line _mul() function makes the code less clear. (And, unless something causes them to be automatically or by-directive inlined, using a function will likely have a performance cost.)

@bluekura
Contributor Author

@gojomo I think there is not much performance loss, because modern compilers handle this well. Anyway, what is your suggestion? Roll back to the inline typecast?

@mpenkov
Collaborator

mpenkov commented May 13, 2021

If there is no performance cost, then I think the code is fine as is. The remaining point of contention (is the code clearer or not) is debatable, but there is little benefit in engaging in that debate.

@bluekura Thank you for your work. Congrats on your first contribution to the gensim repo! 🥇

@mpenkov mpenkov changed the title Fix issue #3100 (word-vectors outside train()-update tokens being changed). fix indexing error in word2vec_inner.pyx May 13, 2021
@mpenkov mpenkov merged commit ec95fe4 into piskvorky:develop May 13, 2021
@piskvorky piskvorky changed the title fix indexing error in word2vec_inner.pyx Fix indexing error in word2vec_inner.pyx May 13, 2021
@piskvorky
Owner

piskvorky commented May 13, 2021

The fix is great, but "if there is no performance cost" is a big "if".

Does it follow from "It does not impact the performance (at least much)" and "I think there is not many performance loss"?

The word2vec model (and its derivations) is crucial to Gensim, so I'd like to see more rigour here before merging. @bluekura do you have any numbers, can you share the benchmark setup? Thanks.

@mpenkov
Collaborator

mpenkov commented May 13, 2021

OK, should we roll back the merge while we work this thing out?

@piskvorky
Owner

piskvorky commented May 13, 2021

No need, if the perf numbers support the claim.

But if we find a serious regression (~more than a few percent), we'll have to revisit this, yes.

@bluekura
Contributor Author

bluekura commented May 13, 2021

Okay. Here is my test: 50K vocab, 100K sentences, 50 words per sentence, 5 epochs, 100-dimensional vectors, on a Threadripper 3990X with 8 workers + ATLAS.
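The exact benchmark script isn't included in the thread; a hypothetical reconstruction of the setup described above (the make_corpus name and the seed are mine, not from the thread) might look like:

```python
import random

def make_corpus(vocab_size=50_000, n_sentences=100_000,
                sentence_len=50, seed=0):
    """Random sentences of 'string number' tokens, per the setup above."""
    rng = random.Random(seed)
    return [[str(rng.randrange(vocab_size)) for _ in range(sentence_len)]
            for _ in range(n_sentences)]

# Timed with IPython's %timeit, e.g.:
#   corpus = make_corpus()
#   %timeit Word2Vec(corpus, vector_size=100, epochs=5, workers=8)
```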

Original Gensim 4.0.0 (distributed version): 24.8 s ± 403 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linux-5.4.0-72-generic-x86_64-with-glibc2.31
Python 3.9.2 (default, Mar  3 2021, 20:02:32) 
[GCC 7.3.0]
Bits 64
NumPy 1.20.2
SciPy 1.6.2
gensim 4.0.0
FAST_VERSION 0

Modified Gensim 4.1.0-dev (this version): 23.7 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2nd run of Modified Gensim 4.1.0-dev (this version): 24.1 s ± 427 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linux-5.4.0-72-generic-x86_64-with-glibc2.31
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) 
[GCC 9.3.0]
Bits 64
NumPy 1.20.2
SciPy 1.6.3
gensim 4.1.0.dev0
FAST_VERSION 0

Considering the measurement error, I think they are roughly equal in performance. Indeed, a typecast is a very cheap operation, especially compared with the BLAS operations. Yeah!

@gojomo
Collaborator

gojomo commented May 14, 2021

It's nice to see no obvious performance hit in a small trial (though I'm not sure this test is enough to be sure the function is inlined on all relevant architectures). It's not the casting that's a concern: that's both cheap and, in any case, required for correct behavior. It's the extra indirection of a function call, which could add overhead and is not required for correctness.

So, it strikes me as absurd to hide casting-for-correctness inside the indirection of a function call, for something as simple as multiplication. A person reading the code shouldn't have to think, "is this _mul() something special that I have to jump to to understand intent?" The casting inline, where it's needed, is clearest.

@bluekura
Contributor Author

bluekura commented May 15, 2021

Yes, I also agree. That is why my original code just cast inline (straightforward and simple). This is a matter of philosophy. I have demonstrated the problem and its cause, along with possible working solutions; I think it is now your turn (that is, the more engaged gensim collaborators').

@piskvorky
Owner

piskvorky commented May 15, 2021

Thank you @bluekura.

My preference is also for direct, easy-to-read code (although I don't feel about it as strongly as @gojomo in this instance).

I wonder whether Cython used to give us a compilation warning here (which we ignored), and whether more problems of similar nature still lurk in the Gensim code base.

@gojomo
Collaborator

gojomo commented Jul 22, 2021

It seems odd to me that the original author & 2-of-3 commenters preferred an inline fix at each multiplication, without the extra indirection of a tiny _mul() function - but the version of the fix with that indirection got merged.

@piskvorky
Owner

@gojomo: that was fixed by @mpenkov in #3143.
