
Fix indexing error in word2vec_inner.pyx #3136

Merged
merged 7 commits into piskvorky:develop on May 13, 2021

Conversation

bluekura
Contributor

@bluekura bluekura commented May 10, 2021

Fix #3100

Problem description
Main tests and discussions are in #3100
If vocab_size * vector_dim > 2^32, the indexing overflowed. In that case, vectors outside the scope of train() were also updated unexpectedly.

Solution
In the Cython code (word2vec_inner.pyx), the array index is calculated as vector_dimension (a Python int) multiplied by vocab_index (an np.uint32_t) and then assigned to a long long variable without an explicit typecast, so the product is computed in 32-bit arithmetic and can wrap around before the widening assignment. Cython needs an explicit cast on the operands to avoid this problem. The typecast should not noticeably impact performance.
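The wraparound is easy to reproduce in plain C, whose integer-promotion rules match what the generated Cython code does here. A minimal sketch (the helper names are illustrative, not from the patch):

```c
#include <stdint.h>

/* Buggy pattern: uint32_t * int is computed in 32-bit arithmetic and
   wraps mod 2^32 *before* the widening assignment to long long. */
long long buggy_index(uint32_t vocab_index, int vector_dim) {
    return vocab_index * vector_dim;
}

/* Fixed pattern (as in this PR): cast the operands first, so the
   product is computed in 64-bit arithmetic. */
long long safe_index(uint32_t vocab_index, int vector_dim) {
    return (long long)vocab_index * (long long)vector_dim;
}
```

For vocab_index = 50,000,000 and vector_dim = 100, the true product 5,000,000,000 exceeds 2^32, so the buggy version returns 705,032,704 (5e9 mod 2^32) and indexes the wrong rows of the vector array.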

(Add-on) To test the original issue with vectors_lockf, I set vectors_lockf for words [0:500000). It works correctly if and only if vectors_lockf is an array with dtype np.float32. Because the default dtype of a NumPy ndarray is np.float64 (on 64-bit systems), the following change would help. However, as @gojomo told me, the lock feature is experimental, so I decided not to update this part.

Line 479, word2vec_inner.pyx

c[0].words_lockf = <REAL_t *>(np.PyArray_DATA(model.wv.vectors_lockf))

->

c[0].words_lockf = <REAL_t *>(np.PyArray_DATA(model.wv.vectors_lockf.astype(np.float32)))
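Why the dtype matters can be shown with plain NumPy, independent of gensim (a minimal sketch): the C-level code reads the buffer through a REAL_t (float32) pointer, so handing it a float64 buffer reinterprets the raw bytes rather than converting the values.

```python
import numpy as np

# NumPy defaults to float64, but the C struct expects REAL_t (float32).
lockf64 = np.ones(4)
assert lockf64.dtype == np.float64

# Reinterpreting the float64 bytes as float32 (what a raw pointer cast
# effectively does) yields wrong values instead of 1.0.
reinterpreted = lockf64.view(np.float32)
assert not np.allclose(reinterpreted[:4], 1.0)

# Converting with astype() first preserves the values at 32-bit width.
lockf32 = lockf64.astype(np.float32)
assert lockf32.dtype == np.float32
assert np.allclose(lockf32, 1.0)
```

One caveat with the proposed one-liner: astype() returns a new array, so the pointer would reference a temporary copy and any lock updates would not land in the model's original vectors_lockf array.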

Test:
Train ~50,000,000 randomly generated words (string numbers), then train more with randomly generated numbers from [0:551000).

Original Gensim code
Dims: 100, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 100, [0: 551000): 473373 vectors changed
Dims: 100, [551000:vocab_max_number): 152651 vectors changed: Should not be updated

This code
Dims: 100, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 100, [0: 550000): 0 vectors changed
Dims: 100, [550000: 551000): 1000 vectors changed
Dims: 100, [551000:vocab_max_number): 0 vectors changed

Dims: 150, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 150, [0: 550000): 0 vectors changed
Dims: 150, [550000: 551000): 1000 vectors changed
Dims: 150, [551000:vocab_max_number): 0 vectors changed

Dims: 200, Vocab Size: 50000000, RealVocabSize: 49997735, OPENBLAS
Dims: 200, [0: 550000): 0 vectors changed
Dims: 200, [550000: 551000): 1000 vectors changed
Dims: 200, [551000:vocab_max_number): 0 vectors changed

@mpenkov
Collaborator

mpenkov commented May 10, 2021

Thank you for your contribution. I understand that it improves correctness, but it decreases readability and maintainability.

Would it be possible to refactor the code to use an additional function call without a performance hit?

e.g.

def _mul(a, b):
    """Informative docstring here."""
    return <long long>a * <long long>b

row2 = _mul(word_point[b], size)

That way, readability and maintainability will not decrease as a result of your contribution.

@bluekura
Contributor Author

I think using an additional function would not impact performance (at least not much).

@bluekura
Contributor Author

bluekura commented May 10, 2021

However, since the function runs inside a nogil block, it should be declared as cdef (maybe with inline).

e.g.

cdef long long _mul(const np.uint32_t a, const int b) nogil:
    """Safe multiplication of ints with explicit typecasting."""
    return <long long>a * <long long>b

@gojomo
Collaborator

gojomo commented May 10, 2021

Thanks for the fix! I don't find the extra casts-for-correctness'-sake to hurt readability. (It might be sufficient to cast only one of the operands for the result to be correct, or to make the changes more compact with some other typedef of long long to a shorter name, but I'd also find those less clear.)

@bluekura
Contributor Author

bluekura commented May 10, 2021

I have updated the code accordingly (using an additional cdef function).
I found that the CircleCI build fails for fasttext, which I haven't modified; I can't tell why.
It does not impact performance (at least not much).

@gojomo
Collaborator

gojomo commented May 12, 2021

To go further than my previous comment: I think the indirection of a 1-line _mul() function makes the code less clear. (And, unless something causes them to be automatically or by-directive inlined, using a function will likely have a performance cost.)

@bluekura
Contributor Author

@gojomo I think there is not much performance loss, because modern compilers handle this well. Anyway, what is your suggestion? Roll back to the inline typecast?

@mpenkov
Collaborator

mpenkov commented May 13, 2021

If there is no performance cost, then I think the code is fine as is. The remaining point of contention (is the code clearer or not) is debatable, but there is little benefit in engaging in that debate.

@bluekura Thank you for your work. Congrats on your first contribution to the gensim repo! 🥇

@mpenkov mpenkov changed the title Fix issue #3100 (word-vectors outside train()-update tokens being changed). fix indexing error in word2vec_inner.pyx May 13, 2021
@mpenkov mpenkov merged commit ec95fe4 into piskvorky:develop May 13, 2021
@piskvorky piskvorky changed the title fix indexing error in word2vec_inner.pyx Fix indexing error in word2vec_inner.pyx May 13, 2021
@piskvorky
Owner

piskvorky commented May 13, 2021

The fix is great, but "if there is no performance cost" is a big "if".

Does it follow from "It does not impact the performance (at least much)" and "I think there is not many performance loss"?

The word2vec model (and its derivations) is crucial to Gensim, so I'd like to see more rigour here before merging. @bluekura do you have any numbers, can you share the benchmark setup? Thanks.

@mpenkov
Collaborator

mpenkov commented May 13, 2021

OK, should we roll back the merge while we work this thing out?

@piskvorky
Owner

piskvorky commented May 13, 2021

No need, if the perf numbers support the claim.

But if we find a serious regression (~more than a few percent), we'll have to revisit this, yes.

@bluekura
Contributor Author

bluekura commented May 13, 2021

Okay. Here is my test: 50K vocab, 100K sentences, 50 words per sentence, 5 epochs, 100-dimensional vectors, on a Threadripper 3990X with 8 workers + ATLAS.
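The exact benchmark script isn't included in the thread; a hypothetical reconstruction of the setup described above (the make_corpus name and the seed are mine, not from the thread) might look like:

```python
import random

def make_corpus(vocab_size=50_000, n_sentences=100_000,
                sentence_len=50, seed=0):
    """Random sentences of 'string number' tokens, per the setup above."""
    rng = random.Random(seed)
    return [[str(rng.randrange(vocab_size)) for _ in range(sentence_len)]
            for _ in range(n_sentences)]

# Timed with IPython's %timeit, e.g.:
#   corpus = make_corpus()
#   %timeit Word2Vec(corpus, vector_size=100, epochs=5, workers=8)
```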

Original Gensim 4.0.0 (distributed version): 24.8 s ± 403 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linux-5.4.0-72-generic-x86_64-with-glibc2.31
Python 3.9.2 (default, Mar  3 2021, 20:02:32) 
[GCC 7.3.0]
Bits 64
NumPy 1.20.2
SciPy 1.6.2
gensim 4.0.0
FAST_VERSION 0

Modified Gensim 4.1.0-dev (this version): 23.7 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2nd run of Modified Gensim 4.1.0-dev (this version): 24.1 s ± 427 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Linux-5.4.0-72-generic-x86_64-with-glibc2.31
Python 3.9.2 | packaged by conda-forge | (default, Feb 21 2021, 05:02:46) 
[GCC 9.3.0]
Bits 64
NumPy 1.20.2
SciPy 1.6.3
gensim 4.1.0.dev0
FAST_VERSION 0

Considering the measurement error, I think they are roughly equal in performance. Indeed, a typecast is a very cheap operation, especially compared with the BLAS operations. Yeah!

@gojomo
Collaborator

gojomo commented May 14, 2021

It's nice to see no obvious performance hit in a small trial (though I'm not sure this test is enough to be sure the function is inlined on all relevant architectures). It's not the casting that's a concern: that's both cheap and, in any case, required for correct behavior. It's the extra indirection of a function call, which could add overhead and is not required for correctness.

So, it strikes me as absurd to hide casting-for-correctness inside the indirection of a function call, for something as simple as multiplication. A person reading the code shouldn't have to think, "is this _mul() something special that I have to jump to to understand intent?" The casting inline, where it's needed, is clearest.

@bluekura
Contributor Author

bluekura commented May 15, 2021

Yes, I also agree. That is why my original code just cast inline (straightforward and simple). This is a matter of philosophy. I have demonstrated the problem and its cause, along with possible working solutions; I think it is now your turn (that is, the more engaged gensim collaborators').

@piskvorky
Owner

piskvorky commented May 15, 2021

Thank you @bluekura.

My preference is also for direct, easy-to-read code (although I don't feel about it as strongly as @gojomo in this instance).

I wonder whether Cython used to give us a compilation warning here (which we ignored), and whether more problems of similar nature still lurk in the Gensim code base.

@gojomo
Collaborator

gojomo commented Jul 22, 2021

It seems odd to me that the original author & 2-of-3 commenters preferred an inline fix at each multiplication, without the extra indirection of a tiny _mul() function - but the version of the fix with that indirection got merged.

@piskvorky
Owner

@gojomo: that was fixed by @mpenkov in #3143.
