Undo the hash2index optimization #2370
Conversation
Co-Authored-By: mpenkov <[email protected]>
@piskvorky When do you think would be the best time to release this? We can sneak it into 3.7.2 (the next bugfix release) or wait until the next minor release (3.8.0). Technically, this isn't really a bugfix, it's a feature improvement (more precisely, the removal of a bad feature), so it can wait. WDYT?
I'd consider it a bug fix, and release as soon as possible. But #2371 (NMF) is still blocking.
Correcting a deviation from the behavior of Facebook's reference FastText implementation definitely seems like a bug fix to me! And it would be good to include a release-note advisory about the changed behavior, now that gensim's FastText (post-fix) will always return a vector for OOV words (and never the "all ngrams for word X absent from model" error it was throwing previously).
Already done in the changelog on develop :) |
@mpenkov I see a note about …, but I don't see a note that all …
OK, please have a look at this commit and let me know if it is sufficient. |
Put an extensive note on the commit. |
This optimization reduced the number of ngram buckets to include only ngrams that we have seen during training.
This seemed like a good idea at the time, because it saved CPU cycles and RAM, but turned out to be a bad idea, because it introduced divergent behavior when compared to the reference implementation. For example:
We were unable to calculate vectors for terms that were completely out of vocabulary (i.e. the term itself and all of its ngrams were unseen during training). This is bad because the original FB implementation always returns a vector. Returning a vector for an unseen word may seem useless, since its ngram buckets are initialized randomly, but the randomness happens only once, at initialization time: querying the same OOV word afterwards always yields the same deterministic vector, so the result is in fact useful.
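A minimal sketch of why OOV vectors are deterministic (hypothetical names and toy sizes; `zlib.crc32` stands in for fastText's actual FNV-1a hash): the bucket matrix is random only at initialization, and every later lookup hashes the word's ngrams to the same buckets and averages the same rows.

```python
import zlib
import numpy as np

NUM_BUCKETS, DIM = 1000, 4   # toy sizes; real models use ~2M buckets
MIN_N, MAX_N = 3, 6          # fastText's default ngram length range

# Random at *initialization* time only.
rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(NUM_BUCKETS, DIM))

def ngrams(word):
    """Character ngrams of the word wrapped in boundary markers."""
    w = f"<{word}>"
    return [w[i:i + n]
            for n in range(MIN_N, MAX_N + 1)
            for i in range(len(w) - n + 1)]

def bucket_of(ngram):
    # zlib.crc32 stands in for fastText's FNV-1a hash; any stable
    # hash function illustrates the point.
    return zlib.crc32(ngram.encode()) % NUM_BUCKETS

def oov_vector(word):
    """Vector for a fully unseen word: mean of its ngram bucket vectors."""
    rows = [bucket_of(g) for g in ngrams(word)]
    return bucket_vectors[rows].mean(axis=0)

v1 = oov_vector("zzzunseen")
v2 = oov_vector("zzzunseen")
# Deterministic: the same query always returns the same vector.
```

Even though no training data was involved, repeated queries for the same OOV word agree with each other, which is what makes the returned vector usable.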
Another problem is that it complicated the implementation: we needed an additional layer of indirection that mapped ngram hashes to bucket indices. Without this optimization, the mapping is essentially the identity function: the hash N always maps to the Nth bucket.
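A sketch of the two lookup schemes (hypothetical names, not gensim's actual code) makes the extra indirection, and its failure mode, concrete:

```python
NUM_BUCKETS = 2_000_000  # fastText's default bucket count

def bucket_plain(ngram_hash: int) -> int:
    """Without the optimization: the identity function modulo the
    bucket count, so hash N maps straight to the Nth bucket."""
    return ngram_hash % NUM_BUCKETS

# With the hash2index optimization, an extra dict (built during
# training) maps only the hashes of *seen* ngrams to compacted indices.
hash2index = {12345 % NUM_BUCKETS: 0}  # toy: one ngram seen in training

def bucket_optimized(ngram_hash: int) -> int:
    # Raises KeyError for any ngram not seen during training, which
    # is why queries for fully-OOV words failed before this fix.
    return hash2index[ngram_hash % NUM_BUCKETS]
```

The plain scheme trades some RAM (vectors for never-seen buckets) for a simpler implementation and lookups that can never fail.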
This pull request removes the optimization, resolving the problems that it introduced.
Fixes #2329