Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undo the hash2index optimization #2370

Merged
merged 10 commits into from
Mar 7, 2019
Merged

Undo the hash2index optimization #2370

merged 10 commits into from
Mar 7, 2019

Conversation

mpenkov
Copy link
Collaborator

@mpenkov mpenkov commented Feb 3, 2019

This optimization reduced the number of ngram buckets to include only ngrams that we have seen during training.

This seemed like a good idea at the time, because it saved CPU cycles and RAM, but turned out to be a bad idea, because it introduced divergent behavior when compared to the reference implementation. For example:

We were unable to calculate vectors for terms that were completely out of the vocab (so the term and all its ngrams were unseen). This is bad because the original FB implementation always returns a vector. It may seem useless because it's initialized to a random vector, but that's not entirely true, because that vector is random at initialization time. When we're querying the ngram's vector, the vector is deterministic, so it is useful.

Another problem is that it complicated the implementation. We now needed an additional layer of indirection that mapped hashes to bucket indices. Without this optimization, this mapping is essentially the identifiy function: the hash N always maps to the Nth bucket.

This pull request removes the optimization, resolving the problems that it introduced.

Fixes #2329

gensim/models/fasttext.py Show resolved Hide resolved
gensim/models/keyedvectors.py Outdated Show resolved Hide resolved
gensim/models/keyedvectors.py Show resolved Hide resolved
gensim/models/fasttext.py Outdated Show resolved Hide resolved
gensim/models/keyedvectors.py Outdated Show resolved Hide resolved
@mpenkov
Copy link
Collaborator Author

mpenkov commented Feb 21, 2019

@piskvorky When do you think would be the best time to release this?

We can sneak it into 3.7.2 (the next bugfix release) or wait until the next minor release (3.8.0). Technically, this isn't really a bugfix, it's a feature improvement (more precisely, the removal of a bad feature), so it can wait.

WDYT?

@piskvorky
Copy link
Owner

piskvorky commented Feb 21, 2019

I'd consider it a bug fix, and release as soon as possible. But #2371 NMF still blocking.

@mpenkov mpenkov added the 3.7.2 label Feb 21, 2019
@mpenkov mpenkov mentioned this pull request Feb 21, 2019
15 tasks
@mpenkov mpenkov merged commit b1850d9 into piskvorky:develop Mar 7, 2019
@mpenkov mpenkov deleted the rollback branch March 7, 2019 06:44
@gojomo
Copy link
Collaborator

gojomo commented Mar 7, 2019

Correcting a deviation from the behavior of Facebook's reference FastText implementation definitely seems like a bug fix to me!

And it would be good to include a release-note advisory about the changed behavior, now that gensim's FastText (post-fix) will always return a vector for OOV words (and never the "all ngrams for word X absent from model" error it was throwing previously).

@mpenkov
Copy link
Collaborator Author

mpenkov commented Mar 7, 2019

Already done in the changelog on develop :)

@gojomo
Copy link
Collaborator

gojomo commented Mar 9, 2019

@mpenkov I see a note about __contains__, but it's a little confusing. Isn't __contains__() and the Python built-in in operator supposed to give identical results? That's actually specified by the Python 3.7 documentation, but the current text of the changelog suggests gensim will do the opposite: always return True for __contains__(), but sometimes return False for in. I see now that the difference is checking the .wv.vocab property; that should be emphasized moreso than a __contains__ vs in approach.

I don't see a note that all [] lookups will now always return a vector, and never return the KeyError they sometimes did, for no available character n-grams, in gensim through 3.7.1. It would help to note this explicitly, and that this is being done to match the reference FastText behavior.

@mpenkov
Copy link
Collaborator Author

mpenkov commented Mar 9, 2019

OK, please have a look at this commit and let me know if it is sufficient.

@gojomo
Copy link
Collaborator

gojomo commented Mar 9, 2019

Put an extensive note on the commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants