Undo the hash2index optimization #2370

mpenkov · 2019-02-03T07:40:01Z

The hash2index optimization reduced the number of ngram buckets so that they covered only the ngrams seen during training.

This seemed like a good idea at the time, because it saved CPU cycles and RAM, but turned out to be a bad idea, because it introduced divergent behavior when compared to the reference implementation. For example:

We were unable to calculate vectors for terms that were completely out of vocabulary (the term and all of its ngrams were unseen). This is bad because the original FB implementation always returns a vector. Such a vector may seem useless because it comes from randomly initialized buckets, but that's not entirely true: the randomness applies only at initialization time. By the time we query an ngram's vector, that vector is deterministic, so it is useful.
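To make the point about determinism concrete, here is a minimal, self-contained sketch of how a vector for a fully unseen term can be built from hashed character ngrams. It is a toy, not gensim's or Facebook's code: `NUM_BUCKETS`, `DIM`, `fnv1a`, `char_ngrams` and `oov_vector` are hypothetical names, and the hash is only a stand-in for whatever hash the real implementations use.

```python
import random

NUM_BUCKETS = 1000  # hypothetical; real models use far more buckets
DIM = 4             # hypothetical vector size


def fnv1a(data: bytes) -> int:
    """32-bit FNV-1a, a stand-in for the real ngram hash."""
    h = 2166136261
    for byte in data:
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF
    return h


def char_ngrams(word, min_n=3, max_n=6):
    """Character ngrams of the word wrapped in '<' and '>' markers."""
    padded = "<" + word + ">"
    return [
        padded[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(padded) - n + 1)
    ]


# Buckets are random only once, at initialization time.
rng = random.Random(0)
buckets = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_BUCKETS)]


def oov_vector(word):
    """Average the bucket rows of every ngram; the hash N picks bucket N."""
    rows = [buckets[fnv1a(ng.encode("utf-8")) % NUM_BUCKETS]
            for ng in char_ngrams(word)]
    if not rows:  # a word too short to yield any ngram
        return [0.0] * DIM
    return [sum(col) / len(rows) for col in zip(*rows)]
```

Because every bucket exists, querying the same unseen term twice yields the same vector, even though none of its ngrams appeared in training.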

Another problem is that it complicated the implementation. We needed an additional layer of indirection that mapped hashes to bucket indices. Without this optimization, the mapping is essentially the identity function: the hash N always maps to the Nth bucket.
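The difference between the two lookups can be sketched as follows. This is an illustration under assumptions, not the removed gensim code: `seen_hashes`, `lookup_compact` and `lookup_full` are hypothetical names, and the bucket count is tiny on purpose.

```python
NUM_BUCKETS = 10  # hypothetical bucket count

# With the optimization: only hashes seen in training get a row,
# so an extra dict is needed to translate a hash into a row index.
seen_hashes = [7, 2, 9]
hash2index = {h: i for i, h in enumerate(seen_hashes)}


def lookup_compact(ngram_hash):
    """Row index in the compacted matrix; raises KeyError if unseen."""
    return hash2index[ngram_hash % NUM_BUCKETS]


def lookup_full(ngram_hash):
    """Row index in the full matrix: the hash itself, modulo the bucket count."""
    return ngram_hash % NUM_BUCKETS
```

The compact variant stores three rows instead of ten, but it has no answer for a hash such as 5, whereas the full variant always returns a valid row.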

This pull request removes the optimization, resolving the problems that it introduced.

Fixes #2329

mpenkov · 2019-02-21T01:34:33Z

@piskvorky When do you think would be the best time to release this?

We can sneak it into 3.7.2 (the next bugfix release) or wait until the next minor release (3.8.0). Technically, this isn't really a bugfix, it's a feature improvement (more precisely, the removal of a bad feature), so it can wait.

WDYT?

piskvorky · 2019-02-21T10:08:52Z

I'd consider it a bug fix, and release it as soon as possible. But #2371 (NMF) is still blocking.

gojomo · 2019-03-07T15:41:24Z

Correcting a deviation from the behavior of Facebook's reference FastText implementation definitely seems like a bug fix to me!

And it would be good to include a release-note advisory about the changed behavior, now that gensim's FastText (post-fix) will always return a vector for OOV words (and never the "all ngrams for word X absent from model" error it was throwing previously).

mpenkov · 2019-03-07T22:30:17Z

Already done in the changelog on develop :)

gojomo · 2019-03-09T07:15:49Z

@mpenkov I see a note about `__contains__`, but it's a little confusing. Aren't `__contains__()` and the Python built-in `in` operator supposed to give identical results? That's actually specified by the Python 3.7 documentation, but the current text of the changelog suggests gensim will do the opposite: always return True for `__contains__()`, but sometimes return False for `in`. I see now that the difference is checking the `.wv.vocab` property; that should be emphasized more than a `__contains__` vs `in` distinction.

I don't see a note that all `[]` lookups will now always return a vector, and never raise the KeyError they sometimes did in gensim through 3.7.1 when no character n-grams were available. It would help to note this explicitly, and that this is being done to match the reference FastText behavior.
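A toy model of the distinction might look like the sketch below. `ToyVectors` is a hypothetical stand-in written for illustration, not gensim's class; it only mirrors the behavior described in this thread (membership always true, `[]` always returning a vector, `vocab` holding the trained words).

```python
class ToyVectors:
    """Hypothetical stand-in mirroring the post-fix behavior discussed here."""

    def __init__(self, vocab):
        self.vocab = dict(vocab)  # words seen during training

    def __contains__(self, word):
        # `word in obj` calls this method, so the two cannot disagree.
        return True

    def __getitem__(self, word):
        if word in self.vocab:
            return self.vocab[word]
        # Placeholder for a vector assembled from ngram buckets.
        return [0.0, 0.0]


wv = ToyVectors({"night": [0.1, 0.2]})
print("zzzz" in wv)        # membership on the vectors object
print("zzzz" in wv.vocab)  # membership on the trained vocabulary
print(wv["zzzz"])          # lookup never raises KeyError
```

The meaningful check for "was this word seen in training?" is the one against `vocab`, which is the point the changelog note needs to make.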

mpenkov · 2019-03-09T12:26:47Z

OK, please have a look at this commit and let me know if it is sufficient.

gojomo · 2019-03-09T22:41:23Z

Put an extensive note on the commit.

mpenkov added 5 commits February 3, 2019 18:32

WIP

e2750f1

flake8

ed52a5f

implement memory-efficient _unpack function

8fd9f81

fixup

1423298

move tests, add test_identity

0604de2

mpenkov requested a review from piskvorky February 6, 2019 23:22

piskvorky requested changes Feb 7, 2019

piskvorky and others added 2 commits February 7, 2019 21:04

Update gensim/models/keyedvectors.py

a52caba

Co-Authored-By: mpenkov <[email protected]>

review response: improve comment

009db28

piskvorky requested changes Feb 7, 2019

mpenkov added 2 commits February 13, 2019 13:34

review response: fix redundant log message

5a6eda0

review response: remove sphinx markup from log message

2625401

mpenkov added the 3.7.2 label Feb 21, 2019

mpenkov mentioned this pull request Feb 21, 2019

NMF metrics and wikipedia #2371

Merged

15 tasks

remove FIXME

1cb15fd

mpenkov merged commit b1850d9 into piskvorky:develop Mar 7, 2019

mpenkov deleted the rollback branch March 7, 2019 06:44

gojomo mentioned this pull request Mar 14, 2019

Inference issue using FB pretrained model if word have no ngrams #2415

Closed
