KeyedVectors refactor for word2vec #833

droudy · 2016-08-19T19:05:57Z

Addresses #549. Refactored syn0, syn0norm, vocab, and index2word into their own class, as well as methods not involved in training that require vectors such as most_similar(). Maintains backwards compatibility so that calls such as trained_model.most_similar("the") and trained_model.syn0 don't break.

If you are done with training, you can retrieve the keyedvectors of your trained model and save them to disk like so:

 
# kv --> KeyedVectors
retrieved_kv = trained_model.kv
retrieved_kv.save("/dir/saved_vecs")

# Later on
loaded_kv = KeyedVectors.load("/dir/saved_vecs")
loaded_kv.most_similar("the")

@gojomo please review
The changes to utils.SaveLoad in particular are a bit strange because of the need to maintain backwards compatibility. I can elaborate on them if the comments aren't comprehensive enough.

gojomo · 2016-08-19T21:42:45Z

I think this separate-file, take-as-much-generic-functionality-as-possible approach is the right one.

As before, I'd expect a role-based abbreviation (wv) to be clearer than a type-based abbreviation (kv). There's the potential this class could also replace the DocvecsArray in Doc2Vec, in which case a model will have two KeyedVectors properties. There's even the potential this could replace the syn1 layer (perhaps more so in negative-sampling) and offer 'OUT' vector access-by-word-key. (See the 'Dual Embedding Space Model' papers from Mitra et al. at MSFT for why that might be interesting.)

Whether KeyedVectors should get the full Vocab functionality, including frequencies and (in hs mode) Huffman-coding info, still seems a messy question to me. Not all KV users will have or need such info; on the other hand it makes sense to keep it close to the key-to-index lookup, for those activities that do need it. I wonder if KV should have an expandable property-set, perhaps as a dict of single-type arrays that are int-indexed in parallel with the vectors. (And if so, syn0_lockf-like per-slot values could also move into KV.) Unsure of the right approach; I just see this as a current area of unclean separation of roles.

The use of __getattr__ patching to maintain property-access backward-compatibility makes me uncomfortable; it makes functionality more mysterious/magical, and adds confusing failure modes. I wouldn't say for sure that such direct-property access needs to be held compatible or should be considered part of the public API. (Perhaps code making such direct accesses should update itself when upgrading to a new gensim. Though alternatively, if we do consider such properties to be part of the API, and we strictly adhere to 'semantic versioning', such an API change would necessitate a major-version increment.)
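To make the "confusing failure modes" concrete, here is a minimal sketch of attribute forwarding through `__getattr__`. The `Vectors` and `Model` classes are hypothetical stand-ins written for this illustration, not gensim code; the point is only that reads are forwarded while deletes and writes are not.

```python
class Vectors:
    """Hypothetical stand-in for the vectors container."""

    def __init__(self):
        self.syn0 = [[0.1, 0.2], [0.3, 0.4]]


class Model:
    """Hypothetical stand-in for a model that forwards old attribute names."""

    _FORWARDED = ("syn0",)

    def __init__(self):
        self.kv = Vectors()

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name in self._FORWARDED:
            return getattr(self.kv, name)
        raise AttributeError(name)


model = Model()

# Reading is forwarded, so hasattr reports the attribute as present.
looks_present = hasattr(model, "syn0")

# Deleting is not forwarded: syn0 is not in model.__dict__.
try:
    del model.syn0
    delete_failed = False
except AttributeError:
    delete_failed = True

# Assigning is not forwarded either: it creates a shadow attribute on the model.
model.syn0 = []
shadowed = model.syn0 is not model.kv.syn0
```

After this runs, `looks_present`, `delete_failed`, and `shadowed` are all true, and `model.kv.syn0` still holds the original vectors, which is the kind of silent divergence the comment above warns about.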

The main backward-compatibility I'd be concerned about would be the ability to load older saved models. (But I wouldn't expect newer models, or older models once loaded, to be savable in a format that older code could load.) It'd be good to have a unit test of that, perhaps by bundling tiny saved-models in the test data. (Even though tiny models don't usually save their arrays as separate files, we'd want to test that mode, too.)

I don't see the corresponding updates to the cython files, but maybe that's because the __getattr__ patching is keeping it working?

piskvorky · 2016-08-21T03:05:02Z

Thanks! Like @gojomo says, we have to be careful about maintaining backward compatibility, so we'll need tests for all appropriate combinations of {load/save, old/new model, single file/multiple mmap files}.

Re. testing on smaller models -- there's a threshold that says how large internal arrays need to be before they're saved as separate, mmap-able files. This threshold is configurable: model.save(sep_limit=0) (default sep_limit is 10MB).

jayantj · 2016-08-22T22:31:24Z

gensim/models/word2vec.py

        # set initial input/projection and hidden weights
        self.reset_weights()

    def sort_vocab(self):
        """Sort the vocabulary so the most frequent words have the lowest indexes."""
-        if hasattr(self, 'syn0'):
+        if self.kv.syn0:


I'm not completely sure this'll work as intended since right now, self.syn0 is initialized in reset_weights, which is called after sort_vocab. With kv being initialized in __init__, wouldn't this always raise an exception?
Possibly have a reset_weights equivalent in KeyedVectors? Or a check on self.syn0_lockf?

droudy · 2016-08-23

@jayantj self.kv.syn0 is initialized in __init__ but isn't populated, so if the vectors/weights have not been initialized it remains an empty list. An empty list evaluates to False in a conditional, so no exception is raised.

jayantj · 2016-08-23

Ah, yes, hadn't thought of that, sorry.
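The difference between the old `hasattr` check and the new truthiness check can be sketched as follows. `Vectors` is a hypothetical stand-in, and the sketch assumes, as described in the reply above, that syn0 starts out as an empty list.

```python
class Vectors:
    """Hypothetical stand-in: syn0 exists from construction but starts empty."""

    def __init__(self):
        self.syn0 = []


kv = Vectors()

# The attribute always exists, so hasattr can no longer signal "weights initialized".
attribute_exists = hasattr(kv, "syn0")

# An empty list is falsy, so the truthiness check is False before initialization.
weights_present_before = bool(kv.syn0)

# An explicit length check expresses the same intent.
length_check_before = len(kv.syn0) > 0

kv.syn0 = [[0.0, 0.0]]
weights_present_after = bool(kv.syn0)
length_check_after = len(kv.syn0) > 0
```

One caveat: this only holds while syn0 is a plain list. Once it is replaced by a NumPy array with more than one element, `if kv.syn0:` raises a ValueError, so a `len(...)` check is the safer form.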

jayantj · 2016-09-08T14:25:32Z

gensim/utils.py

+            # hasattr(self, "syn0") will return True but delattr(self, "syn0") will fail because it's
+            # not a true attribute of Word2Vec, Word2Vec.__getattr__ just reroutes it to KeyedVectors' attributes
+            if attrib in ["syn0", "syn0norm", "vocab", "index2word"]:
+                if type(self) == "Word2Vec":  # if saving a Word2Vec model, syn0, etc. are stored in self.kv


This currently doesn't work as intended because type(self) returns a class, not a string, so the comparison is always False. As a result, syn0norm ends up being saved to the model file, increasing file size.
But even if changed to type(self).__name__, this would call delete_attribute with the Word2Vec object and syn0/syn0norm, and so when properties are being restored, syn0/syn0norm would be restored to the word2vec object instead of the KeyedVecs object.
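The comparison problem is easy to reproduce in isolation. In this sketch `Word2Vec` is an empty hypothetical stand-in class, used only for its name.

```python
class Word2Vec:
    """Hypothetical empty stand-in; only the class name matters here."""


model = Word2Vec()

# A class object never equals a string, so this is always False.
compares_to_string = type(model) == "Word2Vec"

# Comparing the class name works, but is fragile under renames.
compares_by_name = type(model).__name__ == "Word2Vec"

# isinstance is the usual idiom and also covers subclasses.
compares_by_type = isinstance(model, Word2Vec)
```

Note that fixing the comparison alone does not address the second point above, namely which object the attributes are restored to.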

The correct way to handle it seems via the recursive_saveloads that is already implemented, but we'll still need a special condition check to ensure delete_attribute isn't called on the word2vec object.

All this need for special condition checks is resulting from the __getattr__ patching to handle syn0, vocab and syn0norm backwards compatibility.

Even if we decide not to maintain syn0 and syn0norm backwards compatibility, vocab is definitely part of the API, and there doesn't seem to be any way to avoid special patching.

Or we could use property descriptors and add setters and deleters for syn0norm, vocab etc, and perform those operations on self.kv. Wouldn't require any special patching anywhere, and we could add warnings that direct access to syn0/syn0norm would be removed in a future release.
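A minimal sketch of that property-based approach is shown below. The two classes are hypothetical stand-ins rather than the code in this PR, only syn0norm is shown, and the warning text is invented for illustration.

```python
import warnings


class KeyedVectors:
    """Hypothetical stand-in for the vectors container."""

    def __init__(self):
        self.syn0 = []
        self.syn0norm = None


class Word2Vec:
    """Hypothetical stand-in that delegates syn0norm to self.kv."""

    def __init__(self):
        self.kv = KeyedVectors()

    @property
    def syn0norm(self):
        # Illustrative message; not the wording used by gensim.
        warnings.warn(
            "direct syn0norm access is deprecated; use model.kv.syn0norm",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.kv.syn0norm

    @syn0norm.setter
    def syn0norm(self, value):
        self.kv.syn0norm = value

    @syn0norm.deleter
    def syn0norm(self):
        del self.kv.syn0norm


model = Word2Vec()

# Assignment through the model lands on the kv object, not on the model.
model.syn0norm = [1.0, 0.0]
stored_on_kv = model.kv.syn0norm
not_on_model = "syn0norm" not in vars(model)

# Reading through the model works and emits a deprecation warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = model.syn0norm

# Deletion is forwarded too, unlike with __getattr__ interception.
del model.syn0norm
deleted_on_kv = not hasattr(model.kv, "syn0norm")
```

Because get, set, and delete all resolve to the same object, generic save/load code that deletes and restores attributes needs no class-name checks.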

gojomo · 2016-09-09T17:31:08Z

I'd said before that use of __getattr__ patching to maintain property-access backward-compatibility made me uncomfortable; I would like to upgrade that to "I think it's a really bad idea". Similarly, the special-casing of things by class/attribute name inside utils.py for save/load seems very fragile/delocalized.

Overall this is a big enough change that maintaining full API compatibility, down to direct-access of properties, adds complexity and risk compared to consciously breaking things. As a result, I'd suggest this either (1) hold to happen on a major-version rev where it's expected we might say, "your code must be updated"; or (2) happen in a new parallel class (like NewWord2Vec or future.Word2Vec). In case (2), pre- and post-refactor code can be tested and verified as behaving identically within the same build, and advanced users can opt to use the new API knowing some code tweaks will be necessary. Later, when there's a major-version rev, the old class can be swapped out for the new class, forcing users' code to adjust only with that major-version rev.

In any case, the main form of backward-compatibility to be maintained would be the ability to load/convert older saved models into the new format - ideally as a specific named function with all special-casing local, rather than as a series of conditionals/attribute-interceptions spread across many places.
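As a rough illustration of keeping the special-casing local, the conversion could live in one named function. The function name, the dict-based state, and the `kv` key below are all assumptions made for this sketch; a real loader would operate on unpickled objects and array files instead.

```python
# Attribute names that move from the model into the vectors object.
LEGACY_VECTOR_ATTRS = ("syn0", "syn0norm", "vocab", "index2word")


def upgrade_legacy_state(state):
    """Return a copy of an old-style attribute dict in the nested layout.

    Hypothetical helper: `state` is a plain dict standing in for a loaded
    model's attributes.
    """
    if "kv" in state:
        # Already in the new layout; nothing to convert.
        return dict(state)
    upgraded = {k: v for k, v in state.items() if k not in LEGACY_VECTOR_ATTRS}
    upgraded["kv"] = {k: state[k] for k in LEGACY_VECTOR_ATTRS if k in state}
    return upgraded


old_state = {
    "syn0": [[0.1, 0.2]],
    "vocab": {"the": 0},
    "index2word": ["the"],
    "window": 5,
}
new_state = upgrade_legacy_state(old_state)
```

The input dict is left untouched and converting an already-upgraded state is a no-op, so the function can be called unconditionally at load time.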

jayantj · 2016-09-09T17:39:36Z

I've moved the special-case patching that was inside utils.py and __getattr__ into an overridden _load_specials in word2vec, combined with property descriptors for syn0, syn0norm, vocab and index2word.

This way, backwards compatibility can be maintained for now without having a series of confusing conditionals distributed across the codebase. It also allows us to add warnings on direct syn0/syn0norm lookup (along the lines of "syn0 access will no longer be supported in a future gensim release, please use model.kv.syn0 instead").

All these changes are in #852
@droudy if you can give me access to this branch, I can push all my code to this PR and keep all code and discussion here

droudy · 2016-09-10T00:38:20Z

@jayantj I sent an invite to collaborate so you can push to it

jayantj · 2016-09-12T05:52:47Z

Thanks @droudy, pushed.
Also, the build seems to be failing due to #853

tmylk · 2016-10-04T14:03:55Z

@anmol01gulati has kindly expressed interest in finishing this

anmolgulati · 2016-10-25T09:33:53Z

@jayantj I worked on this. Your tests fail because there is also a compatibility break between Python 3.4 and 3.5. See here
If you add separate saved-model files explicitly for Python 3.4 and 3.5, it should work.
If you'd give me access to push to this branch, I could add it myself as well.

jayantj · 2016-10-25T09:54:26Z

@anmol01gulati Thanks a lot for looking into it. That sounds right, should work. The test also seems to be failing for Python 2.7 though.

Also this PR is from @droudy's fork, so I can't give you push access. @droudy could you please grant @anmol01gulati access?

Regarding the break between Python 3.4 and 3.5, a fairly clean workaround for the pickle bug is doable, even though it isn't a bug with gensim. @tmylk do you think it'd be a good idea to have a fix for it, or is this too obscure?

tmylk · 2016-10-25T13:42:22Z

@anmol01gulati The 2.7 tests fail because of an unrelated, known (though not investigated) wikitest glitch.

Also feel free to create a new PR with this code, as @droudy is busy until mid-November.

It would be good to add the Python 3.4 fix inside the if statement to make sure it only affects the Python 3.4 version.

tmylk · 2016-12-22T01:44:04Z

Merged in #980

droudy mentioned this pull request Aug 19, 2016

NamedVectors refactor for word2vec #819

Closed

piskvorky assigned tmylk Aug 21, 2016

jayantj reviewed Aug 22, 2016

jayantj mentioned this pull request Sep 1, 2016

[MRG] Wrapper for FastText #847

Merged

jayantj reviewed Sep 8, 2016

jayantj mentioned this pull request Sep 9, 2016

Keyedvecs updates #852

Closed

jayantj force-pushed the keyedvecs branch from 5f98565 to 787cc98 Compare September 12, 2016 13:19

droudy and others added 13 commits September 12, 2016 19:06

updated refactor

55a4fc9

commit missed file

e916f7e

docstring added

e5416ed

more refactoring

e64766b

add missing docstring

c34cf37

fix docstring format

c9b31f9

clearer docstring

a0329af

minor typo in word2vec wmdistance

0c0e2fa

pyemd error in keyedvecs

cdefeb0

relative import of keyedvecs from word2vec fails

1aec5a2

bug in init_sims in word2vec

e7368a3

property descriptors for syn0, syn0norm, index2word, vocab - fixes bug in saving

fe283c2

tests for loading older word2vec models

9b36bc4

jayantj added 12 commits September 12, 2016 19:06

backwards compatibility for loading older models

dfe1893

test for syn0norm not saved to file

4a03f20

syn0norm not saved to file for KeyedVectors

09b6ebe

tests and fix for accuracy

7df4138

minor bug in finalized vocab check

4c54d9b

warnings for direct syn0/syn0norm access

a28f9f1

fixes use of most_similar in accuracy

bf1182e

changes logging level to ERROR in word2vec tests

5a6b97b

renames kv to wv in word2vec

cfb2e1c

minor bugs with checking existence of syn0

b002765

replaces syn0 and syn0norm with wv.syn0 and wv.syn0norm in tests and cython files

27c0a14

adds changelog

81f8cbb

jayantj force-pushed the keyedvecs branch from 30a0031 to 81f8cbb Compare September 12, 2016 13:37

tmylk added the feature and difficulty hard labels and removed the feature label Oct 4, 2016

jayantj force-pushed the keyedvecs branch from 3308a93 to 81f8cbb Compare October 16, 2016 03:31

jayantj added 2 commits October 16, 2016 10:40

Merge branch 'develop' into keyedvecs

7f98c8d

Conflicts: CHANGELOG.md gensim/models/word2vec.py

updates tests for loading word2vec models for different python versions

1b282ab

jayantj force-pushed the keyedvecs branch from 2b3f1f6 to 1b282ab Compare October 16, 2016 08:53

anmolgulati mentioned this pull request Oct 26, 2016

KeyedVecs refactoring for word2vec #980

Merged

tmylk closed this Dec 22, 2016

gojomo mentioned this pull request Jan 4, 2017

Save and load methods generate KeyedVector warnings word2vec and doc2vec #1069

Closed
