Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyedVectors refactor for word2vec #833

Closed
wants to merge 27 commits into from
Closed

Conversation

droudy
Copy link
Contributor

@droudy droudy commented Aug 19, 2016

Addresses #549. Refactored syn0, syn0norm, vocab, and index2word into their own class, as well as methods not involved in training that require vectors such as most_similar(). Maintains backwards compatibility so that calls such as trained_model.most_similar("the") and trained_model.syn0 don't break.

If you are done with training, you can retrieve the keyedvectors of your trained model and save them to disk like so:

 
# kv --> KeyedVectors

retrieved_kv = trained_model.kv 
retrieved_kv.save("/dir/saved_vecs") 
# Later on

loaded_kv = KeyedVectors.load("/dir/saved_vecs")
loaded_kv.most_similar("the")

@gojomo please review
The changes to utils.SaveLoad in particular are a bit strange due to the need to maintain backwards compatibility, I can elaborate on them if the comments aren't comprehensive enough.

@gojomo
Copy link
Collaborator

gojomo commented Aug 19, 2016

This separate-file, take-as-much-generic-functionality-as-possible approach is what I think is the right approach.

As before, I'd expect a role-based abbreviation (wv) to be clearer than a type-based abbreviation (kv). There's the potential this class could also replace the DocvecsArray in Doc2Vec, in which case a model will have two KeyedVector properties. There's even the potential this could replace the syn1 layer (perhaps moreso in negative-sampling) and offer 'OUT' vector access-by-word-key. (See the 'Dual Embedding Space Model' papers from Mitra et al at MSFT for why that might be interesting.)

Whether KeyedVectors should get the full Vocab functionality, including frequencies and (in hs mode) huffman-coding-info), still seems a messy question to me. Not all KV users will have or need such info; on the other hand it makes sense to keep it close to the key-to-index lookup, for those activities that do need it. I wonder if KV should have an expandable property-set, perhaps as a dict of single-type arrays that are int-indexed in parallel with the vectors. (And if so, syn0_lockf-like per-slot values could also move into KV.) Unsure of right approach – just see this as a current area of unclean-separation-of-roles.

The use of __getattr__ patching to maintain property-access backward-compatibility makes me uncomfortable; it makes functionality more mysterious/magical, and adds confusing failure modes. I wouldn't say for sure that such direct-property access needs to be held compatible or should be considered part of the public API. (Perhaps code making such direct accesses should update itself when upgrading to a new gensim. Though alternatively, if we do consider such properties to be part of the API, and we strictly adhere to 'semantic versioning', such an API change would necessitate a major-version increment.)

The main backward-compatibility I'd be concerned about would be the ability to load older saved models. (But. I wouldn't expect newer models, or older models once loaded, to be savable in a format that older code could load.) It'd be good to have a unit test of that, perhaps by bundling tiny saved-models in the test data. (Even though tiny models don't usually save their arrays as separate files, we'd want to test that mode, too.)

I don't see the corresponding updates to the cython files, but maybe that's because the __getattr__ patching is keeping it working?

@piskvorky
Copy link
Owner

Thanks! Like @gojomo says, we have to be careful about maintaining backward compatibility, so we'll need tests for all appropriate combinations of {load/save, old/new model, single file/multiple mmap files}.

Re. testing on smaller models -- there's a threshold that says how large internal arrays need to be before they're saved as separate, mmap-able files. This threshold is configurable: model.save(sep_limit=0) (default sep_limit is 10MB).

# set initial input/projection and hidden weights
self.reset_weights()

def sort_vocab(self):
"""Sort the vocabulary so the most frequent words have the lowest indexes."""
if hasattr(self, 'syn0'):
if self.kv.syn0:
Copy link
Contributor

@jayantj jayantj Aug 22, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure this'll work as intended since right now, self.syn0 is initialized in reset_weights, which is called after sort_vocab. With kv being initialized in __init__, wouldn't this always raise an exception?
Possibly have a reset_weights equivalent in KeyedVectors? Or a check on self.syn0_lockf?

Copy link
Contributor Author

@droudy droudy Aug 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayantj self.kv.syn0 gets initialized in __init__ but isn't populated, so if the vectors/weights have not been initialized it will remain an empty list and an empty list as a conditional is False so the exception will not be raised

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, hadn't thought of that, sorry

# hasattr(self, "syn0") will return True but delattr(self, "syn0") will fail because it's
# not a true attribute of Word2Vec, Word2Vec.__getattr__ just reroutes it to KeyedVectors' attributes
if attrib in ["syn0", "syn0norm", "vocab", "index2word"]:
if type(self) == "Word2Vec": # if saving a Word2Vec model, syn0, etc. are stored in self.kv
Copy link
Contributor

@jayantj jayantj Sep 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently doesn't work as intended because type(self) returns a class, not a string. Ends up saving syn0norm to the model file, increasing file size.
But even if changed to type(self).__name__, this would call delete_attribute with the Word2Vec object and syn0/syn0norm, and so when properties are being restored, syn0/syn0norm would be restored to the word2vec object instead of the KeyedVecs object.

The correct way to handle it seems via the recursive_saveloads that is already implemented, but we'll still need a special condition check to ensure delete_attribute isn't called on the word2vec object.

All this need for special condition checks is resulting from the __getattr__ patching to handle syn0, vocab and syn0norm backwards compatibility.

Even if we decide not to maintain syn0 and syn0norm backwards compatibility, vocab is definitely part of the API, and there doesn't seem to be any way to avoid special patching.

Copy link
Contributor

@jayantj jayantj Sep 8, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or we could use property descriptors and add setters and deleters for syn0norm, vocab etc, and perform those operations on self.kv. Wouldn't require any special patching anywhere, and we could add warnings that direct access to syn0/syn0norm would be removed in a future release.

@jayantj jayantj mentioned this pull request Sep 9, 2016
@gojomo
Copy link
Collaborator

gojomo commented Sep 9, 2016

I'd said before that use of __getattr__ patching to maintain property-access backward-compatibility made me uncomfortable; I would like to upgrade that to "I think it's a really bad idea". Similarly, the special-casing of things by class/attribute name inside utils.py for save/load seems very fragile/delocalized.

Overall this is a big enough change that maintaining full API compatibility, down to direct-access of properties, adds complexity and risk compared to consciously breaking things. As a result, I'd suggest this either (1) hold to happen on a major-version rev where it's expected we might say, "your code must be updated"; or (2) happen in a new parallel class (like NewWord2Vec or future.Word2Vec. In case (2), pre- and post-refactor code can be tested and verified as behaving identically within the same build, and advanced users can opt to use to the new API knowing some code tweaks will be necessary. Later, when there's a major-version rev, the old class can swapped out for the new class, forcing users' code to adjust only with that major-version rev.

In any case, the main form of backward-compatibility to be maintained would be the ability to load/convert older saved models into the new format - ideally as a specific named function with all special-casing local, rather than as a series of conditionals/attribute-interceptions spread across many places.

@jayantj
Copy link
Contributor

jayantj commented Sep 9, 2016

I've handled the special patching inside utils.py and __getattr__ inside an overriden _load_specials in word2vec, and with the use of property descriptors for syn0, syn0norm, vocab and index2word.

This way, backwards compatibility can be maintained for now without having a series of confusing conditionals distributed in the codebase. Also, it allows us to add warnings in case of direct syn0/syn0norm lookup (along the lines of, "syn0 access will no longer be supported in a future gensim release, please use model.kv.syn0 instead")

All these changes are in #852
@droudy if you can give me access to this branch, I can push all my code to this PR and keep all code and discussion here

@droudy
Copy link
Contributor Author

droudy commented Sep 10, 2016

@jayantj I sent an invite to collaborate so you can push to it

@jayantj
Copy link
Contributor

jayantj commented Sep 12, 2016

Thanks @droudy, pushed.
Also, the build seems to be failing due to #853

@tmylk tmylk added feature Issue described a new feature difficulty hard Hard issue: required deep gensim understanding & high python/cython skills and removed feature Issue described a new feature labels Oct 4, 2016
@tmylk
Copy link
Contributor

tmylk commented Oct 4, 2016

@anmol01gulati has kindly expressed interest in finishing this

@anmolgulati
Copy link
Contributor

anmolgulati commented Oct 25, 2016

@jayantj I worked on this. So, your tests fail as there is a break issue between Python 3.4 and 3.5 as well. See here
So, you would want to add another model file explicitly for Python 3.4 and 3.5 and it would work.
If you'd give me access to push to this branch, I could add it myself as well.

@jayantj
Copy link
Contributor

jayantj commented Oct 25, 2016

@anmol01gulati Thanks a lot for looking into it. That sounds right, should work. The test also seems to be failing for Python 2.7 though.

Also this PR is from @droudy's fork, so I can't give you push access. @droudy could you please grant @anmol01gulati access?

Regarding the break between Python 3.4 and 3.5, a fairly clean workaround for the pickle bug is doable, even though it isn't a bug with gensim. @tmylk do you think it'd be a good idea to have a fix for it, or is this too obscure?

@tmylk
Copy link
Contributor

tmylk commented Oct 25, 2016

@anmol01gulati The 2.7 tests fail because of an unrelated known(though not investigated) wikitest glitch.

Also feel free to create a new PR with this code as @droudy is busy till mid-november.

It would be good to add the Python 3.4 fix inside the if statement to make sure it only affects the Python 3.4 version.

@tmylk
Copy link
Contributor

tmylk commented Dec 22, 2016

Merged in #980

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
difficulty hard Hard issue: required deep gensim understanding & high python/cython skills
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants