-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KeyedVectors refactor for word2vec #833
Conversation
This separate-file, take-as-much-generic-functionality-as-possible approach is what I think is the right approach. As before, I'd expect a role-based abbreviation ( Whether KeyedVectors should get the full Vocab functionality, including frequencies and (in The use of The main backward-compatibility I'd be concerned about would be the ability to load older saved models. (But. I wouldn't expect newer models, or older models once loaded, to be savable in a format that older code could load.) It'd be good to have a unit test of that, perhaps by bundling tiny saved-models in the test data. (Even though tiny models don't usually save their arrays as separate files, we'd want to test that mode, too.) I don't see the corresponding updates to the cython files, but maybe that's because the |
Thanks! Like @gojomo says, we have to be careful about maintaining backward compatibility, so we'll need tests for all appropriate combinations of {load/save, old/new model, single file/multiple mmap files}. Re. testing on smaller models -- there's a threshold that says how large internal arrays need to be before they're saved as separate, mmap-able files. This threshold is configurable: |
# set initial input/projection and hidden weights | ||
self.reset_weights() | ||
|
||
def sort_vocab(self): | ||
"""Sort the vocabulary so the most frequent words have the lowest indexes.""" | ||
if hasattr(self, 'syn0'): | ||
if self.kv.syn0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not completely sure this'll work as intended since right now, self.syn0
is initialized in reset_weights
, which is called after sort_vocab
. With kv
being initialized in __init__
, wouldn't this always raise an exception?
Possibly have a reset_weights
equivalent in KeyedVectors
? Or a check on self.syn0_lockf
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayantj self.kv.syn0
gets initialized in __init__
but isn't populated, so if the vectors/weights have not been initialized it will remain an empty list and an empty list as a conditional is False
so the exception will not be raised
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, hadn't thought of that, sorry
# hasattr(self, "syn0") will return True but delattr(self, "syn0") will fail because it's | ||
# not a true attribute of Word2Vec, Word2Vec.__getattr__ just reroutes it to KeyedVectors' attributes | ||
if attrib in ["syn0", "syn0norm", "vocab", "index2word"]: | ||
if type(self) == "Word2Vec": # if saving a Word2Vec model, syn0, etc. are stored in self.kv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This currently doesn't work as intended because type(self)
returns a class, not a string. Ends up saving syn0norm
to the model file, increasing file size.
But even if changed to type(self).__name__
, this would call delete_attribute
with the Word2Vec object and syn0/syn0norm, and so when properties are being restored, syn0/syn0norm would be restored to the word2vec object instead of the KeyedVecs object.
The correct way to handle it seems via the recursive_saveloads
that is already implemented, but we'll still need a special condition check to ensure delete_attribute
isn't called on the word2vec object.
All this need for special condition checks is resulting from the __getattr__
patching to handle syn0, vocab and syn0norm backwards compatibility.
Even if we decide not to maintain syn0 and syn0norm backwards compatibility, vocab is definitely part of the API, and there doesn't seem to be any way to avoid special patching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or we could use property descriptors and add setters and deleters for syn0norm, vocab etc, and perform those operations on self.kv
. Wouldn't require any special patching anywhere, and we could add warnings that direct access to syn0/syn0norm would be removed in a future release.
I'd said before that use of Overall this is a big enough change that maintaining full API compatibility, down to direct-access of properties, adds complexity and risk compared to consciously breaking things. As a result, I'd suggest this either (1) hold to happen on a major-version rev where it's expected we might say, "your code must be updated"; or (2) happen in a new parallel class (like In any case, the main form of backward-compatibility to be maintained would be the ability to load/convert older saved models into the new format - ideally as a specific named function with all special-casing local, rather than as a series of conditionals/attribute-interceptions spread across many places. |
I've handled the special patching inside This way, backwards compatibility can be maintained for now without having a series of confusing conditionals distributed in the codebase. Also, it allows us to add warnings in case of direct syn0/syn0norm lookup (along the lines of, "syn0 access will no longer be supported in a future gensim release, please use model.kv.syn0 instead") All these changes are in #852 |
@jayantj I sent an invite to collaborate so you can push to it |
@anmol01gulati has kindly expressed interest in finishing this |
Conflicts: CHANGELOG.md gensim/models/word2vec.py
@anmol01gulati Thanks a lot for looking into it. That sounds right, should work. The test also seems to be failing for Python 2.7 though. Also this PR is from @droudy's fork, so I can't give you push access. @droudy could you please grant @anmol01gulati access? Regarding the break between Python 3.4 and 3.5, a fairly clean workaround for the pickle bug is doable, even though it isn't a bug with gensim. @tmylk do you think it'd be a good idea to have a fix for it, or is this too obscure? |
@anmol01gulati The 2.7 tests fail because of an unrelated known(though not investigated) wikitest glitch. Also feel free to create a new PR with this code as @droudy is busy till mid-november. It would be good to add the Python 3.4 fix inside the |
Merged in #980 |
Addresses #549. Refactored syn0, syn0norm, vocab, and index2word into their own class, as well as methods not involved in training that require vectors such as
most_similar()
. Maintains backwards compatibility so that calls such astrained_model.most_similar("the")
andtrained_model.syn0
don't break.If you are done with training, you can retrieve the keyedvectors of your trained model and save them to disk like so:
@gojomo please review
The changes to
utils.SaveLoad
in particular are a bit strange due to the need to maintain backwards compatibility, I can elaborate on them if the comments aren't comprehensive enough.