Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save and load methods generate KeyedVector warnings word2vec and doc2vec #1069

Closed
tmylk opened this issue Jan 3, 2017 · 8 comments
Closed
Labels
bug Issue described a bug

Comments

@tmylk
Copy link
Contributor

tmylk commented Jan 3, 2017

Introduced in #980
CC @anmol01gulati

model.save(temp_path)
WARNING:gensim.models.word2vec:direct access to syn0norm will not be supported in future gensim releases, please use model.wv.syn0norm
WARNING:gensim.models.word2vec:direct access to syn0norm will not be supported in future gensim releases, please use model.wv.syn0norm
WARNING:gensim.models.word2vec:direct access to syn0norm will not be supported in future gensim releases, please use model.wv.syn0norm
WARNING:gensim.models.word2vec:direct access to syn0norm will not be supported in future gensim releases, please use model.wv.syn0norm

new_model = gensim.models.Word2Vec.load(temp_path)  # open the model
WARNING:gensim.models.word2vec:direct access to syn0norm will not be supported in future gensim releases, please use model.wv.syn0norm
WARNING:gensim.models.word2vec:direct access to index2word will not be supported in future gensim releases, please use model.wv.index2word
@tmylk tmylk added the bug Issue described a bug label Jan 3, 2017
@tmylk
Copy link
Contributor Author

tmylk commented Jan 4, 2017

The issue is partially fixed by #1072
There is still a problem that after introducing KeyedVectors now syn0, vocab, id2word are saved TWO times. Once in word2vec and once in keyedvectors. After direct access is deprecated it will be only once.

In the mean time we can add an ignore_sub parameter to utils.save_specials where it handles recursive saves. That way syn0norm and syn0 will be ignored from word2vec pickle but saved in KeyedVectors pickle.

@gojomo
Copy link
Collaborator

gojomo commented Jan 4, 2017

Rather than silencing the warnings, via a new warning-enablement state/API, shouldn't the save/load code just not do the warned-against actions? (An API where a SaveLoad object self-reports its eligible properties would make more sense than a hide-warnings workaround.)

(Such complications are why my preference, as per comment at #833 (comment) , was to avoid backward-compatibility tricks by keeping the refactor in a parallel class and/or holding it for an API-breaking major release.)

@tmylk
Copy link
Contributor Author

tmylk commented Jan 4, 2017

Current behaviour of SaveLoad API is to save everything except ignored fields. To change it to save what is explicitly given will take a while. For example even calling 'hasattr' on 'model.vocab' raises a warning. The silencing will be removed in the next API breaking release

@gojomo
Copy link
Collaborator

gojomo commented Jan 5, 2017

My hunch (which could be wrong) is that generically-useful improvements to SaveLoad would fix this with fewer lines of code than this cludgey new warnings-suppression capability. For example, SaveLoad might define a method like:

def savables(self): 
    return iteritems(self.__dict__)

Its other methods that currently use iteritems(self.__dict__) directly would use this instead. And a subclass (like Word2Vec) that has local knowledge that some of its (faked) attributes shouldn't really be included in SaveLoad's list-of-savables (because it will result in wasteful double-saving or unhelpful warnings) would override savables() to prevent the wasteful activity, not just (temporarily) hide it.

@tmylk
Copy link
Contributor Author

tmylk commented Jan 6, 2017

Agree with @gojomo. What do you think, @anmol01gulati?

@jayantj
Copy link
Contributor

jayantj commented Jan 10, 2017

Actually, I looked a little bit into this, because I found it surprising syn0norm and index2word are still accessed directly from Word2Vec. Even if Word2Vec saves all attributes except the ones in ignore, iteritems(self.__dict__) doesn't even return syn0norm and index2word.

The reason they end up being accessed is these two snippets of code -

   @classmethod
    def load(cls, *args, **kwargs):
        Word2Vec.disable_keyed_vectors_warnings()
        model = super(Word2Vec, cls).load(*args, **kwargs)
        # update older models
        if hasattr(model, 'table'):
            delattr(model, 'table')  # discard in favor of cum_table
        if model.negative and hasattr(model, 'index2word'):
            model.make_cum_table()  # rebuild cum_table from vocabulary
        ...
        ...

and

    def save(self, *args, **kwargs):
        # don't bother storing the cached normalized vectors, recalculable table
        # TODO: after introducing KeyedVectors now syn0, vocab, id2word are saved TWO times. Once in word2vec and once in keyedvectors
        #       After keyedvectors are deprecated it will be only once
        Word2Vec.disable_keyed_vectors_warnings()
        kwargs['ignore'] = kwargs.get('ignore', ['syn0norm', 'table', 'cum_table'])
        ...
        ...

The first should be easily fixable (by changing to hasattr(model.wv, 'index2word')).

The second one is slightly trickier, because recursive SaveLoads are handled a little unintuitively.
If I simply remove syn0norm from this line -

kwargs['ignore'] = kwargs.get('ignore', ['syn0norm', 'table', 'cum_table'])

in Word2Vec, and add it to KeyedVecs instead, it doesn't work as intended, because Word2Vec never calls the save method of KeyedVecs while saving KeyedVecs. It only calls an internal method _save_specials.

The best fix, IMO, would be to make the default ignore fields for both KeyedVecs and Word2Vec a class variable instead of the current hack inside save, and use that in either @gojomo 's proposed savables method, or in the current handling of ignore in _save_specials.

@somah1411
Copy link

can please explaine more about this problem and where can i but this hasattr(model.wv, 'index2word')).

@tmylk
Copy link
Contributor Author

tmylk commented Feb 17, 2017

This is resolved by deprecating direct access to KeyedVectors in #1147

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug
Projects
None yet
Development

No branches or pull requests

4 participants