Make negative ns_exponent work correctly #3250

Merged (8 commits, Oct 28, 2021)

Conversation

menshikh-iv
Contributor

Fix #3232

@menshikh-iv
Contributor Author

menshikh-iv commented Oct 23, 2021

First, we confirm that the current code fails on the 3 new test cases (w2v, d2v, ft); see commit 32da953.

My local test execution report:

(gensim) ivan@T490:~/release/gensim$ tox -e py38-linux gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
py38-linux create: /home/ivan/release/gensim/.tox/py38-linux
py38-linux installdeps: pip>=19.1.1, .[test]
py38-linux installed: attrs==21.2.0,certifi==2021.10.8,charset-normalizer==2.0.7,Cython==0.29.24,gensim @ file:///home/ivan/release/gensim,idna==3.3,iniconfig==1.1.1,jsonpatch==1.32,jsonpointer==2.1,mock==4.0.3,Morfessor==2.0.6,nmslib==2.1.1,numpy==1.21.3,packaging==21.0,Pillow==8.4.0,pluggy==1.0.0,psutil==5.8.0,py==1.10.0,pybind11==2.6.1,pyemd==0.5.1,pyparsing==2.4.7,pytest==6.2.5,pyzmq==22.3.0,requests==2.26.0,scipy==1.7.1,six==1.16.0,smart-open==5.2.1,testfixtures==6.18.3,toml==0.10.2,torchfile==0.1.0,tornado==6.1,urllib3==1.26.7,visdom==0.1.8.9,websocket-client==1.2.1
py38-linux run-test-pre: PYTHONHASHSEED='1'
py38-linux run-test: commands[0] | python --version
Python 3.8.10
py38-linux run-test: commands[1] | pip --version
pip 21.2.4 from /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/pip (python 3.8)
py38-linux run-test: commands[2] | python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/corpora/_mmreader.cpython-38-x86_64-linux-gnu.so -> gensim/corpora
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/_matutils.cpython-38-x86_64-linux-gnu.so -> gensim
copying build/lib.linux-x86_64-3.8/gensim/models/nmf_pgd.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/similarities/fastss.cpython-38-x86_64-linux-gnu.so -> gensim/similarities
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
py38-linux run-test: commands[3] | pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
================================================================= test session starts =================================================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
cachedir: .tox/py38-linux/.pytest_cache
rootdir: /home/ivan/release/gensim, configfile: tox.ini
collected 281 items                                                                                                                                   

gensim/test/test_word2vec.py ..........................sF....................................................                                   [ 28%]
gensim/test/test_doc2vec.py ............................s.....F...........                                                                      [ 44%]
gensim/test/test_fasttext.py ..................F..........................................................ssss.............................sF.. [ 85%]
.........................................                                                                                                                                                                   [100%]

==================================================================================================== FAILURES =====================================================================================================
_____________________________________________________________________________________ TestWord2VecModel.test_negative_ns_exp ______________________________________________________________________________________

self = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>

    def test_negative_ns_exp(self):
        # We expect that model should train, save, load and continue training without any exceptions
>       model = word2vec.Word2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)

self       = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>

gensim/test/test_word2vec.py:1059: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/word2vec.py:425: in __init__
    self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
        alpha      = 0.025
        batch_words = 10000
        callbacks  = ()
        cbow_mean  = 1
        comment    = None
        compute_loss = False
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        epochs     = 5
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
        sentences  = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        workers    = 1
gensim/models/word2vec.py:491: in build_vocab
    report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
        corpus_count = 9
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        keep_raw_vocab = False
        kwargs     = {}
        progress_per = 10000
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
        total_words = 29
        trim_rule  = None
        update     = False
gensim/models/word2vec.py:772: in prepare_vocab
    self.make_cum_table()
        downsample_total = 3.5001157321504532
        downsample_unique = 12
        drop_total = 0
        drop_unique = 0
        dry_run    = False
        keep_raw_vocab = False
        min_count  = 1
        original_total = 29
        original_unique_total = 12
        report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
        retain_pct = 100.0
        retain_total = 29
        retain_unique_pct = 100.0
        retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
        sample     = 0.001
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
        threshold_count = 0.029
        trim_rule  = None
        update     = False
        v          = 2
        w          = 'minors'
        word       = 'minors'
        word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>, domain = 2147483647

    def make_cum_table(self, domain=2**31 - 1):
        """Create a cumulative-distribution table using stored vocabulary word counts for
        drawing random words in the negative-sampling training routines.
    
        To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
        then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
        That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
    
        """
        vocab_size = len(self.wv.index_to_key)
        self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            count = self.wv.get_vecattr(word_index, 'count')
>           train_words_pow += count**self.ns_exponent
E           ValueError: Integers to negative integer powers are not allowed.

count      = 4
domain     = 2147483647
self       = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
train_words_pow = 0.0
vocab_size = 12
word_index = 0

gensim/models/word2vec.py:836: ValueError
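The root cause is visible in the last traceback frame: `count` comes back from `wv.get_vecattr()` as a numpy integer, and numpy refuses to raise an integer to a negative integer power (the exact result would not be representable as an integer). A minimal standalone repro:

```python
import numpy as np

# numpy integers raised to a negative integer power raise ValueError,
# because the exact result (e.g. 1/4) cannot be stored as an integer:
count = np.int64(4)
try:
    count ** -1
except ValueError as e:
    print(e)  # Integers to negative integer powers are not allowed.

# Plain Python ints promote to float automatically, which is why the
# bug only surfaces with numpy-backed vocabulary counts:
assert 4 ** -1 == 0.25
```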
______________________________________________________________________________________ TestDoc2VecModel.test_negative_ns_exp ______________________________________________________________________________________

self = <gensim.test.test_doc2vec.TestDoc2VecModel testMethod=test_negative_ns_exp>

    def test_negative_ns_exp(self):
        # We expect that model should train, save, load and continue training without any exceptions
>       model = doc2vec.Doc2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)

self       = <gensim.test.test_doc2vec.TestDoc2VecModel testMethod=test_negative_ns_exp>

gensim/test/test_doc2vec.py:726: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/doc2vec.py:294: in __init__
    super(Doc2Vec, self).__init__(
        __class__  = <class 'gensim.models.doc2vec.Doc2Vec'>
        callbacks  = ()
        comment    = None
        corpus_file = None
        corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        dbow_words = 0
        dm         = 1
        dm_concat  = 0
        dm_mean    = None
        dm_tag_count = 1
        documents  = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        dv         = None
        dv_mapfile = None
        epochs     = 10
        kwargs     = {'min_count': 1, 'ns_exponent': -1, 'workers': 1}
        self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
        shrink_windows = True
        trim_rule  = None
        vector_size = 100
        window     = 5
gensim/models/word2vec.py:425: in __init__
    self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
        alpha      = 0.025
        batch_words = 10000
        callbacks  = ()
        cbow_mean  = 1
        comment    = None
        compute_loss = False
        corpus_file = None
        corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        epochs     = 10
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
        sentences  = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        workers    = 1
gensim/models/doc2vec.py:884: in build_vocab
    report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
        corpus_count = 9
        corpus_file = None
        corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        keep_raw_vocab = False
        kwargs     = {}
        progress_per = 10000
        self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
        total_words = 29
        trim_rule  = None
        update     = False
gensim/models/word2vec.py:772: in prepare_vocab
    self.make_cum_table()
        downsample_total = 3.5001157321504532
        downsample_unique = 12
        drop_total = 0
        drop_unique = 0
        dry_run    = False
        keep_raw_vocab = False
        min_count  = 1
        original_total = 29
        original_unique_total = 12
        report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
        retain_pct = 100.0
        retain_total = 29
        retain_unique_pct = 100.0
        retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
        sample     = 0.001
        self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
        threshold_count = 0.029
        trim_rule  = None
        update     = False
        v          = 2
        w          = 'minors'
        word       = 'minors'
        word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>, domain = 2147483647

    def make_cum_table(self, domain=2**31 - 1):
        """Create a cumulative-distribution table using stored vocabulary word counts for
        drawing random words in the negative-sampling training routines.
    
        To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
        then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
        That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
    
        """
        vocab_size = len(self.wv.index_to_key)
        self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            count = self.wv.get_vecattr(word_index, 'count')
>           train_words_pow += count**self.ns_exponent
E           ValueError: Integers to negative integer powers are not allowed.

count      = 4
domain     = 2147483647
self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
train_words_pow = 0.0
vocab_size = 12
word_index = 0

gensim/models/word2vec.py:836: ValueError
_____________________________________________________________________________________ TestFastTextModel.test_negative_ns_exp ______________________________________________________________________________________

self = <gensim.test.test_fasttext.TestFastTextModel testMethod=test_negative_ns_exp>

    def test_negative_ns_exp(self):
        # We expect that model should train, save, load and continue training without any exceptions
>       model = FT_gensim(sentences, ns_exponent=-1, min_count=1, workers=1)

self       = <gensim.test.test_fasttext.TestFastTextModel testMethod=test_negative_ns_exp>

gensim/test/test_fasttext.py:767: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/fasttext.py:435: in __init__
    super(FastText, self).__init__(
        __class__  = <class 'gensim.models.fasttext.FastText'>
        alpha      = 0.025
        batch_words = 10000
        bucket     = 2000000
        callbacks  = ()
        cbow_mean  = 1
        corpus_file = None
        epochs     = 5
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_n      = 6
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        min_n      = 3
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
        sentences  = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        word_ngrams = 1
        workers    = 1
gensim/models/word2vec.py:425: in __init__
    self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
        alpha      = 0.025
        batch_words = 10000
        callbacks  = ()
        cbow_mean  = 1
        comment    = None
        compute_loss = False
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        epochs     = 5
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
        sentences  = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        workers    = 1
gensim/models/word2vec.py:491: in build_vocab
    report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
        corpus_count = 9
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        keep_raw_vocab = False
        kwargs     = {}
        progress_per = 10000
        self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
        total_words = 29
        trim_rule  = None
        update     = False
gensim/models/word2vec.py:772: in prepare_vocab
    self.make_cum_table()
        downsample_total = 3.5001157321504532
        downsample_unique = 12
        drop_total = 0
        drop_unique = 0
        dry_run    = False
        keep_raw_vocab = False
        min_count  = 1
        original_total = 29
        original_unique_total = 12
        report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
        retain_pct = 100.0
        retain_total = 29
        retain_unique_pct = 100.0
        retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
        sample     = 0.001
        self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
        threshold_count = 0.029
        trim_rule  = None
        update     = False
        v          = 2
        w          = 'minors'
        word       = 'minors'
        word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>, domain = 2147483647

    def make_cum_table(self, domain=2**31 - 1):
        """Create a cumulative-distribution table using stored vocabulary word counts for
        drawing random words in the negative-sampling training routines.
    
        To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
        then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
        That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
    
        """
        vocab_size = len(self.wv.index_to_key)
        self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            count = self.wv.get_vecattr(word_index, 'count')
>           train_words_pow += count**self.ns_exponent
E           ValueError: Integers to negative integer powers are not allowed.

count      = 4
domain     = 2147483647
self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
train_words_pow = 0.0
vocab_size = 12
word_index = 0

gensim/models/word2vec.py:836: ValueError
_____________________________________________________________________________________ TestWord2VecModel.test_negative_ns_exp ______________________________________________________________________________________

self = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>

    def test_negative_ns_exp(self):
        # We expect that model should train, save, load and continue training without any exceptions
>       model = word2vec.Word2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)

self       = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>

gensim/test/test_word2vec.py:1059: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/word2vec.py:425: in __init__
    self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
        alpha      = 0.025
        batch_words = 10000
        callbacks  = ()
        cbow_mean  = 1
        comment    = None
        compute_loss = False
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        epochs     = 5
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
        sentences  = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        workers    = 1
gensim/models/word2vec.py:491: in build_vocab
    report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
        corpus_count = 9
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        keep_raw_vocab = False
        kwargs     = {}
        progress_per = 10000
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
        total_words = 29
        trim_rule  = None
        update     = False
gensim/models/word2vec.py:772: in prepare_vocab
    self.make_cum_table()
        downsample_total = 3.5001157321504532
        downsample_unique = 12
        drop_total = 0
        drop_unique = 0
        dry_run    = False
        keep_raw_vocab = False
        min_count  = 1
        original_total = 29
        original_unique_total = 12
        report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
        retain_pct = 100.0
        retain_total = 29
        retain_unique_pct = 100.0
        retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
        sample     = 0.001
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
        threshold_count = 0.029
        trim_rule  = None
        update     = False
        v          = 2
        w          = 'minors'
        word       = 'minors'
        word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>, domain = 2147483647

    def make_cum_table(self, domain=2**31 - 1):
        """Create a cumulative-distribution table using stored vocabulary word counts for
        drawing random words in the negative-sampling training routines.
    
        To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
        then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
        That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
    
        """
        vocab_size = len(self.wv.index_to_key)
        self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            count = self.wv.get_vecattr(word_index, 'count')
>           train_words_pow += count**self.ns_exponent
E           ValueError: Integers to negative integer powers are not allowed.

count      = 4
domain     = 2147483647
self       = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
train_words_pow = 0.0
vocab_size = 12
word_index = 0

gensim/models/word2vec.py:836: ValueError
================================================================================================ warnings summary =================================================================================================
.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21
  /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
  scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
    _deprecated()

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================== slowest 20 durations ===============================================================================================
16.95s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_parallel
11.53s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs
8.29s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
7.08s call     gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_skipgram
6.82s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs
6.74s call     gensim/test/test_fasttext.py::test_sg_hs_training[False]
6.65s call     gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[False]
6.36s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training
5.61s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_fixedwindowsize
5.47s call     gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_cbow
5.29s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_sg_fixedwindowsize
5.14s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
5.13s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_load_old_models_3_x
4.82s call     gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training_fromfile
4.69s call     gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training
4.62s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_neg_fromfile
4.49s call     gensim/test/test_fasttext.py::test_sg_hs_training[True]
4.47s call     gensim/test/test_fasttext.py::SaveGensimByteIdentityTest::test_skipgram
4.35s call     gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[True]
4.33s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_deterministic_dmc
============================================================================================= short test summary info =============================================================================================
FAILED gensim/test/test_word2vec.py::TestWord2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_doc2vec.py::TestDoc2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_fasttext.py::TestFastTextModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_fasttext.py::TestWord2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
SKIPPED [2] gensim/test/test_word2vec.py:637: bulk test only occasionally run locally
SKIPPED [1] gensim/test/test_doc2vec.py:249: See another test for posix above
SKIPPED [1] gensim/test/test_fasttext.py:1694: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1691: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1745: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1742: FT_HOME env variable not set, skipping test
========================================================================= 4 failed, 270 passed, 7 skipped, 1 warning in 305.29s (0:05:05) =========================================================================
ERROR: InvocationError for command /home/ivan/release/gensim/.tox/py38-linux/bin/pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py (exited with code 1)
_____________________________________________________________________________________________________ summary _____________________________________________________________________________________________________
ERROR:   py38-linux: commands failed
(gensim) ivan@T490:~/release/gensim$ 
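The `ValueError` in the failing runs above can be reproduced outside gensim: NumPy refuses to raise integer types to negative integer powers, which is exactly what happens when `ns_exponent` is the plain Python int `-1` and the word counts are NumPy integers. A minimal standalone sketch (not gensim code):

```python
import numpy as np

count = np.int64(5)  # vocabulary word counts in gensim are NumPy integers

# Raising a NumPy integer to a negative *integer* power is disallowed,
# because the exact result cannot be represented in an integer dtype.
try:
    count ** -1
except ValueError as e:
    print(e)  # Integers to negative integer powers are not allowed.

# Casting the exponent to float sidesteps the restriction.
print(count ** float(-1))  # 0.2
```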

@menshikh-iv
Contributor Author

menshikh-iv commented Oct 23, 2021

And after the fix (in the next commit) the tests passed successfully, see f2b5db3

(gensim) ivan@T490:~/release/gensim$ tox -e py38-linux gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
py38-linux recreate: /home/ivan/release/gensim/.tox/py38-linux
py38-linux installdeps: pip>=19.1.1, .[test]
py38-linux installed: attrs==21.2.0,certifi==2021.10.8,charset-normalizer==2.0.7,Cython==0.29.24,gensim @ file:///home/ivan/release/gensim,idna==3.3,iniconfig==1.1.1,jsonpatch==1.32,jsonpointer==2.1,mock==4.0.3,Morfessor==2.0.6,nmslib==2.1.1,numpy==1.21.3,packaging==21.0,Pillow==8.4.0,pluggy==1.0.0,psutil==5.8.0,py==1.10.0,pybind11==2.6.1,pyemd==0.5.1,pyparsing==2.4.7,pytest==6.2.5,pyzmq==22.3.0,requests==2.26.0,scipy==1.7.1,six==1.16.0,smart-open==5.2.1,testfixtures==6.18.3,toml==0.10.2,torchfile==0.1.0,tornado==6.1,urllib3==1.26.7,visdom==0.1.8.9,websocket-client==1.2.1
py38-linux run-test-pre: PYTHONHASHSEED='1'
py38-linux run-test: commands[0] | python --version
Python 3.8.10
py38-linux run-test: commands[1] | pip --version
pip 21.2.4 from /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/pip (python 3.8)
py38-linux run-test: commands[2] | python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/corpora/_mmreader.cpython-38-x86_64-linux-gnu.so -> gensim/corpora
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/_matutils.cpython-38-x86_64-linux-gnu.so -> gensim
copying build/lib.linux-x86_64-3.8/gensim/models/nmf_pgd.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/similarities/fastss.cpython-38-x86_64-linux-gnu.so -> gensim/similarities
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
py38-linux run-test: commands[3] | pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
cachedir: .tox/py38-linux/.pytest_cache
rootdir: /home/ivan/release/gensim, configfile: tox.ini
collected 281 items                                                                                                                                                                                               

gensim/test/test_word2vec.py ..........................s.....................................................                                                                                               [ 28%]
gensim/test/test_doc2vec.py ............................s.................                                                                                                                                  [ 44%]
gensim/test/test_fasttext.py .............................................................................ssss.............................s............................................                    [100%]

================================================================================================ warnings summary =================================================================================================
.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21
  /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
  scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
    _deprecated()

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================== slowest 20 durations ===============================================================================================
12.21s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_parallel
7.33s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs
7.19s call     gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_skipgram
6.87s call     gensim/test/test_fasttext.py::test_sg_hs_training[False]
6.77s call     gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[False]
6.74s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs
5.76s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training
5.47s call     gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_cbow
5.39s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_sg_fixedwindowsize
5.35s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_fixedwindowsize
5.24s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
5.17s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
4.79s call     gensim/test/test_fasttext.py::SaveGensimByteIdentityTest::test_skipgram
4.33s call     gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training_fromfile
4.28s call     gensim/test/test_fasttext.py::test_sg_hs_training[True]
4.26s call     gensim/test/test_fasttext.py::test_cbow_hs_training_fromfile[False]
4.25s call     gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training
4.21s call     gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[True]
3.95s call     gensim/test/test_fasttext.py::test_cbow_hs_training[False]
3.88s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training_fromfile
============================================================================================= short test summary info =============================================================================================
SKIPPED [2] gensim/test/test_word2vec.py:637: bulk test only occasionally run locally
SKIPPED [1] gensim/test/test_doc2vec.py:249: See another test for posix above
SKIPPED [1] gensim/test/test_fasttext.py:1694: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1691: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1745: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1742: FT_HOME env variable not set, skipping test
============================================================================== 274 passed, 7 skipped, 1 warning in 268.54s (0:04:28) ==============================================================================
_____________________________________________________________________________________________________ summary _____________________________________________________________________________________________________
  py38-linux: commands succeeded
  congratulations :)

@menshikh-iv changed the title from "WIP: Make negative ns_exponent works correctly" to "Make negative ns_exponent works correctly" Oct 23, 2021
@piskvorky piskvorky requested a review from gojomo October 23, 2021 14:23
Collaborator

@mpenkov mpenkov left a comment


LGTM

Resolved (outdated) review comments on: gensim/models/word2vec.py, gensim/test/test_doc2vec.py, gensim/test/test_fasttext.py, gensim/test/test_word2vec.py
@gojomo
Collaborator

gojomo commented Oct 26, 2021

Tests are great, but I think an even better fix would be to just convert whatever is in ns_exponent to a float at the single place it needs to be interpreted as a negative floating-point number. That is, changing the line train_words_pow += count**self.ns_exponent to train_words_pow += count**float(self.ns_exponent). With that in place, the init & load fixups are superfluous.

It's fewer changes, solves the same issue (including in older loaded models) – and would also armor the code against a user modifying the ns_exponent value via direct assignment later, as seems possible if using a single model as a repeated template for a series of parameter tests.
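The suggested fix can be sketched as follows. This is a simplified, hypothetical version of the cumulative-table computation (the real implementation is `Word2Vec.make_cum_table` in gensim/models/word2vec.py, which loops over the vocabulary); the `float()` cast at the single point of use is the essence of the change:

```python
import numpy as np

def make_cum_table_sketch(counts, ns_exponent, domain=2**31 - 1):
    """Simplified sketch: build the negative-sampling cumulative table.

    With NumPy integer counts, `count ** -1` raises ValueError, while
    `count ** float(-1)` works -- so casting at the point of use also
    repairs models loaded from older gensim versions, or ns_exponent
    values reassigned directly after construction.
    """
    counts = np.asarray(counts, dtype=np.int64)
    powers = counts ** float(ns_exponent)   # the fix: cast at point of use
    cum = np.cumsum(powers) / powers.sum()  # normalized cumulative distribution
    return (cum * domain).round().astype(np.uint32)

# A negative integer exponent now works; with a negative exponent,
# rarer words receive *larger* negative-sampling mass.
table = make_cum_table_sketch([10, 5, 1], ns_exponent=-1)
```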

@piskvorky changed the title from "Make negative ns_exponent works correctly" to "Make negative ns_exponent work correctly" Oct 27, 2021
@menshikh-iv
Contributor Author

@gojomo applied your suggestion in 521f849

@codecov

codecov bot commented Oct 27, 2021

Codecov Report

Merging #3250 (521f849) into develop (e51288c) will increase coverage by 0.11%.
The diff coverage is 100.00%.

❗ Current head 521f849 differs from the pull request's most recent head 5e8ec5c. Consider uploading reports for the commit 5e8ec5c to get more accurate results.

@@             Coverage Diff             @@
##           develop    #3250      +/-   ##
===========================================
+ Coverage    78.88%   79.00%   +0.11%     
===========================================
  Files           68       68              
  Lines        11772    11772              
===========================================
+ Hits          9286     9300      +14     
+ Misses        2486     2472      -14     
Impacted Files Coverage Δ
gensim/models/word2vec.py 89.77% <100.00%> (ø)
gensim/utils.py 71.54% <0.00%> (+0.48%) ⬆️
gensim/corpora/wikicorpus.py 87.50% <0.00%> (+5.28%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@gojomo
Collaborator

gojomo commented Oct 27, 2021

LGTM!

@mpenkov
Collaborator

mpenkov commented Oct 28, 2021

Merged. Thank you @menshikh-iv !!

@mpenkov mpenkov merged commit 6e36266 into piskvorky:develop Oct 28, 2021
Successfully merging this pull request may close these issues.

Negative exponent with value -1 (minus one) raises error when loading Doc2Vec model