Make negative ns_exponent work correctly #3250

Merged (8 commits, Oct 28, 2021)

Conversation

menshikh-iv
Contributor

Fix #3232

@menshikh-iv
Contributor Author

menshikh-iv commented Oct 23, 2021

First, we confirm that the current code fails on the 3 new test cases (w2v, d2v, ft); see commit 32da953.

My local test execution report:

(gensim) ivan@T490:~/release/gensim$ tox -e py38-linux gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
py38-linux create: /home/ivan/release/gensim/.tox/py38-linux
py38-linux installdeps: pip>=19.1.1, .[test]
py38-linux installed: attrs==21.2.0,certifi==2021.10.8,charset-normalizer==2.0.7,Cython==0.29.24,gensim @ file:///home/ivan/release/gensim,idna==3.3,iniconfig==1.1.1,jsonpatch==1.32,jsonpointer==2.1,mock==4.0.3,Morfessor==2.0.6,nmslib==2.1.1,numpy==1.21.3,packaging==21.0,Pillow==8.4.0,pluggy==1.0.0,psutil==5.8.0,py==1.10.0,pybind11==2.6.1,pyemd==0.5.1,pyparsing==2.4.7,pytest==6.2.5,pyzmq==22.3.0,requests==2.26.0,scipy==1.7.1,six==1.16.0,smart-open==5.2.1,testfixtures==6.18.3,toml==0.10.2,torchfile==0.1.0,tornado==6.1,urllib3==1.26.7,visdom==0.1.8.9,websocket-client==1.2.1
py38-linux run-test-pre: PYTHONHASHSEED='1'
py38-linux run-test: commands[0] | python --version
Python 3.8.10
py38-linux run-test: commands[1] | pip --version
pip 21.2.4 from /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/pip (python 3.8)
py38-linux run-test: commands[2] | python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/corpora/_mmreader.cpython-38-x86_64-linux-gnu.so -> gensim/corpora
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/_matutils.cpython-38-x86_64-linux-gnu.so -> gensim
copying build/lib.linux-x86_64-3.8/gensim/models/nmf_pgd.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/similarities/fastss.cpython-38-x86_64-linux-gnu.so -> gensim/similarities
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
py38-linux run-test: commands[3] | pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
================================================================= test session starts =================================================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
cachedir: .tox/py38-linux/.pytest_cache
rootdir: /home/ivan/release/gensim, configfile: tox.ini
collected 281 items                                                                                                                                   

gensim/test/test_word2vec.py ..........................sF....................................................                                   [ 28%]
gensim/test/test_doc2vec.py ............................s.....F...........                                                                      [ 44%]
gensim/test/test_fasttext.py ..................F..........................................................ssss.............................sF.. [ 85%]
.........................................                                                                                                                                                                   [100%]

==================================================================================================== FAILURES =====================================================================================================
_____________________________________________________________________________________ TestWord2VecModel.test_negative_ns_exp ______________________________________________________________________________________

self = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>

    def test_negative_ns_exp(self):
        # We expect that model should train, save, load and continue training without any exceptions
>       model = word2vec.Word2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)

self       = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>

gensim/test/test_word2vec.py:1059: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/word2vec.py:425: in __init__
    self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
        alpha      = 0.025
        batch_words = 10000
        callbacks  = ()
        cbow_mean  = 1
        comment    = None
        compute_loss = False
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        epochs     = 5
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
        sentences  = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        workers    = 1
gensim/models/word2vec.py:491: in build_vocab
    report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
        corpus_count = 9
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        keep_raw_vocab = False
        kwargs     = {}
        progress_per = 10000
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
        total_words = 29
        trim_rule  = None
        update     = False
gensim/models/word2vec.py:772: in prepare_vocab
    self.make_cum_table()
        downsample_total = 3.5001157321504532
        downsample_unique = 12
        drop_total = 0
        drop_unique = 0
        dry_run    = False
        keep_raw_vocab = False
        min_count  = 1
        original_total = 29
        original_unique_total = 12
        report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
        retain_pct = 100.0
        retain_total = 29
        retain_unique_pct = 100.0
        retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
        sample     = 0.001
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
        threshold_count = 0.029
        trim_rule  = None
        update     = False
        v          = 2
        w          = 'minors'
        word       = 'minors'
        word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>, domain = 2147483647

    def make_cum_table(self, domain=2**31 - 1):
        """Create a cumulative-distribution table using stored vocabulary word counts for
        drawing random words in the negative-sampling training routines.
    
        To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
        then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
        That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
    
        """
        vocab_size = len(self.wv.index_to_key)
        self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            count = self.wv.get_vecattr(word_index, 'count')
>           train_words_pow += count**self.ns_exponent
E           ValueError: Integers to negative integer powers are not allowed.

count      = 4
domain     = 2147483647
self       = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
train_words_pow = 0.0
vocab_size = 12
word_index = 0

gensim/models/word2vec.py:836: ValueError
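The root cause is visible in the last traceback frame: `count` comes back from `wv.get_vecattr()` as a numpy integer, and numpy refuses to raise an integer to a negative integer power (the exact result would not be representable as an integer). A minimal standalone repro:

```python
import numpy as np

# numpy integers raised to a negative integer power raise ValueError,
# because the exact result (e.g. 1/4) cannot be stored as an integer:
count = np.int64(4)
try:
    count ** -1
except ValueError as e:
    print(e)  # Integers to negative integer powers are not allowed.

# Plain Python ints promote to float automatically, which is why the
# bug only surfaces with numpy-backed vocabulary counts:
assert 4 ** -1 == 0.25
```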
______________________________________________________________________________________ TestDoc2VecModel.test_negative_ns_exp ______________________________________________________________________________________

self = <gensim.test.test_doc2vec.TestDoc2VecModel testMethod=test_negative_ns_exp>

    def test_negative_ns_exp(self):
        # We expect that model should train, save, load and continue training without any exceptions
>       model = doc2vec.Doc2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)

self       = <gensim.test.test_doc2vec.TestDoc2VecModel testMethod=test_negative_ns_exp>

gensim/test/test_doc2vec.py:726: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/doc2vec.py:294: in __init__
    super(Doc2Vec, self).__init__(
        __class__  = <class 'gensim.models.doc2vec.Doc2Vec'>
        callbacks  = ()
        comment    = None
        corpus_file = None
        corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        dbow_words = 0
        dm         = 1
        dm_concat  = 0
        dm_mean    = None
        dm_tag_count = 1
        documents  = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        dv         = None
        dv_mapfile = None
        epochs     = 10
        kwargs     = {'min_count': 1, 'ns_exponent': -1, 'workers': 1}
        self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
        shrink_windows = True
        trim_rule  = None
        vector_size = 100
        window     = 5
gensim/models/word2vec.py:425: in __init__
    self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
        alpha      = 0.025
        batch_words = 10000
        callbacks  = ()
        cbow_mean  = 1
        comment    = None
        compute_loss = False
        corpus_file = None
        corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        epochs     = 10
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
        sentences  = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        workers    = 1
gensim/models/doc2vec.py:884: in build_vocab
    report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
        corpus_count = 9
        corpus_file = None
        corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
        keep_raw_vocab = False
        kwargs     = {}
        progress_per = 10000
        self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
        total_words = 29
        trim_rule  = None
        update     = False
gensim/models/word2vec.py:772: in prepare_vocab
    self.make_cum_table()
        downsample_total = 3.5001157321504532
        downsample_unique = 12
        drop_total = 0
        drop_unique = 0
        dry_run    = False
        keep_raw_vocab = False
        min_count  = 1
        original_total = 29
        original_unique_total = 12
        report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
        retain_pct = 100.0
        retain_total = 29
        retain_unique_pct = 100.0
        retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
        sample     = 0.001
        self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
        threshold_count = 0.029
        trim_rule  = None
        update     = False
        v          = 2
        w          = 'minors'
        word       = 'minors'
        word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>, domain = 2147483647

    def make_cum_table(self, domain=2**31 - 1):
        """Create a cumulative-distribution table using stored vocabulary word counts for
        drawing random words in the negative-sampling training routines.
    
        To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
        then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
        That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
    
        """
        vocab_size = len(self.wv.index_to_key)
        self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            count = self.wv.get_vecattr(word_index, 'count')
>           train_words_pow += count**self.ns_exponent
E           ValueError: Integers to negative integer powers are not allowed.

count      = 4
domain     = 2147483647
self       = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
train_words_pow = 0.0
vocab_size = 12
word_index = 0

gensim/models/word2vec.py:836: ValueError
_____________________________________________________________________________________ TestFastTextModel.test_negative_ns_exp ______________________________________________________________________________________

self = <gensim.test.test_fasttext.TestFastTextModel testMethod=test_negative_ns_exp>

    def test_negative_ns_exp(self):
        # We expect that model should train, save, load and continue training without any exceptions
>       model = FT_gensim(sentences, ns_exponent=-1, min_count=1, workers=1)

self       = <gensim.test.test_fasttext.TestFastTextModel testMethod=test_negative_ns_exp>

gensim/test/test_fasttext.py:767: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/fasttext.py:435: in __init__
    super(FastText, self).__init__(
        __class__  = <class 'gensim.models.fasttext.FastText'>
        alpha      = 0.025
        batch_words = 10000
        bucket     = 2000000
        callbacks  = ()
        cbow_mean  = 1
        corpus_file = None
        epochs     = 5
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_n      = 6
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        min_n      = 3
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
        sentences  = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        word_ngrams = 1
        workers    = 1
gensim/models/word2vec.py:425: in __init__
    self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
        alpha      = 0.025
        batch_words = 10000
        callbacks  = ()
        cbow_mean  = 1
        comment    = None
        compute_loss = False
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        epochs     = 5
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
        sentences  = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        workers    = 1
gensim/models/word2vec.py:491: in build_vocab
    report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
        corpus_count = 9
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        keep_raw_vocab = False
        kwargs     = {}
        progress_per = 10000
        self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
        total_words = 29
        trim_rule  = None
        update     = False
gensim/models/word2vec.py:772: in prepare_vocab
    self.make_cum_table()
        downsample_total = 3.5001157321504532
        downsample_unique = 12
        drop_total = 0
        drop_unique = 0
        dry_run    = False
        keep_raw_vocab = False
        min_count  = 1
        original_total = 29
        original_unique_total = 12
        report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
        retain_pct = 100.0
        retain_total = 29
        retain_unique_pct = 100.0
        retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
        sample     = 0.001
        self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
        threshold_count = 0.029
        trim_rule  = None
        update     = False
        v          = 2
        w          = 'minors'
        word       = 'minors'
        word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>, domain = 2147483647

    def make_cum_table(self, domain=2**31 - 1):
        """Create a cumulative-distribution table using stored vocabulary word counts for
        drawing random words in the negative-sampling training routines.
    
        To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
        then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
        That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
    
        """
        vocab_size = len(self.wv.index_to_key)
        self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            count = self.wv.get_vecattr(word_index, 'count')
>           train_words_pow += count**self.ns_exponent
E           ValueError: Integers to negative integer powers are not allowed.

count      = 4
domain     = 2147483647
self       = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
train_words_pow = 0.0
vocab_size = 12
word_index = 0

gensim/models/word2vec.py:836: ValueError
_____________________________________________________________________________________ TestWord2VecModel.test_negative_ns_exp ______________________________________________________________________________________

self = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>

    def test_negative_ns_exp(self):
        # We expect that model should train, save, load and continue training without any exceptions
>       model = word2vec.Word2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)

self       = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>

gensim/test/test_word2vec.py:1059: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/word2vec.py:425: in __init__
    self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
        alpha      = 0.025
        batch_words = 10000
        callbacks  = ()
        cbow_mean  = 1
        comment    = None
        compute_loss = False
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        epochs     = 5
        hashfxn    = <built-in function hash>
        hs         = 0
        max_final_vocab = None
        max_vocab_size = None
        min_alpha  = 0.0001
        min_count  = 1
        negative   = 5
        ns_exponent = -1
        null_word  = 0
        sample     = 0.001
        seed       = 1
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
        sentences  = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        sg         = 0
        shrink_windows = True
        sorted_vocab = 1
        trim_rule  = None
        vector_size = 100
        window     = 5
        workers    = 1
gensim/models/word2vec.py:491: in build_vocab
    report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
        corpus_count = 9
        corpus_file = None
        corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
        keep_raw_vocab = False
        kwargs     = {}
        progress_per = 10000
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
        total_words = 29
        trim_rule  = None
        update     = False
gensim/models/word2vec.py:772: in prepare_vocab
    self.make_cum_table()
        downsample_total = 3.5001157321504532
        downsample_unique = 12
        drop_total = 0
        drop_unique = 0
        dry_run    = False
        keep_raw_vocab = False
        min_count  = 1
        original_total = 29
        original_unique_total = 12
        report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
        retain_pct = 100.0
        retain_total = 29
        retain_unique_pct = 100.0
        retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
        sample     = 0.001
        self       = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
        threshold_count = 0.029
        trim_rule  = None
        update     = False
        v          = 2
        w          = 'minors'
        word       = 'minors'
        word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>, domain = 2147483647

    def make_cum_table(self, domain=2**31 - 1):
        """Create a cumulative-distribution table using stored vocabulary word counts for
        drawing random words in the negative-sampling training routines.
    
        To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
        then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
        That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
    
        """
        vocab_size = len(self.wv.index_to_key)
        self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
        # compute sum of all power (Z in paper)
        train_words_pow = 0.0
        for word_index in range(vocab_size):
            count = self.wv.get_vecattr(word_index, 'count')
>           train_words_pow += count**self.ns_exponent
E           ValueError: Integers to negative integer powers are not allowed.

count      = 4
domain     = 2147483647
self       = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
train_words_pow = 0.0
vocab_size = 12
word_index = 0

gensim/models/word2vec.py:836: ValueError
================================================================================================ warnings summary =================================================================================================
.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21
  /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
  scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
    _deprecated()

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================== slowest 20 durations ===============================================================================================
16.95s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_parallel
11.53s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs
8.29s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
7.08s call     gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_skipgram
6.82s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs
6.74s call     gensim/test/test_fasttext.py::test_sg_hs_training[False]
6.65s call     gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[False]
6.36s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training
5.61s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_fixedwindowsize
5.47s call     gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_cbow
5.29s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_sg_fixedwindowsize
5.14s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
5.13s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_load_old_models_3_x
4.82s call     gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training_fromfile
4.69s call     gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training
4.62s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_neg_fromfile
4.49s call     gensim/test/test_fasttext.py::test_sg_hs_training[True]
4.47s call     gensim/test/test_fasttext.py::SaveGensimByteIdentityTest::test_skipgram
4.35s call     gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[True]
4.33s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_deterministic_dmc
============================================================================================= short test summary info =============================================================================================
FAILED gensim/test/test_word2vec.py::TestWord2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_doc2vec.py::TestDoc2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_fasttext.py::TestFastTextModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_fasttext.py::TestWord2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
SKIPPED [2] gensim/test/test_word2vec.py:637: bulk test only occasionally run locally
SKIPPED [1] gensim/test/test_doc2vec.py:249: See another test for posix above
SKIPPED [1] gensim/test/test_fasttext.py:1694: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1691: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1745: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1742: FT_HOME env variable not set, skipping test
========================================================================= 4 failed, 270 passed, 7 skipped, 1 warning in 305.29s (0:05:05) =========================================================================
ERROR: InvocationError for command /home/ivan/release/gensim/.tox/py38-linux/bin/pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py (exited with code 1)
_____________________________________________________________________________________________________ summary _____________________________________________________________________________________________________
ERROR:   py38-linux: commands failed
(gensim) ivan@T490:~/release/gensim$ 
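The `ValueError` in the failing runs above can be reproduced outside gensim: NumPy refuses to raise integer types to negative integer powers, which is exactly what happens when `ns_exponent` is the plain Python int `-1` and the word counts are NumPy integers. A minimal standalone sketch (not gensim code):

```python
import numpy as np

count = np.int64(5)  # vocabulary word counts in gensim are NumPy integers

# Raising a NumPy integer to a negative *integer* power is disallowed,
# because the exact result cannot be represented in an integer dtype.
try:
    count ** -1
except ValueError as e:
    print(e)  # Integers to negative integer powers are not allowed.

# Casting the exponent to float sidesteps the restriction.
print(count ** float(-1))  # 0.2
```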

@menshikh-iv
Contributor Author

menshikh-iv commented Oct 23, 2021

And after the fix (in the next commit) the tests passed successfully, see f2b5db3

(gensim) ivan@T490:~/release/gensim$ tox -e py38-linux gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
py38-linux recreate: /home/ivan/release/gensim/.tox/py38-linux
py38-linux installdeps: pip>=19.1.1, .[test]
py38-linux installed: attrs==21.2.0,certifi==2021.10.8,charset-normalizer==2.0.7,Cython==0.29.24,gensim @ file:///home/ivan/release/gensim,idna==3.3,iniconfig==1.1.1,jsonpatch==1.32,jsonpointer==2.1,mock==4.0.3,Morfessor==2.0.6,nmslib==2.1.1,numpy==1.21.3,packaging==21.0,Pillow==8.4.0,pluggy==1.0.0,psutil==5.8.0,py==1.10.0,pybind11==2.6.1,pyemd==0.5.1,pyparsing==2.4.7,pytest==6.2.5,pyzmq==22.3.0,requests==2.26.0,scipy==1.7.1,six==1.16.0,smart-open==5.2.1,testfixtures==6.18.3,toml==0.10.2,torchfile==0.1.0,tornado==6.1,urllib3==1.26.7,visdom==0.1.8.9,websocket-client==1.2.1
py38-linux run-test-pre: PYTHONHASHSEED='1'
py38-linux run-test: commands[0] | python --version
Python 3.8.10
py38-linux run-test: commands[1] | pip --version
pip 21.2.4 from /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/pip (python 3.8)
py38-linux run-test: commands[2] | python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/corpora/_mmreader.cpython-38-x86_64-linux-gnu.so -> gensim/corpora
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/_matutils.cpython-38-x86_64-linux-gnu.so -> gensim
copying build/lib.linux-x86_64-3.8/gensim/models/nmf_pgd.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/similarities/fastss.cpython-38-x86_64-linux-gnu.so -> gensim/similarities
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
py38-linux run-test: commands[3] | pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
cachedir: .tox/py38-linux/.pytest_cache
rootdir: /home/ivan/release/gensim, configfile: tox.ini
collected 281 items                                                                                                                                                                                               

gensim/test/test_word2vec.py ..........................s.....................................................                                                                                               [ 28%]
gensim/test/test_doc2vec.py ............................s.................                                                                                                                                  [ 44%]
gensim/test/test_fasttext.py .............................................................................ssss.............................s............................................                    [100%]

================================================================================================ warnings summary =================================================================================================
.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21
  /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
  scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
    _deprecated()

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================== slowest 20 durations ===============================================================================================
12.21s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_parallel
7.33s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs
7.19s call     gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_skipgram
6.87s call     gensim/test/test_fasttext.py::test_sg_hs_training[False]
6.77s call     gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[False]
6.74s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs
5.76s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training
5.47s call     gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_cbow
5.39s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_sg_fixedwindowsize
5.35s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_fixedwindowsize
5.24s call     gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
5.17s call     gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
4.79s call     gensim/test/test_fasttext.py::SaveGensimByteIdentityTest::test_skipgram
4.33s call     gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training_fromfile
4.28s call     gensim/test/test_fasttext.py::test_sg_hs_training[True]
4.26s call     gensim/test/test_fasttext.py::test_cbow_hs_training_fromfile[False]
4.25s call     gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training
4.21s call     gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[True]
3.95s call     gensim/test/test_fasttext.py::test_cbow_hs_training[False]
3.88s call     gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training_fromfile
============================================================================================= short test summary info =============================================================================================
SKIPPED [2] gensim/test/test_word2vec.py:637: bulk test only occasionally run locally
SKIPPED [1] gensim/test/test_doc2vec.py:249: See another test for posix above
SKIPPED [1] gensim/test/test_fasttext.py:1694: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1691: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1745: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1742: FT_HOME env variable not set, skipping test
============================================================================== 274 passed, 7 skipped, 1 warning in 268.54s (0:04:28) ==============================================================================
_____________________________________________________________________________________________________ summary _____________________________________________________________________________________________________
  py38-linux: commands succeeded
  congratulations :)

@menshikh-iv changed the title from "WIP: Make negative ns_exponent works correctly" to "Make negative ns_exponent works correctly" Oct 23, 2021
@piskvorky piskvorky requested a review from gojomo October 23, 2021 14:23
Collaborator

@mpenkov mpenkov left a comment


LGTM

Resolved (outdated) review comments on: gensim/models/word2vec.py, gensim/test/test_doc2vec.py, gensim/test/test_fasttext.py, gensim/test/test_word2vec.py
@gojomo
Collaborator

gojomo commented Oct 26, 2021

Tests are great, but I think an even better fix would be to just convert whatever is in ns_exponent to a float at the single place it needs to be interpreted as a negative floating-point number. That is, changing the line train_words_pow += count**self.ns_exponent to train_words_pow += count**float(self.ns_exponent). With that in place, the init & load fixups are superfluous.

It's fewer changes, solves the same issue (including in older loaded models) – and would also armor the code against a user modifying the ns_exponent value via direct assignment later, as seems possible if using a single model as a repeated template for a series of parameter tests.
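The suggested fix can be sketched as follows. This is a simplified, hypothetical version of the cumulative-table computation (the real implementation is `Word2Vec.make_cum_table` in gensim/models/word2vec.py, which loops over the vocabulary); the `float()` cast at the single point of use is the essence of the change:

```python
import numpy as np

def make_cum_table_sketch(counts, ns_exponent, domain=2**31 - 1):
    """Simplified sketch: build the negative-sampling cumulative table.

    With NumPy integer counts, `count ** -1` raises ValueError, while
    `count ** float(-1)` works -- so casting at the point of use also
    repairs models loaded from older gensim versions, or ns_exponent
    values reassigned directly after construction.
    """
    counts = np.asarray(counts, dtype=np.int64)
    powers = counts ** float(ns_exponent)   # the fix: cast at point of use
    cum = np.cumsum(powers) / powers.sum()  # normalized cumulative distribution
    return (cum * domain).round().astype(np.uint32)

# A negative integer exponent now works; with a negative exponent,
# rarer words receive *larger* negative-sampling mass.
table = make_cum_table_sketch([10, 5, 1], ns_exponent=-1)
```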

@piskvorky changed the title from "Make negative ns_exponent works correctly" to "Make negative ns_exponent work correctly" Oct 27, 2021
@menshikh-iv
Contributor Author

@gojomo applied your suggestion in 521f849

@codecov

codecov bot commented Oct 27, 2021

Codecov Report

Merging #3250 (521f849) into develop (e51288c) will increase coverage by 0.11%.
The diff coverage is 100.00%.

❗ Current head 521f849 differs from the pull request's most recent head 5e8ec5c. Consider uploading reports for the commit 5e8ec5c to get more accurate results.

@@             Coverage Diff             @@
##           develop    #3250      +/-   ##
===========================================
+ Coverage    78.88%   79.00%   +0.11%     
===========================================
  Files           68       68              
  Lines        11772    11772              
===========================================
+ Hits          9286     9300      +14     
+ Misses        2486     2472      -14     
Impacted Files Coverage Δ
gensim/models/word2vec.py 89.77% <100.00%> (ø)
gensim/utils.py 71.54% <0.00%> (+0.48%) ⬆️
gensim/corpora/wikicorpus.py 87.50% <0.00%> (+5.28%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@gojomo
Collaborator

gojomo commented Oct 27, 2021

LGTM!

@mpenkov
Collaborator

mpenkov commented Oct 28, 2021

Merged. Thank you @menshikh-iv !!

@mpenkov mpenkov merged commit 6e36266 into piskvorky:develop Oct 28, 2021
Successfully merging this pull request may close these issues.

Negative exponent with value -1 (minus one) raises error when loading Doc2Vec model