Make negative ns_exponent work correctly #3250
Conversation
Now the current tests fail on the 3 new cases (w2v, d2v, ft); see commit 32da953.

My local test execution report:

(gensim) ivan@T490:~/release/gensim$ tox -e py38-linux gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
py38-linux create: /home/ivan/release/gensim/.tox/py38-linux
py38-linux installdeps: pip>=19.1.1, .[test]
py38-linux installed: attrs==21.2.0,certifi==2021.10.8,charset-normalizer==2.0.7,Cython==0.29.24,gensim @ file:///home/ivan/release/gensim,idna==3.3,iniconfig==1.1.1,jsonpatch==1.32,jsonpointer==2.1,mock==4.0.3,Morfessor==2.0.6,nmslib==2.1.1,numpy==1.21.3,packaging==21.0,Pillow==8.4.0,pluggy==1.0.0,psutil==5.8.0,py==1.10.0,pybind11==2.6.1,pyemd==0.5.1,pyparsing==2.4.7,pytest==6.2.5,pyzmq==22.3.0,requests==2.26.0,scipy==1.7.1,six==1.16.0,smart-open==5.2.1,testfixtures==6.18.3,toml==0.10.2,torchfile==0.1.0,tornado==6.1,urllib3==1.26.7,visdom==0.1.8.9,websocket-client==1.2.1
py38-linux run-test-pre: PYTHONHASHSEED='1'
py38-linux run-test: commands[0] | python --version
Python 3.8.10
py38-linux run-test: commands[1] | pip --version
pip 21.2.4 from /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/pip (python 3.8)
py38-linux run-test: commands[2] | python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/corpora/_mmreader.cpython-38-x86_64-linux-gnu.so -> gensim/corpora
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/_matutils.cpython-38-x86_64-linux-gnu.so -> gensim
copying build/lib.linux-x86_64-3.8/gensim/models/nmf_pgd.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/similarities/fastss.cpython-38-x86_64-linux-gnu.so -> gensim/similarities
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
py38-linux run-test: commands[3] | pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
================================================================= test session starts =================================================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
cachedir: .tox/py38-linux/.pytest_cache
rootdir: /home/ivan/release/gensim, configfile: tox.ini
collected 281 items
gensim/test/test_word2vec.py ..........................sF.................................................... [ 28%]
gensim/test/test_doc2vec.py ............................s.....F........... [ 44%]
gensim/test/test_fasttext.py ..................F..........................................................ssss.............................sF.. [ 85%]
......................................... [100%]
==================================================================================================== FAILURES =====================================================================================================
_____________________________________________________________________________________ TestWord2VecModel.test_negative_ns_exp ______________________________________________________________________________________
self = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>
def test_negative_ns_exp(self):
# We expect that model should train, save, load and continue training without any exceptions
> model = word2vec.Word2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)
self = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>
gensim/test/test_word2vec.py:1059:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/word2vec.py:425: in __init__
self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
alpha = 0.025
batch_words = 10000
callbacks = ()
cbow_mean = 1
comment = None
compute_loss = False
corpus_file = None
corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
epochs = 5
hashfxn = <built-in function hash>
hs = 0
max_final_vocab = None
max_vocab_size = None
min_alpha = 0.0001
min_count = 1
negative = 5
ns_exponent = -1
null_word = 0
sample = 0.001
seed = 1
self = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
sentences = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
sg = 0
shrink_windows = True
sorted_vocab = 1
trim_rule = None
vector_size = 100
window = 5
workers = 1
gensim/models/word2vec.py:491: in build_vocab
report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
corpus_count = 9
corpus_file = None
corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
keep_raw_vocab = False
kwargs = {}
progress_per = 10000
self = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
total_words = 29
trim_rule = None
update = False
gensim/models/word2vec.py:772: in prepare_vocab
self.make_cum_table()
downsample_total = 3.5001157321504532
downsample_unique = 12
drop_total = 0
drop_unique = 0
dry_run = False
keep_raw_vocab = False
min_count = 1
original_total = 29
original_unique_total = 12
report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
retain_pct = 100.0
retain_total = 29
retain_unique_pct = 100.0
retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
sample = 0.001
self = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
threshold_count = 0.029
trim_rule = None
update = False
v = 2
w = 'minors'
word = 'minors'
word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>, domain = 2147483647
def make_cum_table(self, domain=2**31 - 1):
"""Create a cumulative-distribution table using stored vocabulary word counts for
drawing random words in the negative-sampling training routines.
To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
"""
vocab_size = len(self.wv.index_to_key)
self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
# compute sum of all power (Z in paper)
train_words_pow = 0.0
for word_index in range(vocab_size):
count = self.wv.get_vecattr(word_index, 'count')
> train_words_pow += count**self.ns_exponent
E ValueError: Integers to negative integer powers are not allowed.
count = 4
domain = 2147483647
self = <gensim.models.word2vec.Word2Vec object at 0x7f54865434f0>
train_words_pow = 0.0
vocab_size = 12
word_index = 0
gensim/models/word2vec.py:836: ValueError
______________________________________________________________________________________ TestDoc2VecModel.test_negative_ns_exp ______________________________________________________________________________________
self = <gensim.test.test_doc2vec.TestDoc2VecModel testMethod=test_negative_ns_exp>
def test_negative_ns_exp(self):
# We expect that model should train, save, load and continue training without any exceptions
> model = doc2vec.Doc2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)
self = <gensim.test.test_doc2vec.TestDoc2VecModel testMethod=test_negative_ns_exp>
gensim/test/test_doc2vec.py:726:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/doc2vec.py:294: in __init__
super(Doc2Vec, self).__init__(
__class__ = <class 'gensim.models.doc2vec.Doc2Vec'>
callbacks = ()
comment = None
corpus_file = None
corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
dbow_words = 0
dm = 1
dm_concat = 0
dm_mean = None
dm_tag_count = 1
documents = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
dv = None
dv_mapfile = None
epochs = 10
kwargs = {'min_count': 1, 'ns_exponent': -1, 'workers': 1}
self = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
shrink_windows = True
trim_rule = None
vector_size = 100
window = 5
gensim/models/word2vec.py:425: in __init__
self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
alpha = 0.025
batch_words = 10000
callbacks = ()
cbow_mean = 1
comment = None
compute_loss = False
corpus_file = None
corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
epochs = 10
hashfxn = <built-in function hash>
hs = 0
max_final_vocab = None
max_vocab_size = None
min_alpha = 0.0001
min_count = 1
negative = 5
ns_exponent = -1
null_word = 0
sample = 0.001
seed = 1
self = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
sentences = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
sg = 0
shrink_windows = True
sorted_vocab = 1
trim_rule = None
vector_size = 100
window = 5
workers = 1
gensim/models/doc2vec.py:884: in build_vocab
report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
corpus_count = 9
corpus_file = None
corpus_iterable = [TaggedDocument(words=['human', 'interface', 'computer'], tags=[0]), TaggedDocument(words=['survey', 'user', 'computer...ags=[3]), TaggedDocument(words=['user', 'response', 'time'], tags=[4]), TaggedDocument(words=['trees'], tags=[5]), ...]
keep_raw_vocab = False
kwargs = {}
progress_per = 10000
self = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
total_words = 29
trim_rule = None
update = False
gensim/models/word2vec.py:772: in prepare_vocab
self.make_cum_table()
downsample_total = 3.5001157321504532
downsample_unique = 12
drop_total = 0
drop_unique = 0
dry_run = False
keep_raw_vocab = False
min_count = 1
original_total = 29
original_unique_total = 12
report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
retain_pct = 100.0
retain_total = 29
retain_unique_pct = 100.0
retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
sample = 0.001
self = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
threshold_count = 0.029
trim_rule = None
update = False
v = 2
w = 'minors'
word = 'minors'
word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>, domain = 2147483647
def make_cum_table(self, domain=2**31 - 1):
"""Create a cumulative-distribution table using stored vocabulary word counts for
drawing random words in the negative-sampling training routines.
To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
"""
vocab_size = len(self.wv.index_to_key)
self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
# compute sum of all power (Z in paper)
train_words_pow = 0.0
for word_index in range(vocab_size):
count = self.wv.get_vecattr(word_index, 'count')
> train_words_pow += count**self.ns_exponent
E ValueError: Integers to negative integer powers are not allowed.
count = 4
domain = 2147483647
self = <gensim.models.doc2vec.Doc2Vec object at 0x7f5486d3da60>
train_words_pow = 0.0
vocab_size = 12
word_index = 0
gensim/models/word2vec.py:836: ValueError
_____________________________________________________________________________________ TestFastTextModel.test_negative_ns_exp ______________________________________________________________________________________
self = <gensim.test.test_fasttext.TestFastTextModel testMethod=test_negative_ns_exp>
def test_negative_ns_exp(self):
# We expect that model should train, save, load and continue training without any exceptions
> model = FT_gensim(sentences, ns_exponent=-1, min_count=1, workers=1)
self = <gensim.test.test_fasttext.TestFastTextModel testMethod=test_negative_ns_exp>
gensim/test/test_fasttext.py:767:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/fasttext.py:435: in __init__
super(FastText, self).__init__(
__class__ = <class 'gensim.models.fasttext.FastText'>
alpha = 0.025
batch_words = 10000
bucket = 2000000
callbacks = ()
cbow_mean = 1
corpus_file = None
epochs = 5
hashfxn = <built-in function hash>
hs = 0
max_final_vocab = None
max_n = 6
max_vocab_size = None
min_alpha = 0.0001
min_count = 1
min_n = 3
negative = 5
ns_exponent = -1
null_word = 0
sample = 0.001
seed = 1
self = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
sentences = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
sg = 0
shrink_windows = True
sorted_vocab = 1
trim_rule = None
vector_size = 100
window = 5
word_ngrams = 1
workers = 1
gensim/models/word2vec.py:425: in __init__
self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
alpha = 0.025
batch_words = 10000
callbacks = ()
cbow_mean = 1
comment = None
compute_loss = False
corpus_file = None
corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
epochs = 5
hashfxn = <built-in function hash>
hs = 0
max_final_vocab = None
max_vocab_size = None
min_alpha = 0.0001
min_count = 1
negative = 5
ns_exponent = -1
null_word = 0
sample = 0.001
seed = 1
self = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
sentences = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
sg = 0
shrink_windows = True
sorted_vocab = 1
trim_rule = None
vector_size = 100
window = 5
workers = 1
gensim/models/word2vec.py:491: in build_vocab
report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
corpus_count = 9
corpus_file = None
corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
keep_raw_vocab = False
kwargs = {}
progress_per = 10000
self = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
total_words = 29
trim_rule = None
update = False
gensim/models/word2vec.py:772: in prepare_vocab
self.make_cum_table()
downsample_total = 3.5001157321504532
downsample_unique = 12
drop_total = 0
drop_unique = 0
dry_run = False
keep_raw_vocab = False
min_count = 1
original_total = 29
original_unique_total = 12
report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
retain_pct = 100.0
retain_total = 29
retain_unique_pct = 100.0
retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
sample = 0.001
self = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
threshold_count = 0.029
trim_rule = None
update = False
v = 2
w = 'minors'
word = 'minors'
word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>, domain = 2147483647
def make_cum_table(self, domain=2**31 - 1):
"""Create a cumulative-distribution table using stored vocabulary word counts for
drawing random words in the negative-sampling training routines.
To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
"""
vocab_size = len(self.wv.index_to_key)
self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
# compute sum of all power (Z in paper)
train_words_pow = 0.0
for word_index in range(vocab_size):
count = self.wv.get_vecattr(word_index, 'count')
> train_words_pow += count**self.ns_exponent
E ValueError: Integers to negative integer powers are not allowed.
count = 4
domain = 2147483647
self = <gensim.models.fasttext.FastText object at 0x7f5486875fd0>
train_words_pow = 0.0
vocab_size = 12
word_index = 0
gensim/models/word2vec.py:836: ValueError
_____________________________________________________________________________________ TestWord2VecModel.test_negative_ns_exp ______________________________________________________________________________________
self = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>
def test_negative_ns_exp(self):
# We expect that model should train, save, load and continue training without any exceptions
> model = word2vec.Word2Vec(sentences, ns_exponent=-1, min_count=1, workers=1)
self = <gensim.test.test_word2vec.TestWord2VecModel testMethod=test_negative_ns_exp>
gensim/test/test_word2vec.py:1059:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/models/word2vec.py:425: in __init__
self.build_vocab(corpus_iterable=corpus_iterable, corpus_file=corpus_file, trim_rule=trim_rule)
alpha = 0.025
batch_words = 10000
callbacks = ()
cbow_mean = 1
comment = None
compute_loss = False
corpus_file = None
corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
epochs = 5
hashfxn = <built-in function hash>
hs = 0
max_final_vocab = None
max_vocab_size = None
min_alpha = 0.0001
min_count = 1
negative = 5
ns_exponent = -1
null_word = 0
sample = 0.001
seed = 1
self = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
sentences = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
sg = 0
shrink_windows = True
sorted_vocab = 1
trim_rule = None
vector_size = 100
window = 5
workers = 1
gensim/models/word2vec.py:491: in build_vocab
report_values = self.prepare_vocab(update=update, keep_raw_vocab=keep_raw_vocab, trim_rule=trim_rule, **kwargs)
corpus_count = 9
corpus_file = None
corpus_iterable = [['human', 'interface', 'computer'], ['survey', 'user', 'computer', 'system', 'response', 'time'], ['eps', 'user', 'interface', 'system'], ['system', 'human', 'system', 'eps'], ['user', 'response', 'time'], ['trees'], ...]
keep_raw_vocab = False
kwargs = {}
progress_per = 10000
self = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
total_words = 29
trim_rule = None
update = False
gensim/models/word2vec.py:772: in prepare_vocab
self.make_cum_table()
downsample_total = 3.5001157321504532
downsample_unique = 12
drop_total = 0
drop_unique = 0
dry_run = False
keep_raw_vocab = False
min_count = 1
original_total = 29
original_unique_total = 12
report_values = {'downsample_total': 3, 'downsample_unique': 12, 'drop_unique': 0, 'num_retained_words': 12, ...}
retain_pct = 100.0
retain_total = 29
retain_unique_pct = 100.0
retain_words = ['human', 'interface', 'computer', 'survey', 'user', 'system', ...]
sample = 0.001
self = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
threshold_count = 0.029
trim_rule = None
update = False
v = 2
w = 'minors'
word = 'minors'
word_probability = 0.13491594578792296
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>, domain = 2147483647
def make_cum_table(self, domain=2**31 - 1):
"""Create a cumulative-distribution table using stored vocabulary word counts for
drawing random words in the negative-sampling training routines.
To draw a word index, choose a random integer up to the maximum value in the table (cum_table[-1]),
then finding that integer's sorted insertion point (as if by `bisect_left` or `ndarray.searchsorted()`).
That insertion point is the drawn index, coming up in proportion equal to the increment at that slot.
"""
vocab_size = len(self.wv.index_to_key)
self.cum_table = np.zeros(vocab_size, dtype=np.uint32)
# compute sum of all power (Z in paper)
train_words_pow = 0.0
for word_index in range(vocab_size):
count = self.wv.get_vecattr(word_index, 'count')
> train_words_pow += count**self.ns_exponent
E ValueError: Integers to negative integer powers are not allowed.
count = 4
domain = 2147483647
self = <gensim.models.word2vec.Word2Vec object at 0x7f54844e5580>
train_words_pow = 0.0
vocab_size = 12
word_index = 0
gensim/models/word2vec.py:836: ValueError
================================================================================================ warnings summary =================================================================================================
.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21
/home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
_deprecated()
-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================== slowest 20 durations ===============================================================================================
16.95s call gensim/test/test_doc2vec.py::TestDoc2VecModel::test_parallel
11.53s call gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs
8.29s call gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
7.08s call gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_skipgram
6.82s call gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs
6.74s call gensim/test/test_fasttext.py::test_sg_hs_training[False]
6.65s call gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[False]
6.36s call gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training
5.61s call gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_fixedwindowsize
5.47s call gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_cbow
5.29s call gensim/test/test_fasttext.py::TestWord2VecModel::test_sg_fixedwindowsize
5.14s call gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
5.13s call gensim/test/test_word2vec.py::TestWord2VecModel::test_load_old_models_3_x
4.82s call gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training_fromfile
4.69s call gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training
4.62s call gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_neg_fromfile
4.49s call gensim/test/test_fasttext.py::test_sg_hs_training[True]
4.47s call gensim/test/test_fasttext.py::SaveGensimByteIdentityTest::test_skipgram
4.35s call gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[True]
4.33s call gensim/test/test_doc2vec.py::TestDoc2VecModel::test_deterministic_dmc
============================================================================================= short test summary info =============================================================================================
FAILED gensim/test/test_word2vec.py::TestWord2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_doc2vec.py::TestDoc2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_fasttext.py::TestFastTextModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
FAILED gensim/test/test_fasttext.py::TestWord2VecModel::test_negative_ns_exp - ValueError: Integers to negative integer powers are not allowed.
SKIPPED [2] gensim/test/test_word2vec.py:637: bulk test only occasionally run locally
SKIPPED [1] gensim/test/test_doc2vec.py:249: See another test for posix above
SKIPPED [1] gensim/test/test_fasttext.py:1694: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1691: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1745: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1742: FT_HOME env variable not set, skipping test
========================================================================= 4 failed, 270 passed, 7 skipped, 1 warning in 305.29s (0:05:05) =========================================================================
ERROR: InvocationError for command /home/ivan/release/gensim/.tox/py38-linux/bin/pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py (exited with code 1)
_____________________________________________________________________________________________________ summary _____________________________________________________________________________________________________
ERROR: py38-linux: commands failed
(gensim) ivan@T490:~/release/gensim$
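The failing line is `count ** self.ns_exponent` in `make_cum_table`. The traceback suggests `count` comes back from the vocabulary as a NumPy integer, and NumPy refuses to raise an integer to a negative integer power (unlike plain Python ints). A minimal sketch of the behavior, independent of the actual fix committed in this PR:

```python
import numpy as np

# Vocabulary counts arrive as NumPy integers (as the traceback suggests),
# so a negative ns_exponent hits NumPy's integer-power restriction.
count = np.int64(4)

try:
    count ** -1  # NumPy integer ** negative integer
except ValueError as e:
    print(e)  # Integers to negative integer powers are not allowed.

# Casting either operand to float sidesteps the restriction:
print(float(count) ** -1)  # 0.25
print(count ** -1.0)       # 0.25
```

Plain Python `4 ** -1` returns `0.25` without complaint, which is why the error only surfaces once NumPy-typed counts meet a negative `ns_exponent`.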
And after the fix (in the next commit, f2b5db3), the tests pass successfully:

(gensim) ivan@T490:~/release/gensim$ tox -e py38-linux gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
py38-linux recreate: /home/ivan/release/gensim/.tox/py38-linux
py38-linux installdeps: pip>=19.1.1, .[test]
py38-linux installed: attrs==21.2.0,certifi==2021.10.8,charset-normalizer==2.0.7,Cython==0.29.24,gensim @ file:///home/ivan/release/gensim,idna==3.3,iniconfig==1.1.1,jsonpatch==1.32,jsonpointer==2.1,mock==4.0.3,Morfessor==2.0.6,nmslib==2.1.1,numpy==1.21.3,packaging==21.0,Pillow==8.4.0,pluggy==1.0.0,psutil==5.8.0,py==1.10.0,pybind11==2.6.1,pyemd==0.5.1,pyparsing==2.4.7,pytest==6.2.5,pyzmq==22.3.0,requests==2.26.0,scipy==1.7.1,six==1.16.0,smart-open==5.2.1,testfixtures==6.18.3,toml==0.10.2,torchfile==0.1.0,tornado==6.1,urllib3==1.26.7,visdom==0.1.8.9,websocket-client==1.2.1
py38-linux run-test-pre: PYTHONHASHSEED='1'
py38-linux run-test: commands[0] | python --version
Python 3.8.10
py38-linux run-test: commands[1] | pip --version
pip 21.2.4 from /home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/pip (python 3.8)
py38-linux run-test: commands[2] | python setup.py build_ext --inplace
running build_ext
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/corpora/_mmreader.cpython-38-x86_64-linux-gnu.so -> gensim/corpora
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/_matutils.cpython-38-x86_64-linux-gnu.so -> gensim
copying build/lib.linux-x86_64-3.8/gensim/models/nmf_pgd.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/similarities/fastss.cpython-38-x86_64-linux-gnu.so -> gensim/similarities
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_inner.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/word2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/fasttext_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
copying build/lib.linux-x86_64-3.8/gensim/models/doc2vec_corpusfile.cpython-38-x86_64-linux-gnu.so -> gensim/models
py38-linux run-test: commands[3] | pytest gensim/test/test_word2vec.py gensim/test/test_doc2vec.py gensim/test/test_fasttext.py
=============================================================================================== test session starts ===============================================================================================
platform linux -- Python 3.8.10, pytest-6.2.5, py-1.10.0, pluggy-1.0.0
cachedir: .tox/py38-linux/.pytest_cache
rootdir: /home/ivan/release/gensim, configfile: tox.ini
collected 281 items
gensim/test/test_word2vec.py ..........................s..................................................... [ 28%]
gensim/test/test_doc2vec.py ............................s................. [ 44%]
gensim/test/test_fasttext.py .............................................................................ssss.............................s............................................ [100%]
================================================================================================ warnings summary =================================================================================================
.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21
/home/ivan/release/gensim/.tox/py38-linux/lib/python3.8/site-packages/scipy/sparse/sparsetools.py:21: DeprecationWarning: `scipy.sparse.sparsetools` is deprecated!
scipy.sparse.sparsetools is a private module for scipy.sparse, and should not be used.
_deprecated()
-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================== slowest 20 durations ===============================================================================================
12.21s call gensim/test/test_doc2vec.py::TestDoc2VecModel::test_parallel
7.33s call gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs
7.19s call gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_skipgram
6.87s call gensim/test/test_fasttext.py::test_sg_hs_training[False]
6.77s call gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[False]
6.74s call gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs
5.76s call gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training
5.47s call gensim/test/test_fasttext.py::SaveFacebookFormatModelTest::test_cbow
5.39s call gensim/test/test_fasttext.py::TestWord2VecModel::test_sg_fixedwindowsize
5.35s call gensim/test/test_word2vec.py::TestWord2VecModel::test_sg_fixedwindowsize
5.24s call gensim/test/test_word2vec.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
5.17s call gensim/test/test_fasttext.py::TestWord2VecModel::test_evaluate_word_pairs_from_file
4.79s call gensim/test/test_fasttext.py::SaveGensimByteIdentityTest::test_skipgram
4.33s call gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training_fromfile
4.28s call gensim/test/test_fasttext.py::test_sg_hs_training[True]
4.26s call gensim/test/test_fasttext.py::test_cbow_hs_training_fromfile[False]
4.25s call gensim/test/test_fasttext.py::TestFastTextModel::test_sg_neg_training
4.21s call gensim/test/test_fasttext.py::test_sg_hs_training_fromfile[True]
3.95s call gensim/test/test_fasttext.py::test_cbow_hs_training[False]
3.88s call gensim/test/test_doc2vec.py::TestDoc2VecModel::test_training_fromfile
============================================================================================= short test summary info =============================================================================================
SKIPPED [2] gensim/test/test_word2vec.py:637: bulk test only occasionally run locally
SKIPPED [1] gensim/test/test_doc2vec.py:249: See another test for posix above
SKIPPED [1] gensim/test/test_fasttext.py:1694: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1691: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1745: FT_HOME env variable not set, skipping test
SKIPPED [1] gensim/test/test_fasttext.py:1742: FT_HOME env variable not set, skipping test
============================================================================== 274 passed, 7 skipped, 1 warning in 268.54s (0:04:28) ==============================================================================
_____________________________________________________________________________________________________ summary _____________________________________________________________________________________________________
py38-linux: commands succeeded
congratulations :)
LGTM
Tests are great, but I think an even better fix would be to just convert whatever is in `ns_exponent` to a plain `float` at the point of use. It's fewer changes, solves the same issue (including in older loaded models) – and would also armor the code against a user modifying the `ns_exponent` attribute directly.
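The defensive-cast idea suggested above can be sketched as follows. This is a hypothetical toy stand-in, not gensim's actual `Word2Vec` code — the class and method names here are invented for illustration only:

```python
class Word2VecLike:
    """Toy stand-in illustrating the defensive-cast idea from the review."""

    def __init__(self, ns_exponent=0.75):
        # The attribute may arrive as a str or int, e.g. from an old saved
        # model or a user assigning model.ns_exponent = "0.75" directly.
        self.ns_exponent = ns_exponent

    def effective_ns_exponent(self):
        # Cast at the point of use, so any stale or oddly-typed attribute
        # value still behaves like a float (including negative exponents).
        return float(self.ns_exponent)


m = Word2VecLike(ns_exponent="-0.5")
print(m.effective_ns_exponent())  # -0.5
```

Casting lazily at use time (rather than only in `__init__`) is what makes this robust against models deserialized from older versions, which never went through the constructor's validation.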
Codecov Report
@@ Coverage Diff @@
## develop #3250 +/- ##
===========================================
+ Coverage 78.88% 79.00% +0.11%
===========================================
Files 68 68
Lines 11772 11772
===========================================
+ Hits 9286 9300 +14
+ Misses 2486 2472 -14
Continue to review full report at Codecov.
LGTM!
Merged. Thank you @menshikh-iv !!
Fix #3232
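For context on what this PR fixes, here is a minimal sketch of how the `ns_exponent` parameter shapes the negative-sampling noise distribution: each word's raw frequency is raised to the exponent and then normalized. This is an illustrative simplification, not gensim's actual cumulative-table implementation:

```python
def noise_distribution(freqs, ns_exponent=0.75):
    """Return negative-sampling probabilities given raw word counts.

    ns_exponent=1.0 samples proportionally to frequency, 0.0 samples
    uniformly, and a negative value inverts the bias so that *rare*
    words are drawn more often than frequent ones.
    """
    weights = [f ** ns_exponent for f in freqs]
    total = sum(weights)
    return [w / total for w in weights]


counts = [100, 10, 1]  # a frequent, a medium, and a rare word
for exp in (1.0, 0.75, 0.0, -0.5):
    probs = noise_distribution(counts, exp)
    print(exp, [round(p, 3) for p in probs])
```

With `ns_exponent=-0.5` the rare word ends up with the highest sampling probability, which is exactly the behavior this PR makes work correctly.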