AssertionError: sparse documents must not contain any explicit zero entries and the similarity matrix S must satisfy x^T * S * x > 0 for any nonzero bag-of-words vector x. #2105

DennisCologne · 2018-06-26T09:38:57Z

Hello there,

Maybe you can help me out with this real quick. I cannot run any of your examples: neither the one from https://radimrehurek.com/gensim/similarities/docsim.html nor the one from this repo. All of them give me the following assertion error.

AssertionError: sparse documents must not contain any explicit zero entries and the similarity matrix S must satisfy x^T * S * x > 0 for any nonzero bag-of-words vector x.

This is not working (other similarity measures of this module work fine):

from gensim.test.utils import common_texts
from gensim.corpora import Dictionary
from gensim.models import Word2Vec
from gensim.similarities import SoftCosineSimilarity

model = Word2Vec(common_texts, size=20, min_count=1)  # train word-vectors
dictionary = Dictionary(common_texts)
bow_corpus = [dictionary.doc2bow(document) for document in common_texts]

similarity_matrix = model.wv.similarity_matrix(dictionary)  # construct similarity matrix
index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)

# Make a query.
query = 'graph trees computer'.split()
# calculate similarity between query and each doc from bow_corpus
sims = index[dictionary.doc2bow(query)]

This one from the repo does not work either (I followed all previous steps):

similarity = softcossim(sentence_obama, sentence_orange, similarity_matrix)
print('similarity = %.4f' % similarity)

Thanks in advance. I have been trying to run this for two days now, but nothing works.

Best,
Dennis


piskvorky · 2018-06-26T19:57:15Z

@Witiko can you have a look?

Witiko · 2018-06-26T21:23:35Z

Hey @DennisCologne,

sorry to say I am the author of the code that gives you trouble. What Gensim and Python versions are you using? I can run the above code without issue using the PyPI version of Gensim (3.4.0) and Python 3.5.

>>> sims
[(6, 0.8305764039419705),
 (7, 0.7257781024707816),
 (5, 0.5584027708699971),
 (0, 0.43455470767273646),
 (8, 0.4082457402348116),
 (1, 0.3028528215099456),
 (3, 0.09251811314306692),
 (4, 0.07636744554253587),
 (2, 0.04509321490371689)]

DennisCologne · 2018-06-27T07:53:27Z

Hi @Witiko,

thank you for your answer.

Actually, it is Python 2.7.14 with Gensim 3.4.0... after further investigation, the matrix-vector multiplication returns a negative value even though all of the values in both are positive.

But you are right, I just tried it on my Python 3.6 environment and there it works fine.
I guess I will use this environment then. But this problem might still be interesting for you.

Thanks again for the quick reply.

Best,
Dennis

Witiko · 2018-06-28T16:56:14Z

Hey @DennisCologne,

this is definitely interesting, but I can't seem to reproduce your problem even with Python 2.7 and Gensim 3.4.0. Can you find a pair of document vectors vec1, and vec2 that trigger the issue, call softcossim(vec1, vec2), and share what the content of vec1, vec2, dense_matrix, vec1len, and vec2len is just before the failing assertion?
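
For context, the quantity guarded by the assertion can be sketched in pure Python. This is a minimal illustration of the soft cosine formula x^T * S * y / (sqrt(x^T * S * x) * sqrt(y^T * S * y)) with dense lists, not the Gensim implementation; the names `bilinear` and `soft_cosine` are made up for this example. The normalization takes square roots of x^T * S * x and y^T * S * y, which is why both must be positive.

```python
import math

def bilinear(x, S, y):
    """Return x^T * S * y for dense vectors x, y and a dense square matrix S."""
    return sum(x[i] * S[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))

def soft_cosine(x, y, S):
    """Soft cosine similarity of x and y under the term similarity matrix S."""
    xSx = bilinear(x, S, x)
    ySy = bilinear(y, S, y)
    # Mirrors the condition in the assertion message: both norms must be positive.
    if xSx <= 0 or ySy <= 0:
        raise ValueError("x^T * S * x must be positive for nonzero x")
    return bilinear(x, S, y) / (math.sqrt(xSx) * math.sqrt(ySy))

# Two terms with similarity 0.5; each toy document contains exactly one of them.
S = [[1.0, 0.5],
     [0.5, 1.0]]
doc1 = [1.0, 0.0]
doc2 = [0.0, 1.0]
similarity = soft_cosine(doc1, doc2, S)
```

With an identity matrix for `S` the two toy documents would share nothing; the off-diagonal term similarity is what makes them partially similar.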

menshikh-iv · 2018-07-31T04:08:58Z

ping @DennisCologne, please provide the information needed to reproduce the error (as requested in #2105 (comment))

menshikh-iv · 2018-08-07T15:32:37Z

ping @DennisCologne

tvrbanec · 2019-01-25T10:37:35Z

Similar issue with SoftCosineSimilarity.
Please check at https://groups.google.com/forum/#!topic/gensim/WVTRdZONtrc
Python2.7, gensim 3.7

piskvorky · 2019-01-25T10:49:13Z

ping @Witiko

Witiko · 2019-01-25T11:17:08Z

I fail to see how this is related to the current issue, which should have been long closed due to the original poster's inactivity and the migration of the related code in Gensim 3.7.

tvrbanec · 2019-01-25T11:22:13Z

Assertion Error + SoftCosineSimilarity = Not related?
I will present the full code if you'll try to resolve the issue. Do you prefer that I open a new issue?

Witiko · 2019-01-25T11:27:21Z

The assertion error in this issue is supposed to come from the code in the pre-3.7 softcossim method, which used to reside in gensim.matutils and has since moved to the gensim.similarities.termsim module. Your issue is with the gensim.models.keyedvectors module.

tvrbanec · 2019-01-25T11:36:21Z

def softcosinesim(texts):
    model = Word2Vec(texts, size=20, min_count=1)  # train word-vectors
    termsim_index = WordEmbeddingSimilarityIndex(model)
    dictionary = Dictionary(texts)
    bow_corpus = [dictionary.doc2bow(document) for document in texts]
    similarity_matrix = TermSimilarityMatrix(termsim_index, dictionary)  # construct similarity matrix
    docsim_index = SoftCosineSimilarity(bow_corpus, similarity_matrix, num_best=10)
    sims = docsim_index[bow_corpus]  # calculate similarity of query to each doc from bow_corpus
    return sims

Traceback (most recent call last):
termsim_index = WordEmbeddingSimilarityIndex(model)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1389, in __init__
assert isinstance(keyedvectors, WordEmbeddingsKeyedVectors)
AssertionError

What is wrong with this code that SoftCosineSimilarity doesn't like it? I tried to follow tutorial...

Witiko · 2019-01-25T11:39:03Z

For some reason, your word embeddings do not have the WordEmbeddingsKeyedVectors type. What type do they have?

tvrbanec · 2019-01-25T11:41:37Z

I am using gensim Word2Vec to generate w2v_model.

Witiko · 2019-01-25T12:01:10Z

Your issue above can be resolved by calling WordEmbeddingSimilarityIndex(model.wv) instead of WordEmbeddingSimilarityIndex(model). I will update the code, so that it is more aware of the distinction between BaseAny2VecModel (model) and WordEmbeddingsKeyedVectors (model.wv).
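
The distinction can be illustrated without Gensim. The sketch below uses stand-in classes (`FakeKeyedVectors`, `FakeModel`) and a hypothetical helper `as_keyed_vectors` that unwraps a model's `.wv` attribute; none of these names exist in Gensim, and this is not necessarily how the library handles it.

```python
class FakeKeyedVectors(object):
    """Stand-in for the object that stores the word vectors (model.wv)."""

class FakeModel(object):
    """Stand-in for a trained model that exposes its vectors as .wv."""
    def __init__(self):
        self.wv = FakeKeyedVectors()

def as_keyed_vectors(obj):
    """Return obj.wv when obj is a full model, otherwise obj itself."""
    candidate = getattr(obj, "wv", obj)
    if not isinstance(candidate, FakeKeyedVectors):
        raise TypeError("expected keyed vectors or a model with a .wv attribute")
    return candidate

model = FakeModel()
vectors_from_model = as_keyed_vectors(model)   # the model is unwrapped
vectors_direct = as_keyed_vectors(model.wv)    # keyed vectors pass through
```

Raising a `TypeError` with a message, rather than relying on a bare `assert`, tells the caller what kind of object was expected.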

Witiko · 2019-01-25T12:44:29Z

I cannot reproduce your other issue, i.e. model.wv.similarity_matrix throwing a TypeError:

>>> from gensim.corpora import Dictionary
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> dictionary = Dictionary(common_texts)
>>> model.wv.similarity_matrix(dictionary)
<12x12 sparse matrix of type '<type 'numpy.float32'>'
        with 68 stored elements in Compressed Sparse Column format>

Can you run the above code without issue?

tvrbanec · 2019-01-25T15:46:13Z

Can you run the above code without issue?

Yes, I can.

tvrbanec · 2019-01-25T15:49:26Z

Now, a few steps forward, for:
similarity_matrix = TermSimilarityMatrix(termsim_index, dictionary)
I got:
NameError: global name 'TermSimilarityMatrix' is not defined

Witiko · 2019-01-25T16:02:21Z

Please, try the following:

>>> from gensim.similarities import SparseTermSimilarityMatrix
>>> 
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

tvrbanec · 2019-01-25T16:04:20Z

similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary) # construct similarity matrix
File "/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.py", line 234, in __init__
for term, similarity in index.most_similar(t1, num_rows)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 1401, in most_similar
for t2, similarity in most_similar:
TypeError: 'numpy.float32' object is not iterable

tvrbanec · 2019-01-25T16:31:56Z

Maybe the problem is caused by terms with underscores, like 'chemical_element' or 'cabinet_minister'?

Witiko · 2019-01-25T16:59:41Z

I cannot reproduce your issue with new embeddings:

>>> from gensim.corpora import Dictionary
>>> from gensim.models.keyedvectors import WordEmbeddingSimilarityIndex
>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.similarities import SparseTermSimilarityMatrix
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> 
>>> termsim_index = WordEmbeddingSimilarityIndex(model.wv)
>>> dictionary = Dictionary(common_texts)
>>> similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
>>> similarity_matrix
<gensim.similarities.termsim.SparseTermSimilarityMatrix object at 0x7f822abc3d10>

Judging by the error message, model.wv.most_similar returns a number, not an iterable. Can you print the result of model.wv.most_similar(positive=['chemical_element'], topn=2), please?

tvrbanec · 2019-01-25T17:16:37Z

For common_texts, output is:

model.wv.most_similar(positive=['chemical_element'], topn=2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-13a509737ea2> in <module>()
----> 1 model.wv.most_similar(positive=['chemical_element'], topn=2)

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    541                 mean.append(weight * word)
    542             else:
--> 543                 mean.append(weight * self.word_vec(word, use_norm=True))
    544                 if word in self.vocab:
    545                     all_words.add(self.vocab[word].index)

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in word_vec(self, word, use_norm)
    462             return result
    463         else:
--> 464             raise KeyError("word '%s' not in vocabulary" % word)
    465 
    466     def get_vector(self, word):

KeyError: "word 'chemical_element' not in vocabulary"

Witiko · 2019-01-25T17:18:02Z

Can you please try with the embeddings that throw the TypeError: 'numpy.float32' object is not iterable exception? I understand that these should contain an embedding for the word chemical_element.

tvrbanec · 2019-01-25T17:18:59Z

For my text it stops even before:

In [12]: similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-12-abfb8b1569f4> in <module>()
----> 1 similarity_matrix = SparseTermSimilarityMatrix(termsim_index, dictionary)

/usr/local/lib/python2.7/dist-packages/gensim/similarities/termsim.pyc in __init__(self, source, dictionary, tfidf, symmetric, positive_definite, nonzero_limit, dtype)
    232             most_similar = [
    233                 (dictionary.token2id[term], similarity)
--> 234                 for term, similarity in index.most_similar(t1, num_rows)
    235                 if term in dictionary.token2id]
    236 

/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.pyc in most_similar(self, t1, topn)
   1399         else:
   1400             most_similar = self.keyedvectors.most_similar(positive=[t1], topn=topn, **self.kwargs)
-> 1401             for t2, similarity in most_similar:
   1402                 if similarity > self.threshold:
   1403                     yield (t2, similarity**self.exponent)

TypeError: 'numpy.float32' object is not iterable

Witiko · 2019-01-25T17:21:13Z

As you can see on line 1400 in the error message above, SparseTermSimilarityMatrix calls model.wv.most_similar internally. According to the error message, the result of calling model.wv.most_similar is a float, not an iterable. This is highly suspect.

Therefore, can you please print the result of model.wv.most_similar(positive=['chemical_element'], topn=2) instead of calling the SparseTermSimilarityMatrix constructor? As you noted, there is no issue when you construct the model using common_texts, so this seems to be an issue with your embeddings.

tvrbanec · 2019-01-25T17:24:20Z

Thank you for your patience: :)

In [13]: model.wv.most_similar(positive=['chemical_element'], topn=2)
Out[13]: [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

Witiko · 2019-01-25T17:27:02Z

This seems pretty iterable to me.

tvrbanec · 2019-01-25T17:32:56Z

Does my text make an error at your computer?

Witiko · 2019-01-25T17:34:00Z

Let's try to closely imitate the call on line 1400. Can you please print the result of the following:

>>> termsim_index.kwargs
>>> termsim_index.keyedvectors
>>> most_similar = termsim_index.keyedvectors.most_similar(positive=['chemical_element'], topn=100)
>>> most_similar
>>> type(most_similar)
>>> '__iter__' in most_similar

Witiko · 2019-01-25T17:34:21Z

My text does not make an error at your computer?

~~What is your text?~~ Nevermind, I see it now.

tvrbanec · 2019-01-25T17:37:57Z

In [15]: termsim_index.kwargs
Out[15]: {}

In [16]: termsim_index.keyedvectors
Out[16]: <gensim.models.keyedvectors.Word2VecKeyedVectors at 0x7f18fca001d0>

In [17]: most_similar = termsim_index.keyedvectors.most_similar(positive=['chemical_element'], topn=100)

In [18]: model.wv.most_similar(positive=['chemical_element'], topn=2)
Out[18]: [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

In [19]: most_similar
Out[19]: 
[('inhabitant', 0.93882817029953),
 ('the', 0.9326512813568115),
 ('give', 0.93118816614151),
 ('have', 0.9303354620933533),
 ('act', 0.928438663482666),
 ('one', 0.9224538803100586),
 ('to', 0.9192750453948975),
 ('china', 0.9141452312469482),
 ('associate_degree', 0.9119006395339966),
 ('with', 0.9078292846679688),
 ('be', 0.9045330286026001),
 ('statement', 0.898809552192688),
 ('which', 0.8987339735031128),
 ('vote', 0.89339280128479),
 ('time_period', 0.89242023229599),
 ('of', 0.8907227516174316),
 ('playing_card', 0.8895689249038696),
 ('first', 0.8888296484947205),
 ('oregon', 0.8866477608680725),
 ('merkel', 0.8860795497894287),
 ('person', 0.8851599097251892),
 ('from', 0.8846421241760254),
 ('in', 0.8816125988960266),
 ('and', 0.8810371160507202),
 ('this', 0.8801096081733704),
 ('make', 0.8777452707290649),
 ('meet', 0.8769802451133728),
 ('besides', 0.8752848505973816),
 ('angular_distance', 0.873124361038208),
 ('that', 0.8714672327041626),
 ('on', 0.8699379563331604),
 ('other', 0.8691580891609192),
 ('change', 0.8684202432632446),
 ('obama', 0.8667253851890564),
 ('communication', 0.8621015548706055),
 ('engineering', 0.8615524172782898),
 ('some', 0.8598195314407349),
 ('now', 0.8572754859924316),
 ('exchange', 0.8560868501663208),
 ('for', 0.8554658889770508),
 ('title', 0.8532639741897583),
 ('express', 0.8532208204269409),
 ('right', 0.8518909811973572),
 ('head_of_state', 0.847177267074585),
 ('free', 0.846038281917572),
 ('remove', 0.8458209037780762),
 ('germany', 0.8454596996307373),
 ('union', 0.8446109294891357),
 ('would', 0.8416316509246826),
 ('faculty', 0.8411930799484253),
 ('weekday', 0.8399801850318909),
 ('merely', 0.8379250764846802),
 ('we', 0.8371882438659668),
 ('political_unit', 0.8370255827903748),
 ('work', 0.8348655104637146),
 ('take', 0.8348475694656372),
 ('administrative_district', 0.8343826532363892),
 ('tpp', 0.833882749080658),
 ('administrator', 0.8318067789077759),
 ('united_nations_agency', 0.8316440582275391),
 ('washington', 0.8313312530517578),
 ('politician', 0.8289576768875122),
 ('legislature', 0.8287457227706909),
 ('plan_of_action', 0.8201491832733154),
 ('management', 0.8187181949615479),
 ('federal', 0.8167140483856201),
 ('new', 0.8154265880584717),
 ('travel', 0.8148607015609741),
 ('not', 0.8135936856269836),
 ('about', 0.8135201334953308),
 ('republican', 0.8131340742111206),
 ('him', 0.8047671318054199),
 ('by', 0.8038091659545898),
 ('associate', 0.8037841320037842),
 ('activity', 0.8029162287712097),
 ('structure', 0.8025172352790833),
 ('pacific', 0.799057126045227),
 ('point', 0.7987416982650757),
 ('more', 0.7969338893890381),
 ('message', 0.7965559959411621),
 ('organization', 0.7899693250656128),
 ('digit', 0.7889872789382935),
 ('connect', 0.7889586687088013),
 ('when', 0.7868154048919678),
 ('result', 0.7862980961799622),
 ('his', 0.7852383852005005),
 ('they', 0.783265233039856),
 ('schulz', 0.7814303636550903),
 ('group_action', 0.7772569060325623),
 ('european', 0.7769173979759216),
 ('large_integer', 0.775283932685852),
 ('under', 0.7743880748748779),
 ('inform', 0.771774172782898),
 ('mexico', 0.7684292793273926),
 ('against', 0.7668302059173584),
 ('steinmeier', 0.7626404762268066),
 ('supply', 0.7593228816986084),
 ('better', 0.7585717439651489),
 ('support', 0.7579919695854187),
 ('change_state', 0.7550258636474609)]

In [20]: type(most_similar)
Out[20]: list

In [21]: '__iter__' in most_similar
Out[21]: False
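
Note that `'__iter__' in most_similar` is a membership test: it asks whether the string `'__iter__'` is one of the list's elements, so `False` is expected here even though the list is iterable. A minimal sketch of checks that do test iterability, using a shortened copy of the output above (the helper name `is_iterable` is made up for this example):

```python
most_similar = [('inhabitant', 0.93882817029953), ('the', 0.9326512813568115)]

# Membership test: is the string '__iter__' an element of the list? No.
membership = '__iter__' in most_similar

# Attribute test: does the object provide the iteration protocol? Yes.
has_iter = hasattr(most_similar, '__iter__')

def is_iterable(obj):
    """Return True if iter(obj) succeeds, False otherwise."""
    try:
        iter(obj)
    except TypeError:
        return False
    return True

list_is_iterable = is_iterable(most_similar)
float_is_iterable = is_iterable(0.93882817029953)
```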

Witiko · 2019-01-25T17:39:50Z

I can reproduce this with your text and I am investigating.

Witiko · 2019-01-25T17:53:01Z

The issue is that the most_similar method returns weird results with topn=0:

>>> from gensim.models.word2vec import Word2Vec
>>> from gensim.test.utils import common_texts
>>> 
>>> model = Word2Vec(common_texts, size=20, min_count=1)
>>> model.wv.most_similar(positive=['computer'], topn=2)
[('response', 0.38100379705429077), ('minors', 0.3752439618110657)]
>>> model.wv.most_similar(positive=['computer'], topn=0)
array([-0.1180886 ,  0.32174808, -0.02938104, -0.21145007,  0.37524396,
       -0.23777878,  0.99999994, -0.01436211,  0.36708638, -0.09770551,
        0.05963777,  0.3810038 ], dtype=float32)

This is an undocumented behavior, which can be fixed by removing lines 554 and 555 in keyedvectors.py. Sadly, I don't see how a caller can easily patch this up without changing the package code. Afterwards, you will get the expected result and, more importantly, SparseTermSimilarityMatrix should now work.

>>> model.wv.most_similar(positive=['computer'], topn=0)
[]
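
To see why this breaks the `for t2, similarity in most_similar` loop from the traceback, here is a toy reproduction in plain Python. The function `toy_most_similar` is a stand-in that assumes the pre-fix behavior was an early return of the raw scores whenever `topn` was falsy; it is not Gensim's code, and the scores are toy values. Iterating over raw scores yields bare floats, and unpacking a float into `(t2, similarity)` raises a `TypeError`.

```python
def toy_most_similar(scores, topn, return_raw_when_falsy):
    """Return the topn (term, score) pairs from a dict of scores, best first."""
    if return_raw_when_falsy and not topn:
        # Assumed pre-fix behavior: hand back bare scores, not (term, score) pairs.
        return list(scores.values())
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:topn]

def consume(results):
    """Unpack every result the way the failing loop in the traceback does."""
    return [(t2, similarity) for t2, similarity in results]

# Toy scores, chosen only for illustration.
scores = {'response': 0.381, 'minors': 0.375, 'graph': -0.118}

top_two = consume(toy_most_similar(scores, 2, return_raw_when_falsy=True))
fixed_empty = consume(toy_most_similar(scores, 0, return_raw_when_falsy=False))

try:
    consume(toy_most_similar(scores, 0, return_raw_when_falsy=True))
    raised = False
except TypeError:
    # A bare float cannot be unpacked into (t2, similarity).
    raised = True
```

With the early return disabled, `topn=0` yields an empty list and the loop simply does nothing.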

Witiko · 2019-01-25T18:12:36Z

The patches are now available in #2356. Thank you for your patience in helping discover the bug and sorry for the trouble. 😉

Vineet-Sharma29 · 2020-07-28T18:19:41Z

I have the following code:

model = KeyedVectors.load_word2vec_format('/home/vineet/Downloads/lemmatized-legal/no replacement/legal_lemmatized_no_replacement.bin', binary=True)

bow_corpus, doc_dict = corpora.MmCorpus('./bow_corpus.mm'), corpora.Dictionary.load('./doc_dict.dict')

# compute cosine similarity between word embeddings
termsim_index = WordEmbeddingSimilarityIndex(model)

# construct term similarity matrix
similarity_matrix = SparseTermSimilarityMatrix(termsim_index, doc_dict)

And it gives me the following error:

File "word2vec.py", line 25, in <module>
    similarity_matrix = SparseTermSimilarityMatrix(termsim_index, doc_dict)
  File "/home/vineet/.local/lib/python3.6/site-packages/gensim/similarities/termsim.py", line 264, in __init__
    100.0 * matrix.getnnz() / matrix_order**2)
ZeroDivisionError: float division by zero

What are the probable reasons for it, and how can I resolve it?

Witiko · 2020-07-28T18:30:46Z

It seems as though your matrix_order is zero, which would indicate that your doc_dict dictionary is empty. Can you verify?
We should check for this and raise a ValueError with a user-friendly message earlier in the constructor.
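
A minimal sketch of such a check, assuming a hypothetical helper `nonzero_percentage` that computes the same ratio as line 264 in the traceback above; the function name and the error message are illustrative, not part of Gensim:

```python
def nonzero_percentage(nnz, matrix_order):
    """Return the percentage of nonzero entries in a square matrix."""
    if matrix_order == 0:
        # Fail early with a readable message instead of a ZeroDivisionError.
        raise ValueError("the dictionary is empty, so the term similarity matrix has order zero")
    return 100.0 * nnz / matrix_order ** 2

# 68 stored elements in a 12x12 matrix, as in the common_texts example earlier in the thread.
percentage = nonzero_percentage(68, 12)

try:
    nonzero_percentage(0, 0)
    error_message = None
except ValueError as error:
    error_message = str(error)
```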

menshikh-iv added the "need info" label (not enough information to reproduce the issue, need more info from the author) Jul 31, 2018

Witiko mentioned this issue Jan 25, 2019

Fix WordEmbeddingsKeyedVectors.most_similar #2356

Merged

piskvorky closed this as completed Feb 2, 2019

Witiko mentioned this issue May 17, 2019

SparseTermSimilarityMatrix - TypeError: 'numpy.float32' object is not iterable #2496

Closed

Witiko mentioned this issue Jul 29, 2020

Reduce memory use of the term similarity matrix constructor, deprecate the positive_definite parameter, and extend normalization capabilities of the inner_product method #2783

Merged

Repository owner locked as resolved and limited conversation to collaborators Jul 29, 2020
