-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix SMART from TfidfModel for case when df == "n"
. Fix #2020
#2021
Conversation
This reverts commit 6add3ba.
Nice catch, thanks for investigation @PeteBleackley 👍 CC: @markroxor |
df == "n"
. Fix #2020
@PeteBleackley can you fix the broken test, please? |
@PeteBleackley last broken _______________________ TestTfidfModel.test_consistency _______________________
322
323self = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_consistency>
324
325 def test_consistency(self):
326 docs = [corpus[1], corpus[2]]
327
328 # Test if `ntc` yields the default docs.
329 model = tfidfmodel.TfidfModel(self.corpus, smartirs='ntc')
330 transformed_docs = [model[docs[0]], model[docs[1]]]
331
332 model = tfidfmodel.TfidfModel(self.corpus)
333 expected_docs = [model[docs[0]], model[docs[1]]]
334
335 self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
336 self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
337
338 # Testing all the variations of `wlocal`
339 # nnn
340 model = tfidfmodel.TfidfModel(self.corpus, smartirs='nnn')
341 transformed_docs = [model[docs[0]], model[docs[1]]]
342 expected_docs = docs[:]
343
344 self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
345 self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
346
347 # lnn
348 model = tfidfmodel.TfidfModel(self.corpus, smartirs='lnn')
349 transformed_docs = [model[docs[0]], model[docs[1]]]
350 expected_docs = [
351 [(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)],
352 [(5, 2.0), (9, 1.0), (10, 1.0)]
353 ]
354
355 self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
356 self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
357
358 # ann
359 model = tfidfmodel.TfidfModel(self.corpus, smartirs='ann')
360 transformed_docs = [model[docs[0]], model[docs[1]]]
361 expected_docs = [
362 [(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)],
363 [(5, 1.0), (9, 0.75), (10, 0.75)]
364 ]
365
366 self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
367 self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
368
369 # bnn
370 model = tfidfmodel.TfidfModel(self.corpus, smartirs='bnn')
371 transformed_docs = [model[docs[0]], model[docs[1]]]
372 expected_docs = [
373 [(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
374 [(5, 1), (9, 1), (10, 1)]
375 ]
376
377 self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
378 self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
379
380 # Lnn
381 model = tfidfmodel.TfidfModel(self.corpus, smartirs='Lnn')
382 transformed_docs = [model[docs[0]], model[docs[1]]]
383 expected_docs = [
384 [
385 (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0),
386 (7, 1.0), (8, 1.0)
387 ],
388 [
389 (5, 3.627141918134611), (9, 1.8135709590673055), (10, 1.8135709590673055)
390 ]
391 ]
392
393> self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
394E AssertionError: False is not true
395
396docs = [[(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(5, 2), (9, 1), (10, 1)]]
397expected_docs = [[(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)], [(5, 3.627141918134611), (9, 1.8135709590673055), (10, 1.8135709590673055)]]
398model = <gensim.models.tfidfmodel.TfidfModel object at 0x07004030>
399self = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_consistency>
400transformed_docs = [[(3, 0.73178964131150992), (4, 0.73178964131150992), (5, 0.73178964131150992), (6, 0.73178964131150992), (7, 0.73178964131150992), (8, 0.73178964131150992)], [(5, 1.2090473060448703), (9, 0.60452365302243516), (10, 0.60452365302243516)]]
401 |
@PeteBleackley |
How do I get test_tfidfmodel.py to import gensim from my local git repository rather than my installed copy? |
Also, can you please check these calculations of the expected values for Lnn for me? |
@PeteBleackley install gensim directly from your local repository # cd to the repository folder
pip install -e . |
@PeteBleackley about check calculation - of course, but not now (probably next week, sorry). Can you share |
Found the problem with Lnn - a closing bracket was in the wrong place. |
OK, the problem with ntn appears to be in the testcorpus.mm file. Can fix this by calling the constructor with corpus instead of self.corpus. |
Please close this PR as per the comment. |
@markroxor the current implementation doesn't match what is described in https://en.m.wikipedia.org/wiki/SMART_Information_Retrieval_System, and crashes when *nc is used with large corpora. |
…cement of a parenthesis. Updated predicted values for unit tests
Still having problems with test_persistence and test_persistence_compressed, but I've checked in the latest version of the code so that other people can see the latest version and hopefully suggest a solution to the problem with those tests |
@@ -169,103 +171,100 @@ def test_consistency(self): | |||
docs = [corpus[1], corpus[2]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a self.
to corpus
here and everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed self.corpus
to corpus
because my calculations for expected_docs are based on the corpus given at the top of test_tfidfmodel.py, and self.corpus
(based on test_data/test_corpus.mm) doesn't give matching results
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If self.corpus
is not used anywhere I suggest removing it from the setUp
method.
@menshikh-iv please suggest the corpus which should be chosen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
up to you guys, it is not so important
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Just one last step, since we are not using testcorpus.mm
(defined in the setUp
method) can you replace it with text
, dictionary
and corpus
defined at the top of the code.
After that it is ready to merge @menshikh-iv
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ping @PeteBleackley
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still using self.corpus in the persistence tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use either self.corpus
or corpus
, the redundancy is polluting the code.
For testing your code locally I will recommend using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please run flake8 gensim/
locally before pushing to check for pep8 errors.
gensim/models/tfidfmodel.py
Outdated
@@ -118,6 +118,7 @@ def precompute_idfs(wglobal, dfs, total_docs): | |||
# not strictly necessary and could be computed on the fly in TfidfModel__getitem__. | |||
# this method is here just to speed things up a little. | |||
return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this extra line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Various blank lines are the ghosts of print
s put in during debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are all pep8 errors which will error out your builds. Please use flake8 gensim/
command as suggested before.
gensim/models/tfidfmodel.py
Outdated
elif n_df == "p": | ||
return np.log((1.0 * totaldocs - docfreq) / docfreq) / np.log(2) | ||
return max(0,np.log2((1.0 * totaldocs - docfreq) / docfreq)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return max(0, np.log2((1.0 * totaldocs - docfreq) / docfreq))
gensim/models/tfidfmodel.py
Outdated
@@ -295,6 +297,7 @@ def __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=utils.iden | |||
self.id2word = id2word | |||
self.wlocal, self.wglobal, self.normalize = wlocal, wglobal, normalize | |||
self.num_docs, self.num_nnz, self.idfs = None, None, None | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove whitespaces in this line.
gensim/models/tfidfmodel.py
Outdated
@@ -371,6 +374,7 @@ def initialize(self, corpus): | |||
numnnz += len(bow) | |||
for termid, _ in bow: | |||
dfs[termid] = dfs.get(termid, 0) + 1 | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this line.
gensim/test/test_tfidfmodel.py
Outdated
@@ -39,6 +40,7 @@ class TestTfidfModel(unittest.TestCase): | |||
def setUp(self): | |||
self.corpus = MmCorpus(datapath('testcorpus.mm')) | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this line.
gensim/test/test_tfidfmodel.py
Outdated
] | ||
] | ||
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove whitespaces in this line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please take a look here
gensim/test/test_tfidfmodel.py:115: AssertionError
__________________ TestTfidfModel.test_persistence_compressed __________________
self = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_persistence_compressed>
def test_persistence_compressed(self):
# Test persistence without using `smartirs`
fname = get_tmpfile('gensim_models.tst.gz')
model = tfidfmodel.TfidfModel(self.corpus, normalize=True)
model.save(fname)
model2 = tfidfmodel.TfidfModel.load(fname, mmap=None)
self.assertTrue(model.idfs == model2.idfs)
tstvec = [corpus[1], corpus[2]]
self.assertTrue(np.allclose(model[tstvec[0]], model2[tstvec[0]]))
self.assertTrue(np.allclose(model[tstvec[1]], model2[tstvec[1]]))
self.assertTrue(np.allclose(model[[]], model2[[]])) # try projecting an empty vector
# Test persistence with using `smartirs`
fname = get_tmpfile('gensim_models_smartirs.tst.gz')
model = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc")
model.save(fname)
model2 = tfidfmodel.TfidfModel.load(fname, mmap=None)
self.assertTrue(model.idfs == model2.idfs)
tstvec = [corpus[1], corpus[2]]
self.assertTrue(np.allclose(model[tstvec[0]], model2[tstvec[0]]))
self.assertTrue(np.allclose(model[tstvec[1]], model2[tstvec[1]]))
self.assertTrue(np.allclose(model[[]], model2[[]])) # try projecting an empty vector
# Test persistence between Gensim v3.2.0 and current compressed model.
model3 = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc")
model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst.bz2'))
> self.assertTrue(model3.idfs == model4.idfs)
E AssertionError: False is not true
fname = '/tmp/gensim_models_smartirs.tst.gz'
model = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1f10>
model2 = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1e50>
model3 = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1910>
model4 = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1f90>
self = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_persistence_compressed>
tstvec = [[(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(5, 2), (9, 1), (10, 1)]]
Tests are failing because files tfidf_model.tst.bz2
and tfidf_model.tst
were pushed by me. Too maintain the persistence you need to create these files manually for the modified version.
Hint - I will suggest using model.save(fname)
.
Also while you are on it please use more intuitive file name while creating the model like - tfidf_smartirs_model.tst
Thank you @PeteBleackley for detailed investigation, LGTM for me, do you agree @markroxor? |
Fixed bad local scaling, as described in issue #2020