Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SMART from TfidfModel for case when df == "n". Fix #2020 #2021

Merged
merged 31 commits into from
Apr 10, 2018
Merged

Fix SMART from TfidfModel for case when df == "n". Fix #2020 #2021

merged 31 commits into from
Apr 10, 2018

Conversation

PeteBleackley
Copy link
Contributor

Fixed bad local scaling, as described in issue #2020

@menshikh-iv
Copy link
Contributor

Nice catch, thanks for investigation @PeteBleackley 👍

CC: @markroxor

@menshikh-iv menshikh-iv changed the title Fix bug in TfidfModel Fix SMART from TfidfModel for case when df == "n". Fix #2020 Apr 6, 2018
@menshikh-iv
Copy link
Contributor

@PeteBleackley can you fix the broken test, please?

@menshikh-iv
Copy link
Contributor

@PeteBleackley last broken

_______________________ TestTfidfModel.test_consistency _______________________
322
323self = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_consistency>
324
325    def test_consistency(self):
326        docs = [corpus[1], corpus[2]]
327    
328        # Test if `ntc` yields the default docs.
329        model = tfidfmodel.TfidfModel(self.corpus, smartirs='ntc')
330        transformed_docs = [model[docs[0]], model[docs[1]]]
331    
332        model = tfidfmodel.TfidfModel(self.corpus)
333        expected_docs = [model[docs[0]], model[docs[1]]]
334    
335        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
336        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
337    
338        # Testing all the variations of `wlocal`
339        # nnn
340        model = tfidfmodel.TfidfModel(self.corpus, smartirs='nnn')
341        transformed_docs = [model[docs[0]], model[docs[1]]]
342        expected_docs = docs[:]
343    
344        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
345        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
346    
347        # lnn
348        model = tfidfmodel.TfidfModel(self.corpus, smartirs='lnn')
349        transformed_docs = [model[docs[0]], model[docs[1]]]
350        expected_docs = [
351            [(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)],
352            [(5, 2.0), (9, 1.0), (10, 1.0)]
353        ]
354    
355        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
356        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
357    
358        # ann
359        model = tfidfmodel.TfidfModel(self.corpus, smartirs='ann')
360        transformed_docs = [model[docs[0]], model[docs[1]]]
361        expected_docs = [
362            [(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)],
363            [(5, 1.0), (9, 0.75), (10, 0.75)]
364        ]
365    
366        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
367        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
368    
369        # bnn
370        model = tfidfmodel.TfidfModel(self.corpus, smartirs='bnn')
371        transformed_docs = [model[docs[0]], model[docs[1]]]
372        expected_docs = [
373            [(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
374            [(5, 1), (9, 1), (10, 1)]
375        ]
376    
377        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
378        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
379    
380        # Lnn
381        model = tfidfmodel.TfidfModel(self.corpus, smartirs='Lnn')
382        transformed_docs = [model[docs[0]], model[docs[1]]]
383        expected_docs = [
384            [
385                (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0),
386                (7, 1.0), (8, 1.0)
387            ],
388            [
389                (5, 3.627141918134611), (9, 1.8135709590673055), (10, 1.8135709590673055)
390            ]
391        ]
392    
393>       self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
394E       AssertionError: False is not true
395
396docs       = [[(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(5, 2), (9, 1), (10, 1)]]
397expected_docs = [[(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)], [(5, 3.627141918134611), (9, 1.8135709590673055), (10, 1.8135709590673055)]]
398model      = <gensim.models.tfidfmodel.TfidfModel object at 0x07004030>
399self       = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_consistency>
400transformed_docs = [[(3, 0.73178964131150992), (4, 0.73178964131150992), (5, 0.73178964131150992), (6, 0.73178964131150992), (7, 0.73178964131150992), (8, 0.73178964131150992)], [(5, 1.2090473060448703), (9, 0.60452365302243516), (10, 0.60452365302243516)]]
401

@menshikh-iv
Copy link
Contributor

@PeteBleackley test_consistency is very long, can you first run it on your local machine & fix all problem (too many asserts in this test if you'll fix 1 assert per push - It will take a long time with CI and so on).

@PeteBleackley
Copy link
Contributor Author

How do I get test_tfidfmodel.py to import gensim from my local git repository rather than my installed copy?

@PeteBleackley
Copy link
Contributor Author

Also, can you please check these calculations of the expected values for Lnn for me?
Lnn.xlsx

@menshikh-iv
Copy link
Contributor

@PeteBleackley install gensim directly from your local repository

# cd to the repository folder
pip install -e .

@menshikh-iv
Copy link
Contributor

@PeteBleackley about check calculation - of course, but not now (probably next week, sorry). Can you share Lnn.xlsx as google-sheet please (libre-office on my Linux doesn't work correctly with your file)?

@PeteBleackley
Copy link
Contributor Author

PeteBleackley commented Apr 6, 2018

Found the problem with Lnn - a closing bracket was in the wrong place.
Now have to check the calculations for ntn - which is odd because I haven't changed that branch of the code.

@PeteBleackley
Copy link
Contributor Author

OK, the problem with ntn appears to be in the testcorpus.mm file. Can fix this by calling the constructor with corpus instead of self.corpus.
I now have two problems with test_persistence and test_persistence_compressed.

@markroxor
Copy link
Contributor

Please close this PR as per the comment.

@PeteBleackley
Copy link
Contributor Author

@markroxor the current implementation doesn't match what is described in https://en.m.wikipedia.org/wiki/SMART_Information_Retrieval_System, and crashes when *nc is used with large corpora.

…cement of a parenthesis. Updated predicted values for unit tests
@PeteBleackley
Copy link
Contributor Author

Still having problems with test_persistence and test_persistence_compressed, but I've checked in the latest version of the code so that other people can see the latest version and hopefully suggest a solution to the problem with those tests

@@ -169,103 +171,100 @@ def test_consistency(self):
docs = [corpus[1], corpus[2]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a self. to corpus here and everywhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed self.corpus to corpus because my calculations for expected_docs are based on the corpus given at the top of test_tfidfmodel.py, and self.corpus (based on test_data/test_corpus.mm) doesn't give matching results

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If self.corpus is not used anywhere I suggest removing it from the setUp method.
@menshikh-iv please suggest the corpus which should be chosen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

up to you guys, it is not so important

Copy link
Contributor

@markroxor markroxor Apr 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just one last step, since we are not using testcorpus.mm (defined in the setUp method) can you replace it with text, dictionary and corpus defined at the top of the code.
After that it is ready to merge @menshikh-iv

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still using self.corpus in the persistence tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use either self.corpus or corpus, the redundancy is polluting the code.

@markroxor
Copy link
Contributor

markroxor commented Apr 8, 2018

For testing your code locally I will recommend using tox -- gensim/test/test_tfidfmodel.py command in linux. Install it using pip pip install tox.

Copy link
Contributor

@markroxor markroxor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run flake8 gensim/ locally before pushing to check for pep8 errors.

@@ -118,6 +118,7 @@ def precompute_idfs(wglobal, dfs, total_docs):
# not strictly necessary and could be computed on the fly in TfidfModel__getitem__.
# this method is here just to speed things up a little.
return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this extra line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Various blank lines are the ghosts of prints put in during debugging.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all pep8 errors which will error out your builds. Please use flake8 gensim/ command as suggested before.

elif n_df == "p":
return np.log((1.0 * totaldocs - docfreq) / docfreq) / np.log(2)
return max(0,np.log2((1.0 * totaldocs - docfreq) / docfreq))
Copy link
Contributor

@markroxor markroxor Apr 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return max(0, np.log2((1.0 * totaldocs - docfreq) / docfreq))

@@ -295,6 +297,7 @@ def __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=utils.iden
self.id2word = id2word
self.wlocal, self.wglobal, self.normalize = wlocal, wglobal, normalize
self.num_docs, self.num_nnz, self.idfs = None, None, None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove whitespaces in this line.

@@ -371,6 +374,7 @@ def initialize(self, corpus):
numnnz += len(bow)
for termid, _ in bow:
dfs[termid] = dfs.get(termid, 0) + 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line.

@@ -39,6 +40,7 @@ class TestTfidfModel(unittest.TestCase):
def setUp(self):
self.corpus = MmCorpus(datapath('testcorpus.mm'))


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line.

]
]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove whitespaces in this line.

Copy link
Contributor

@markroxor markroxor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look here

gensim/test/test_tfidfmodel.py:115: AssertionError
__________________ TestTfidfModel.test_persistence_compressed __________________
self = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_persistence_compressed>
    def test_persistence_compressed(self):
        # Test persistence without using `smartirs`
        fname = get_tmpfile('gensim_models.tst.gz')
        model = tfidfmodel.TfidfModel(self.corpus, normalize=True)
        model.save(fname)
        model2 = tfidfmodel.TfidfModel.load(fname, mmap=None)
        self.assertTrue(model.idfs == model2.idfs)
        tstvec = [corpus[1], corpus[2]]
        self.assertTrue(np.allclose(model[tstvec[0]], model2[tstvec[0]]))
        self.assertTrue(np.allclose(model[tstvec[1]], model2[tstvec[1]]))
        self.assertTrue(np.allclose(model[[]], model2[[]]))  # try projecting an empty vector
    
        # Test persistence with using `smartirs`
        fname = get_tmpfile('gensim_models_smartirs.tst.gz')
        model = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc")
        model.save(fname)
        model2 = tfidfmodel.TfidfModel.load(fname, mmap=None)
        self.assertTrue(model.idfs == model2.idfs)
        tstvec = [corpus[1], corpus[2]]
        self.assertTrue(np.allclose(model[tstvec[0]], model2[tstvec[0]]))
        self.assertTrue(np.allclose(model[tstvec[1]], model2[tstvec[1]]))
        self.assertTrue(np.allclose(model[[]], model2[[]]))  # try projecting an empty vector
    
        # Test persistence between Gensim v3.2.0 and current compressed model.
        model3 = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc")
        model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst.bz2'))
>       self.assertTrue(model3.idfs == model4.idfs)
E       AssertionError: False is not true
fname      = '/tmp/gensim_models_smartirs.tst.gz'
model      = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1f10>
model2     = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1e50>
model3     = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1910>
model4     = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1f90>
self       = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_persistence_compressed>
tstvec     = [[(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(5, 2), (9, 1), (10, 1)]]

Tests are failing because files tfidf_model.tst.bz2 and tfidf_model.tst were pushed by me. Too maintain the persistence you need to create these files manually for the modified version.
Hint - I will suggest using model.save(fname).

Also while you are on it please use more intuitive file name while creating the model like - tfidf_smartirs_model.tst

@menshikh-iv
Copy link
Contributor

Thank you @PeteBleackley for detailed investigation, LGTM for me, do you agree @markroxor?

@menshikh-iv menshikh-iv merged commit 06f5f5c into piskvorky:develop Apr 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants