Fix SMART from TfidfModel for case when `df == "n"`. Fix #2020 #2021

PeteBleackley · 2018-04-06T10:51:22Z

Fixed bad local scaling, as described in issue #2020

…ithm

This reverts commit 6add3ba.

menshikh-iv · 2018-04-06T10:52:44Z

Nice catch, thanks for investigation @PeteBleackley 👍

CC: @markroxor

menshikh-iv · 2018-04-06T11:24:10Z

@PeteBleackley can you fix the broken test, please?

…o develop

menshikh-iv · 2018-04-06T13:00:17Z

@PeteBleackley last broken

_______________________ TestTfidfModel.test_consistency _______________________
322
323self = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_consistency>
324
325    def test_consistency(self):
326        docs = [corpus[1], corpus[2]]
327    
328        # Test if `ntc` yields the default docs.
329        model = tfidfmodel.TfidfModel(self.corpus, smartirs='ntc')
330        transformed_docs = [model[docs[0]], model[docs[1]]]
331    
332        model = tfidfmodel.TfidfModel(self.corpus)
333        expected_docs = [model[docs[0]], model[docs[1]]]
334    
335        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
336        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
337    
338        # Testing all the variations of `wlocal`
339        # nnn
340        model = tfidfmodel.TfidfModel(self.corpus, smartirs='nnn')
341        transformed_docs = [model[docs[0]], model[docs[1]]]
342        expected_docs = docs[:]
343    
344        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
345        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
346    
347        # lnn
348        model = tfidfmodel.TfidfModel(self.corpus, smartirs='lnn')
349        transformed_docs = [model[docs[0]], model[docs[1]]]
350        expected_docs = [
351            [(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)],
352            [(5, 2.0), (9, 1.0), (10, 1.0)]
353        ]
354    
355        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
356        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
357    
358        # ann
359        model = tfidfmodel.TfidfModel(self.corpus, smartirs='ann')
360        transformed_docs = [model[docs[0]], model[docs[1]]]
361        expected_docs = [
362            [(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)],
363            [(5, 1.0), (9, 0.75), (10, 0.75)]
364        ]
365    
366        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
367        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
368    
369        # bnn
370        model = tfidfmodel.TfidfModel(self.corpus, smartirs='bnn')
371        transformed_docs = [model[docs[0]], model[docs[1]]]
372        expected_docs = [
373            [(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)],
374            [(5, 1), (9, 1), (10, 1)]
375        ]
376    
377        self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
378        self.assertTrue(np.allclose(transformed_docs[1], expected_docs[1]))
379    
380        # Lnn
381        model = tfidfmodel.TfidfModel(self.corpus, smartirs='Lnn')
382        transformed_docs = [model[docs[0]], model[docs[1]]]
383        expected_docs = [
384            [
385                (3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0),
386                (7, 1.0), (8, 1.0)
387            ],
388            [
389                (5, 3.627141918134611), (9, 1.8135709590673055), (10, 1.8135709590673055)
390            ]
391        ]
392    
393>       self.assertTrue(np.allclose(transformed_docs[0], expected_docs[0]))
394E       AssertionError: False is not true
395
396docs       = [[(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(5, 2), (9, 1), (10, 1)]]
397expected_docs = [[(3, 1.0), (4, 1.0), (5, 1.0), (6, 1.0), (7, 1.0), (8, 1.0)], [(5, 3.627141918134611), (9, 1.8135709590673055), (10, 1.8135709590673055)]]
398model      = <gensim.models.tfidfmodel.TfidfModel object at 0x07004030>
399self       = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_consistency>
400transformed_docs = [[(3, 0.73178964131150992), (4, 0.73178964131150992), (5, 0.73178964131150992), (6, 0.73178964131150992), (7, 0.73178964131150992), (8, 0.73178964131150992)], [(5, 1.2090473060448703), (9, 0.60452365302243516), (10, 0.60452365302243516)]]
401

menshikh-iv · 2018-04-06T13:53:19Z

@PeteBleackley test_consistency is very long, can you first run it on your local machine & fix all problem (too many asserts in this test if you'll fix 1 assert per push - It will take a long time with CI and so on).

PeteBleackley · 2018-04-06T14:04:28Z

How do I get test_tfidfmodel.py to import gensim from my local git repository rather than my installed copy?

PeteBleackley · 2018-04-06T14:12:29Z

Also, can you please check these calculations of the expected values for Lnn for me?
Lnn.xlsx

menshikh-iv · 2018-04-06T14:12:58Z

@PeteBleackley install gensim directly from your local repository

# cd to the repository folder
pip install -e .

menshikh-iv · 2018-04-06T14:16:27Z

@PeteBleackley about check calculation - of course, but not now (probably next week, sorry). Can you share Lnn.xlsx as google-sheet please (libre-office on my Linux doesn't work correctly with your file)?

PeteBleackley · 2018-04-06T15:46:13Z

Found the problem with Lnn - a closing bracket was in the wrong place.
Now have to check the calculations for ntn - which is odd because I haven't changed that branch of the code.

PeteBleackley · 2018-04-06T22:35:38Z

OK, the problem with ntn appears to be in the testcorpus.mm file. Can fix this by calling the constructor with corpus instead of self.corpus.
I now have two problems with test_persistence and test_persistence_compressed.

markroxor · 2018-04-07T09:26:33Z

Please close this PR as per the comment.

PeteBleackley · 2018-04-07T17:38:13Z

@markroxor the current implementation doesn't match what is described in https://en.m.wikipedia.org/wiki/SMART_Information_Retrieval_System, and crashes when *nc is used with large corpora.

…cement of a parenthesis. Updated predicted values for unit tests

PeteBleackley · 2018-04-07T22:04:07Z

Still having problems with test_persistence and test_persistence_compressed, but I've checked in the latest version of the code so that other people can see the latest version and hopefully suggest a solution to the problem with those tests

markroxor · 2018-04-08T14:50:12Z

gensim/test/test_tfidfmodel.py

@@ -169,103 +171,100 @@ def test_consistency(self):
        docs = [corpus[1], corpus[2]]


Please add a self. to corpus here and everywhere.

I changed self.corpus to corpus because my calculations for expected_docs are based on the corpus given at the top of test_tfidfmodel.py, and self.corpus (based on test_data/test_corpus.mm) doesn't give matching results

If self.corpus is not used anywhere I suggest removing it from the setUp method.
@menshikh-iv please suggest the corpus which should be chosen.

up to you guys, it is not so important

Looks good. Just one last step, since we are not using testcorpus.mm (defined in the setUp method) can you replace it with text, dictionary and corpus defined at the top of the code.
After that it is ready to merge @menshikh-iv

ping @PeteBleackley

I'm still using self.corpus in the persistence tests.

Please use either self.corpus or corpus, the redundancy is polluting the code.

markroxor · 2018-04-08T14:53:44Z

For testing your code locally I will recommend using tox -- gensim/test/test_tfidfmodel.py command in linux. Install it using pip pip install tox.

markroxor

Please run flake8 gensim/ locally before pushing to check for pep8 errors.

markroxor · 2018-04-08T14:54:37Z

gensim/models/tfidfmodel.py

@@ -118,6 +118,7 @@ def precompute_idfs(wglobal, dfs, total_docs):
    # not strictly necessary and could be computed on the fly in TfidfModel__getitem__.
    # this method is here just to speed things up a little.
    return {termid: wglobal(df, total_docs) for termid, df in iteritems(dfs)}
+


remove this extra line.

Various blank lines are the ghosts of prints put in during debugging.

These are all pep8 errors which will error out your builds. Please use flake8 gensim/ command as suggested before.

markroxor · 2018-04-08T14:56:26Z

gensim/models/tfidfmodel.py

    elif n_df == "p":
-        return np.log((1.0 * totaldocs - docfreq) / docfreq) / np.log(2)
+        return max(0,np.log2((1.0 * totaldocs - docfreq) / docfreq))


return max(0, np.log2((1.0 * totaldocs - docfreq) / docfreq))

markroxor · 2018-04-08T14:57:01Z

gensim/models/tfidfmodel.py

@@ -295,6 +297,7 @@ def __init__(self, corpus=None, id2word=None, dictionary=None, wlocal=utils.iden
        self.id2word = id2word
        self.wlocal, self.wglobal, self.normalize = wlocal, wglobal, normalize
        self.num_docs, self.num_nnz, self.idfs = None, None, None
+


Remove whitespaces in this line.

markroxor · 2018-04-08T14:57:17Z

gensim/models/tfidfmodel.py

@@ -371,6 +374,7 @@ def initialize(self, corpus):
            numnnz += len(bow)
            for termid, _ in bow:
                dfs[termid] = dfs.get(termid, 0) + 1
+


Remove this line.

markroxor · 2018-04-08T14:58:16Z

gensim/test/test_tfidfmodel.py

@@ -39,6 +40,7 @@ class TestTfidfModel(unittest.TestCase):
    def setUp(self):
        self.corpus = MmCorpus(datapath('testcorpus.mm'))

+


Remove this line.

markroxor · 2018-04-08T14:58:32Z

gensim/test/test_tfidfmodel.py

Remove whitespaces in this line.

markroxor

Please take a look here

gensim/test/test_tfidfmodel.py:115: AssertionError
__________________ TestTfidfModel.test_persistence_compressed __________________
self = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_persistence_compressed>
    def test_persistence_compressed(self):
        # Test persistence without using `smartirs`
        fname = get_tmpfile('gensim_models.tst.gz')
        model = tfidfmodel.TfidfModel(self.corpus, normalize=True)
        model.save(fname)
        model2 = tfidfmodel.TfidfModel.load(fname, mmap=None)
        self.assertTrue(model.idfs == model2.idfs)
        tstvec = [corpus[1], corpus[2]]
        self.assertTrue(np.allclose(model[tstvec[0]], model2[tstvec[0]]))
        self.assertTrue(np.allclose(model[tstvec[1]], model2[tstvec[1]]))
        self.assertTrue(np.allclose(model[[]], model2[[]]))  # try projecting an empty vector
    
        # Test persistence with using `smartirs`
        fname = get_tmpfile('gensim_models_smartirs.tst.gz')
        model = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc")
        model.save(fname)
        model2 = tfidfmodel.TfidfModel.load(fname, mmap=None)
        self.assertTrue(model.idfs == model2.idfs)
        tstvec = [corpus[1], corpus[2]]
        self.assertTrue(np.allclose(model[tstvec[0]], model2[tstvec[0]]))
        self.assertTrue(np.allclose(model[tstvec[1]], model2[tstvec[1]]))
        self.assertTrue(np.allclose(model[[]], model2[[]]))  # try projecting an empty vector
    
        # Test persistence between Gensim v3.2.0 and current compressed model.
        model3 = tfidfmodel.TfidfModel(self.corpus, smartirs="ntc")
        model4 = tfidfmodel.TfidfModel.load(datapath('tfidf_model.tst.bz2'))
>       self.assertTrue(model3.idfs == model4.idfs)
E       AssertionError: False is not true
fname      = '/tmp/gensim_models_smartirs.tst.gz'
model      = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1f10>
model2     = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1e50>
model3     = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1910>
model4     = <gensim.models.tfidfmodel.TfidfModel object at 0x7fd8c44e1f90>
self       = <gensim.test.test_tfidfmodel.TestTfidfModel testMethod=test_persistence_compressed>
tstvec     = [[(3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)], [(5, 2), (9, 1), (10, 1)]]

Tests are failing because files tfidf_model.tst.bz2 and tfidf_model.tst were pushed by me. Too maintain the persistence you need to create these files manually for the modified version.
Hint - I will suggest using model.save(fname).

Also while you are on it please use more intuitive file name while creating the model like - tfidf_smartirs_model.tst

menshikh-iv · 2018-04-09T05:47:24Z

Thank you @PeteBleackley for detailed investigation, LGTM for me, do you agree @markroxor?

PeteBleackley and others added 24 commits November 23, 2017 15:28

Added Montemurro and Zanette's entropy-based keyword extraction algor…

3160dea

…ithm

Improved Docstrings

0550651

Fixed numerical bugs due to zero frequencies

a072f93

Merge branch 'develop' into develop

02428fa

Coding style changes, test and tutorial

c8a3792

Trying to fix a merge conflict

8a8264e

I hate git

e763f3c

Summarization tutorial

195c3c0

Fixed some failing tests

5b9a3ad

Tests, demo, nan_to_num and a few last flake8 issues

4c2d8de

Further flake8 issues

d9c290a

Further flake8 issues

8809e5a

Removed Jupyter checkpoint

a97fd82

Removed trailing whitespace

0d4e31c

Trailing whitespace

4d18223

Speed up test and add comment to explain threshold value

dc42cee

Flake8 again

fdddf02

rename vars + style fixes

86db65c

fix operation order

28ae7cb

Update docs with Montemurro and Zanette's algorithm

6add3ba

Revert "Update docs with Montemurro and Zanette's algorithm"

86590bb

This reverts commit 6add3ba.

Merge remote-tracking branch 'upstream/master' into develop

eb65041

Merge remote-tracking branch 'upstream/develop' into develop

7711451

Fixed bug in TfidfModel, as described in Issue #2020

bdc1a6d

menshikh-iv changed the title ~~Fix bug in TfidfModel~~ Fix SMART from TfidfModel for case when df == "n". Fix #2020 Apr 6, 2018

Fix return type

590b52a

PeteBleackley added 2 commits April 6, 2018 13:31

Updated unit tests for TfidfModel

3b64a78

Merge branch 'develop' of https://github.com/PeteBleackley/gensim int…

3588b88

…o develop

Updated unit tests for TfidfModel

5f9aa93

Changed log(x)/log(2) to log2(x) since this is clearer. Fixed the pla…

fdab1e8

…cement of a parenthesis. Updated predicted values for unit tests

markroxor suggested changes Apr 8, 2018

View reviewed changes

PeteBleackley added 2 commits April 8, 2018 20:40

Fixed persistence tests

d2f3ea8

Flake 8

5aff83b

menshikh-iv merged commit 06f5f5c into piskvorky:develop Apr 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix SMART from TfidfModel for case when `df == "n"`. Fix #2020 #2021

Fix SMART from TfidfModel for case when `df == "n"`. Fix #2020 #2021

PeteBleackley commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

PeteBleackley commented Apr 6, 2018

PeteBleackley commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

PeteBleackley commented Apr 6, 2018 •

edited

Loading

PeteBleackley commented Apr 6, 2018

markroxor commented Apr 7, 2018

PeteBleackley commented Apr 7, 2018

PeteBleackley commented Apr 7, 2018

markroxor Apr 8, 2018

PeteBleackley Apr 8, 2018

markroxor Apr 8, 2018

menshikh-iv Apr 9, 2018

markroxor Apr 9, 2018 •

edited

Loading

markroxor Apr 10, 2018

PeteBleackley Apr 10, 2018

markroxor Apr 10, 2018

markroxor commented Apr 8, 2018 •

edited

Loading

markroxor left a comment

markroxor Apr 8, 2018

PeteBleackley Apr 8, 2018

markroxor Apr 8, 2018

markroxor Apr 8, 2018 •

edited by piskvorky

Loading

markroxor Apr 8, 2018

markroxor Apr 8, 2018

markroxor Apr 8, 2018

markroxor Apr 8, 2018

markroxor left a comment •

edited

Loading

menshikh-iv commented Apr 9, 2018

		@@ -169,103 +171,100 @@ def test_consistency(self):
		docs = [corpus[1], corpus[2]]

		@@ -39,6 +40,7 @@ class TestTfidfModel(unittest.TestCase):
		def setUp(self):
		self.corpus = MmCorpus(datapath('testcorpus.mm'))

Fix SMART from TfidfModel for case when df == "n". Fix #2020 #2021

Fix SMART from TfidfModel for case when df == "n". Fix #2020 #2021

Conversation

PeteBleackley commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

PeteBleackley commented Apr 6, 2018

PeteBleackley commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

menshikh-iv commented Apr 6, 2018

PeteBleackley commented Apr 6, 2018 • edited Loading

PeteBleackley commented Apr 6, 2018

markroxor commented Apr 7, 2018

PeteBleackley commented Apr 7, 2018

PeteBleackley commented Apr 7, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

markroxor Apr 9, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

markroxor commented Apr 8, 2018 • edited Loading

markroxor left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

markroxor Apr 8, 2018 • edited by piskvorky Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

markroxor left a comment • edited Loading

Choose a reason for hiding this comment

menshikh-iv commented Apr 9, 2018

Fix SMART from TfidfModel for case when `df == "n"`. Fix #2020 #2021

Fix SMART from TfidfModel for case when `df == "n"`. Fix #2020 #2021

PeteBleackley commented Apr 6, 2018 •

edited

Loading

markroxor Apr 9, 2018 •

edited

Loading

markroxor commented Apr 8, 2018 •

edited

Loading

markroxor Apr 8, 2018 •

edited by piskvorky

Loading

markroxor left a comment •

edited

Loading