Add inference for new unseen author for `gensim.models.AuthorTopicModel` #1766

Stamenov · 2017-12-06T20:38:11Z

Add function get_new_author_topics() to infer topics distribution of a new unseen author, by passing a corpus - list of documents in "bag of words" format.

Stamenov · 2017-12-06T20:50:16Z

Still some open TODOs:

how should the rho() function be defined inside the get_new_author_topics()
unit test

menshikh-iv

Thanks for the contribution @Stamenov, please continue your work according to your plan and my comments.
Also, please have a look at PEP8 problems https://travis-ci.org/RaRe-Technologies/gensim/jobs/312612778#L516

@olavurmortensen can you review PR too?

menshikh-iv · 2017-12-07T08:02:09Z

gensim/models/atmodel.py

@@ -882,6 +883,98 @@ def get_document_topics(self, word_id, minimum_probability=None):

        raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.')

+    def get_new_author_topics(self, corpus, minimum_probability=None):
+        """


Please use numpy-style docstrings: http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html, https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
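For reference, a numpy-style docstring for the new method might look roughly like the sketch below. The parameter descriptions are inferred from this thread (a bag-of-words corpus in, a list of (topic_id, probability) pairs out); the function body is a stub, not the PR's implementation.

```python
def get_new_author_topics(corpus, minimum_probability=None):
    """Infer the topic distribution of a new, unseen author.

    Parameters
    ----------
    corpus : iterable of list of (int, float)
        Documents in bag-of-words format, all attributed to the new author.
    minimum_probability : float, optional
        Topics with a probability below this value are omitted.

    Returns
    -------
    list of (int, float)
        Topic distribution for the given `corpus`.

    """
    # Stub only: the actual inference is what this PR implements.
    raise NotImplementedError
```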

menshikh-iv · 2017-12-07T08:02:52Z

gensim/models/atmodel.py

+        except Exception as e:
+            #something went wrong! Rollback temporary changes in object and log
+            rollback_new_author_chages()
+            logging.error(traceback.format_exc())


no need to use traceback here, use logging.exception(e) instead
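A minimal sketch of the suggestion, using only the standard library. `risky_inference` and the logger name are hypothetical stand-ins; the point is that `exception()` called inside an `except` block already appends the traceback, so `traceback.format_exc()` is not needed.

```python
import logging

logger = logging.getLogger("atmodel_sketch")


def risky_inference(corpus):
    """Hypothetical stand-in for the inference call in the PR."""
    if not corpus:
        raise ValueError("empty corpus")
    return len(corpus)


def infer_or_log(corpus):
    try:
        return risky_inference(corpus)
    except ValueError:
        # Logs at ERROR level and appends the traceback of the
        # exception that is currently being handled.
        logger.exception("inference failed")
        return None
```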

menshikh-iv · 2017-12-07T08:03:25Z

gensim/models/atmodel.py

+                corpus, self.author2doc, self.doc2author, rho(),
+                collect_sstats=False, chunk_doc_idx=corpus_doc_idx
+            )
+        except Exception as e:


Too broad clause, please specify the concrete type(s) of exception.

Stamenov · 2017-12-07T10:36:04Z

Of course I will continue my work here. Thanks for quick review @menshikh-iv . I have already addressed your remarks in my code.

olavurmortensen · 2017-12-07T14:38:34Z

Great work @Stamenov. At a glance, it looks good to me, although I do have some comments.

Let me see if I understand the logic of the code, correct me if I'm wrong. First of all, you are inferring gamma for a collection of documents, assuming that they are attributed to a single author. So what you do is:

Add documents to self.corpus.
Add a single temporary author to author dictionaries.
Randomly initialize gamma, as per usual.
Run self.inference to obtain gammas for the documents.
Remove authors and documents from model.

This is what you do, right?
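The steps listed above can be sketched with a toy class. `ToyAuthorModel` and its attributes are hypothetical and only model the bookkeeping; the real `self.inference` call is replaced by a simple normalisation, and the step of extending `self.corpus` is left out.

```python
import numpy as np


class ToyAuthorModel:
    """Hypothetical stand-in for AuthorTopicModel; only bookkeeping is modelled."""

    def __init__(self, num_topics, author2doc, total_docs):
        self.num_topics = num_topics
        self.author2doc = dict(author2doc)
        self.total_docs = total_docs
        self.gamma = np.ones((len(self.author2doc), num_topics))
        self.random_state = np.random.RandomState(0)

    def infer_new_author(self, corpus, name="placeholder_name"):
        num_authors_before = len(self.author2doc)
        # 1. Register a temporary author who owns every input document.
        start = self.total_docs
        self.author2doc[name] = list(range(start, start + len(corpus)))
        # 2. Randomly initialise one extra gamma row for that author.
        gamma_new = self.random_state.gamma(100.0, 1.0 / 100.0, (1, self.num_topics))
        self.gamma = np.vstack([self.gamma, gamma_new])
        # 3. Placeholder for the real inference step: just normalise the row.
        topics = self.gamma[-1] / self.gamma[-1].sum()
        # 4. Remove the temporary author so the model is left unchanged.
        del self.author2doc[name]
        self.gamma = self.gamma[:num_authors_before]
        return topics
```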

I don't think adding the documents to self.corpus is necessary, because self.inference only uses the documents you pass to it. Calling self.extend_corpus is also very slow, unfortunately, so always avoid this if possible.

As a side note, the chunk_doc_idx argument to self.inference has to do with getting an author list from self.doc2author[doc_no], it doesn't have to do with what documents are accessed.

rho is a bit tricky, as you mention. The point of rho is to interpolate between a previously computed gamma and the new one, as in line 443 in the code. I'm not quite sure how to use it here just yet.
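As an illustration of that interpolation, here is a minimal sketch. The function name and the example values are made up; only the blending formula reflects the description above.

```python
import numpy as np


def interpolate_gamma(gamma_prev, gamma_local, rho):
    """Blend a previous gamma with a freshly computed one, with weight rho in [0, 1]."""
    return (1.0 - rho) * gamma_prev + rho * gamma_local


gamma_prev = np.array([1.0, 1.0])   # previously computed gamma (made-up values)
gamma_local = np.array([3.0, 0.5])  # gamma from the current update (made-up values)
blended = interpolate_gamma(gamma_prev, gamma_local, rho=0.5)
```

With rho close to 1 the new estimate dominates; with rho close to 0 the previous gamma is mostly kept.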

It makes sense to make it in a single pass, since you're not updating the lambdas, and only one author is being updated. So it may be a good idea to make sure the number of iterations (over each document) is high enough. Maybe let self.inference accept iterations as an (optional) argument.

Finally, I must note that it isn't obvious what inference on held-out data should be in the author-topic model. I think that is because observations aren't independent. This method doesn't take into account that many authors may contribute to a single document, which is really the strength of the AT model. That being said, I think this way is the best way to do it.

Sorry for the wall of text 😝 Keep up the good work!

olavurmortensen · 2017-12-07T14:40:07Z

gensim/models/atmodel.py

+        new_author_name = "placeholder_name"
+
+        # Add new documents in corpus to self.corpus.
+        self.extend_corpus(corpus)


Remove line 939. self.inference uses the corpus supplied as input, and calling self.extend_corpus is very slow.

menshikh-iv · 2017-12-25T13:52:43Z

ping @Stamenov, what's the status here?

Stamenov · 2017-12-25T13:56:08Z

Sorry, Christmas activities. Will try to push the recommended changes these days.

Stamenov · 2018-01-02T10:37:42Z

https://travis-ci.org/RaRe-Technologies/gensim/jobs/324067238
Can someone help me out with this build error? Tox package?
Thanks.

piskvorky · 2018-01-02T14:36:56Z

@Stamenov thanks! @menshikh-iv will help out with the build error once he's back from holiday (next week) :)

Stamenov · 2018-01-02T17:40:14Z

Great, 10x for the quick response.

menshikh-iv · 2018-01-08T09:41:45Z

@Stamenov this isn't a tox problem (pip couldn't install the tox package because the network disappeared in Travis). I re-ran the job, let's wait for the result.

menshikh-iv

Please add tests for this functionality

CC @olavurmortensen, can you review this again?

menshikh-iv · 2018-01-08T09:42:25Z

gensim/models/atmodel.py

@@ -882,6 +882,84 @@ def get_document_topics(self, word_id, minimum_probability=None):

        raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.')

+    def get_new_author_topics(self, corpus, minimum_probability=None):
+        """Infers topics for new author.


please use numpy-style docstrings: http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html and https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt

menshikh-iv · 2018-01-08T09:43:21Z

gensim/models/atmodel.py

@@ -882,6 +882,84 @@ def get_document_topics(self, word_id, minimum_probability=None):

        raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.')

+    def get_new_author_topics(self, corpus, minimum_probability=None):


What's the reason for minimum_probability=None here (instead of 1e-8)?

@menshikh-iv It is the same in get_document_topics in LDA (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L987), and get_term_topics as well.

so, I should leave it like this, right?

@Stamenov yes, leave as is
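A sketch of how a `None` default can still end up with a small positive floor, assuming the method follows the LDA convention referenced above. `filter_topics` and `model_default` are hypothetical names; `model_default` stands in for a model-level setting.

```python
def filter_topics(topic_probs, minimum_probability=None, model_default=0.01):
    """Keep (topic_id, probability) pairs at or above the effective threshold."""
    if minimum_probability is None:
        # Fall back to the model-level setting (hypothetical `model_default`).
        minimum_probability = model_default
    # Never allow a zero or negative threshold.
    minimum_probability = max(minimum_probability, 1e-8)
    return [(i, p) for i, p in enumerate(topic_probs) if p >= minimum_probability]
```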

menshikh-iv · 2018-01-08T09:57:12Z

gensim/models/atmodel.py

+            len_input_corpus = sum(1 for _ in corpus)
+        if len_input_corpus == 0:
+            logger.warning("AuthorTopicModel.get_new_author_topics() called with an empty corpus")
+            return


Why is an "empty return" needed? It may be better to raise an exception explicitly (for all empty returns).
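One way to do this, as a sketch: count the documents (falling back to iteration for corpora without `__len__`, as the diff does) and raise a `ValueError` when there are none. The helper names are hypothetical, and the corpus is assumed to be re-iterable.

```python
def corpus_length(corpus):
    """Return the number of documents, also for corpora without __len__."""
    try:
        return len(corpus)
    except TypeError:
        # Streamed corpora may not define __len__; count by iterating once.
        return sum(1 for _ in corpus)


def check_corpus(corpus):
    """Raise instead of returning None when there is nothing to infer from."""
    n_docs = corpus_length(corpus)
    if n_docs == 0:
        raise ValueError("get_new_author_topics() called with an empty corpus")
    return n_docs
```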

olavurmortensen · 2018-01-08T15:47:22Z

gensim/models/atmodel.py

+        if len_input_corpus == 0:
+            logger.warning("AuthorTopicModel.get_new_author_topics() called with an empty corpus")
+            return
+        if not len_input_corpus < self.chunksize:


What is the reason for this limitation? I don't see it implemented anywhere else.

olavurmortensen · 2018-01-08T16:20:45Z

gensim/models/atmodel.py

+            return
+
+        new_author_name = "placeholder_name"
+        corpus_doc_idx = list(range(0, len_input_corpus))


Instead set corpus_doc_idx to list(range(self.total_docs, self.total_docs + len_input_corpus)) (see comment below).

olavurmortensen · 2018-01-08T16:21:48Z

gensim/models/atmodel.py

+        # Add new author in author2doc and doc into doc2author.
+        self.author2doc[new_author_name] = corpus_doc_idx
+        for new_doc_id in corpus_doc_idx:
+            self.doc2author[new_doc_id] = [new_author_name]


Because of how you set corpus_doc_idx, you're overwriting self.doc2author[0] (0 through len_corpus_idx) here.
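A small illustration of the collision, with made-up data: ids starting at 0 clash with documents the model already knows, while ids starting at the current document count do not.

```python
# Made-up state: three documents already known to the model.
doc2author = {0: ["john"], 1: ["sally"], 2: ["sally"]}
total_docs = len(doc2author)
n_new = 2  # number of documents in the input corpus

# Starting at 0 would collide with the ids of existing documents.
colliding = [i for i in range(0, n_new) if i in doc2author]

# Starting at total_docs yields ids that no existing document uses.
corpus_doc_idx = list(range(total_docs, total_docs + n_new))
for doc_id in corpus_doc_idx:
    doc2author[doc_id] = ["placeholder_name"]
```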

olavurmortensen · 2018-01-08T16:29:16Z

@Stamenov I made some comments in the code.

I'm afraid I've written the inference method in a sort of clumsy way, which makes this more cumbersome than it should be, sorry about that. The code really could use a refactor, but I don't have the time to do that.

menshikh-iv · 2018-01-15T13:28:32Z

ping @Stamenov, when do you plan to finish this PR?

Stamenov · 2018-01-15T13:32:24Z

I am doing this in parallel with my thesis, so mid-February.

menshikh-iv · 2018-02-12T05:27:29Z

@Stamenov what else do you need to complete this PR?

Stamenov · 2018-02-12T10:26:07Z

Hi, I have done some fixing around the builds.
Also, in order to satisfy the Travis CI build I had to change the docstring for get_author_topics(), which was not in the scope of my PR. Is this OK? Can you please look into it?
All checks but one are now green. The failing one is ci/circleci, and it seems to be ldamodel-related, so I could use some help there.
Furthermore, I need some test cases to implement, so if you have some off the top of your head, I would appreciate them.
I am also working on an accompanying evaluation report (paper) for authorship prediction on a couple of datasets, using this inference function, if this is of any interest for Gensim.

Stamenov · 2018-02-23T12:28:13Z

gensim/models/atmodel.py

            return pow(self.offset + 1 + 1, -self.decay)
+=======
+            return pow(self.offset + 1, -self.decay)
+>>>>>>> Stashed changes


Stamenov · 2018-02-23T13:39:39Z

gensim/models/atmodel.py

+            Topic distribution for the given `corpus`.
+
+        """
+        # TODO: how should this function look like for get_new_author_topics?


As I mentioned, I am iterating over a few versions of the function and will commit the best-performing one.
It was already discussed here that it is not clear how we should implement this one.

Stamenov · 2018-02-23T13:41:33Z

gensim/models/atmodel.py

+        except ValueError as e:
+            # Something went wrong! Rollback temporary changes in object and log
+            rollback_new_author_chages()
+            logging.exception(e)


It is critical that this be caught: if it is not, we won't call rollback_new_author_changes() and the state of the model will stay altered by the temporary variables.

Stamenov · 2018-02-23T13:44:32Z

gensim/models/atmodel.py

+        except ValueError as e:
+            # Something went wrong! Rollback temporary changes in object and log
+            rollback_new_author_chages()
+            logging.exception(e)


I would much rather catch any exception, roll back the changes and log the exception.
The warning can be an addition to that.
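One way to reconcile both concerns (guaranteed rollback, and not hiding errors) is a `try`/`finally` that always undoes the temporary change, combined with a re-raise after logging. This is a sketch with hypothetical names, not necessarily what was merged.

```python
import logging

logger = logging.getLogger("atmodel_sketch")


def with_temporary_author(author2doc, name, doc_ids, infer):
    """Run `infer` with a temporary author registered, then always undo the change."""
    author2doc[name] = doc_ids
    try:
        return infer(author2doc)
    except Exception:
        # Log with traceback, then re-raise so the caller still sees the error.
        logger.exception("inference for the temporary author failed")
        raise
    finally:
        # Runs on success and on failure, so the mapping is always restored.
        del author2doc[name]
```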

Stamenov · 2018-02-23T14:09:31Z

gensim/test/test_atmodel.py

@@ -450,6 +450,31 @@ def testTermTopics(self):
            self.assertTrue(isinstance(topic_no, int))
            self.assertTrue(isinstance(probability, float))

+    def testNewAuthorTopics(self):


Besides the sanity check, I tried another test case, where I infer the topics for a new author using the exact same documents from the corpus that an existing author is already assigned to. We can expect the topic distributions to be very similar, and they are, but not close enough for a plain np.allclose(..,..). Example:

In [2]: model.get_new_author_topics(corpus=corpus[1:3])
Out[2]: [(0, 0.914887958236441), (1, 0.08511204176355897)]

In [3]: model.get_author_topics(author_name="sally")
Out[3]: [(0, 0.9290420023752036), (1, 0.07095799762479651)]

Of course, something like this would work:
sally_topics = model.get_author_topics(author_name="sally")
new_author_topics = model.get_new_author_topics(corpus=corpus[1:3])
self.assertTrue(np.allclose(new_author_topics, sally_topics, atol=1e-1))

@menshikh-iv do you approve of this test?
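To make the tolerance question concrete, the values from the session above can be compared directly. This sketch only uses the numbers quoted in this comment; the variable names are illustrative.

```python
import numpy as np

# Values copied from the interactive session quoted above.
new_author_topics = [(0, 0.914887958236441), (1, 0.08511204176355897)]
sally_topics = [(0, 0.9290420023752036), (1, 0.07095799762479651)]

# Default tolerances versus the absolute tolerance proposed for the test.
strict = np.allclose(new_author_topics, sally_topics)
loose = np.allclose(new_author_topics, sally_topics, atol=1e-1)

# Largest element-wise gap between the two distributions.
max_diff = np.abs(np.array(new_author_topics) - np.array(sally_topics)).max()
```

Here `strict` is False and `loose` is True, since the largest gap is roughly 0.014.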

Stamenov · 2018-02-26T22:56:50Z

@menshikh-iv I have pushed the test and exception improvement, also the rho() function which works best for me. Hope it is satisfactory.

menshikh-iv · 2018-02-27T04:44:23Z

Great work @Stamenov 👍 LGTM for me (only fix PEP8 mistakes please)

@olavurmortensen ping, the PR looks ready, do you approve?

olavurmortensen · 2018-03-02T10:11:09Z

@Stamenov @menshikh-iv Sorry for not responding, I was sick for a while. I'll take a look soon.

olavurmortensen · 2018-03-05T10:08:52Z

gensim/models/atmodel.py

+            raise ValueError("AuthorTopicModel.get_new_author_topics() called with an empty corpus")
+
+        new_author_name = "placeholder_name"
+        corpus_doc_idx = list(range(self.total_docs, self.total_docs + len_input_corpus))


Add a comment explaining this line. For example: "indexes representing the documents in the input corpus".

olavurmortensen · 2018-03-05T10:09:43Z

gensim/models/atmodel.py

+
+        # Add the new placeholder author to author2id/id2author dictionaries.
+        num_new_authors = 1
+        author_id = 0


I think this looks confusing. Just say author_id = self.num_authors, or something like that.

olavurmortensen · 2018-03-05T10:10:38Z

gensim/models/atmodel.py

+        if len_input_corpus == 0:
+            raise ValueError("AuthorTopicModel.get_new_author_topics() called with an empty corpus")
+
+        new_author_name = "placeholder_name"


Just for completeness, check that this author isn't already in the dictionary.
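A possible shape for that check, as a sketch: pick a placeholder name that is guaranteed not to be a key of the author dictionary. The helper name and the suffix scheme are hypothetical.

```python
def unique_author_name(author2id, base="placeholder_name"):
    """Return `base`, or `base` with a numeric suffix, so it is not already a key."""
    name, suffix = base, 0
    while name in author2id:
        suffix += 1
        name = "{}_{}".format(base, suffix)
    return name
```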

olavurmortensen · 2018-03-05T10:13:12Z

gensim/models/atmodel.py

+        self.state.gamma = np.vstack([self.state.gamma, gamma_new])
+
+        # Should not record the sstats, as we are goint to delete the new author after calculated.
+        try:


Does this try block throw an exception if there is a problem?

Yes, it does, for example when the corpus is invalid (a list of something else), or if the inference is somehow interrupted, or maybe something else. We must make sure to roll back the temporary changes in any case. It makes sense to show the exception to the user, but leave the model state untouched.

olavurmortensen · 2018-03-05T10:19:23Z

I made some comments in the code. I think it would also be nice to see some example use-cases on real data, hopefully with good results, to see if this actually works.

Stamenov · 2018-03-05T11:03:58Z

@olavurmortensen I will post a notebook, where I train and predict authors on the Reuters 50x50 dataset soon!

menshikh-iv · 2018-03-12T07:50:13Z

@Stamenov how is it going? When do you plan to finish the PR (it looks almost done)?

Stamenov · 2018-03-12T11:04:57Z

Hi, the PR is ready, as far as code and functionality is concerned.
The only thing left is the notebook, which I prepared yesterday, but I was getting very strange and inconsistent results. I have to take my time to debug it properly.
I can commit a preliminary version of it though.
I would say this PR is closable and I can open up a new one for the notebook?

menshikh-iv · 2018-03-12T11:29:48Z

@Stamenov better to add the notebook in the current PR (these are closely related changes); we'll merge the new functionality and the notebook with an example together as one PR.

Also, don't forget to resolve cosmetic issues based on @olavurmortensen comments.

menshikh-iv · 2018-03-22T06:14:51Z

@Stamenov great, code looks good to merge 👍, the last thing - please add an example to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb

Stamenov · 2018-03-22T11:39:11Z

Thanks. I am working on it. I believe it should be done this week.

menshikh-iv · 2018-03-26T07:56:02Z

Hello @Stamenov

You uploaded some HTML (instead of ipynb); look inside your file.
Please add a new section to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb (instead of adding new notebook)

Stamenov · 2018-03-26T10:36:47Z

Hi @menshikh-iv ,
I see your point about adding a new section instead of a new notebook, but I think it would be very difficult to keep it clean. There would be a lot of code duplicated from atmodel_tutorial.ipynb, and the notebook itself would become very long. Also, I use a different dataset, not the NIPS one.

menshikh-iv · 2018-03-26T11:21:54Z

Thanks for your work @Stamenov 👍 nice feature!

Thanks @olavurmortensen for review 👍

jonaschn

I am interested in an appropriate value of rhot for visualizing the AT model in bmabey/pyLDAvis#161

jonaschn · 2021-03-11T09:53:06Z

gensim/models/atmodel.py

+            Topic distribution for the given `corpus`.
+
+        """
+        def rho():


@Stamenov @olavurmortensen Please document the rationale behind this definition of rho.
Why didn't you write return pow(self.offset + 2 , -self.decay)?
In an earlier version this was return pow(self.offset + 1, -self.decay).
Could you provide any hint?

In a different place you explain rho as following:
https://github.com/RaRe-Technologies/gensim/blob/75dce4b50174b5a1afb37b163b903f7e2af903b0/gensim/models/atmodel.py#L812-L816
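For orientation only, the two variants mentioned above can be evaluated numerically. This does not explain the rationale; it just shows the size of the difference. The values offset=1.0 and decay=0.5 are assumed here (they are believed to be the constructor defaults, but check your gensim version), and the helper name is hypothetical.

```python
def rho(offset, decay, step):
    """Evaluate pow(offset + step, -decay), the form discussed in this thread."""
    return pow(offset + step, -decay)


# Assumed values for illustration; verify against your gensim version.
offset, decay = 1.0, 0.5

rho_earlier = rho(offset, decay, step=1)  # pow(offset + 1, -decay)
rho_merged = rho(offset, decay, step=2)   # pow(offset + 1 + 1, -decay)
```

Under these assumptions the earlier variant gives about 0.707 and the later one about 0.577, so the later definition gives slightly less weight to the newly computed gamma.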

Add function get_new_author_topics() to infer topics distribution for…

f68dfe9

… new unseen author.

menshikh-iv suggested changes Dec 7, 2017

Stamenov added 2 commits December 7, 2017 11:23

Fixes for pep8 compliance. Concrete exception handling.

2843d1b

small docstring fix

92ef759

olavurmortensen suggested changes Dec 7, 2017

Stamenov added 2 commits December 28, 2017 12:49

dont extend self.corpus

ee51f71

pep8 fixes

2de1b34

menshikh-iv suggested changes Jan 8, 2018

olavurmortensen reviewed Jan 8, 2018

Stamenov added 3 commits February 10, 2018 15:57

remove chunksize limitation and small fix

cbe9049

pep8 fix

a94c0b1

try pep8 fix

783a585

menshikh-iv added 2 commits February 12, 2018 16:45

Merge remote-tracking branch 'upstream/develop' into feature/inferenc…

3849711

…e_new_author

convert docstring to numpy-style

a85563b

Stamenov commented Feb 23, 2018

Stamenov added 2 commits February 26, 2018 23:53

fix exception catching and rollback

0a3eedb

add test for topic similarity

f587596

fix pep8

1bc929c

olavurmortensen reviewed Mar 5, 2018

menshikh-iv changed the title ~~Feature: New Author Inference~~ Add inference for new unseen author for gensim.models.AuthorTopicModel Mar 9, 2018

Stamenov added 2 commits March 21, 2018 17:43

some last cosmetic changes

6085b90

author id fix

34f5222

add tutorial for authorship prediction

6d7141d

add ipynb

c525845

menshikh-iv merged commit 2e08f4d into piskvorky:develop Mar 26, 2018

menshikh-iv mentioned this pull request Apr 23, 2018

Enabling inference on held-out data in the author-topic model #1166

Closed

jonaschn reviewed Mar 11, 2021

jonaschn mentioned this pull request Mar 11, 2021

Add gensim Author-Topic-Model support bmabey/pyLDAvis#161

Closed
