-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add inference for new unseen author for gensim.models.AuthorTopicModel
#1766
Add inference for new unseen author for gensim.models.AuthorTopicModel
#1766
Conversation
… new unseen author.
Sill some open TODOs:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contribution @Stamenov, please continue your work according to your plan and my comments.
Also, please have a look at PEP8 problems https://travis-ci.org/RaRe-Technologies/gensim/jobs/312612778#L516
@olavurmortensen can you review PR too?
gensim/models/atmodel.py
Outdated
@@ -882,6 +883,98 @@ def get_document_topics(self, word_id, minimum_probability=None): | |||
|
|||
raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.') | |||
|
|||
def get_new_author_topics(self, corpus, minimum_probability=None): | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gensim/models/atmodel.py
Outdated
except Exception as e: | ||
#something went wrong! Rollback temporary changes in object and log | ||
rollback_new_author_chages() | ||
logging.error(traceback.format_exc()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to use traceback
here, use logging.exception(e)
instead
gensim/models/atmodel.py
Outdated
corpus, self.author2doc, self.doc2author, rho(), | ||
collect_sstats=False, chunk_doc_idx=corpus_doc_idx | ||
) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too broad clause, please specify the concrete type(s) of exception.
Of course I will continue my work here. Thanks for quick review @menshikh-iv . I have already addressed your remarks in my code. |
Great work @Stamenov. At a glance, it looks good to me, although do I have some comments. Let me see if I understand the logic of the code, correct me if I'm wrong. First of all, you are inferring
This is what you do, right? I don't think adding the documents to As a side note, the
It makes sense to make it in a single pass, since you're not updating the lambdas, and only one author is being updated. So it may be a good idea to make sure the number of iterations (over each document) is high enough. Maybe let Finally, I must note that it isn't obvious what inference on held-out data should be in the author-topic model. I think that is because observations aren't independent. This method doesn't take into account that many author may contribute to single document, which is really the strength of the AT model. That being said, I think this way is the best way to do it. Sorry for the wall of text 😝 Keep up the good work! |
gensim/models/atmodel.py
Outdated
new_author_name = "placeholder_name" | ||
|
||
# Add new documents in corpus to self.corpus. | ||
self.extend_corpus(corpus) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove line 939. self.inference uses the corpus supplied as input, and calling self.extend_corpus is very slow.
ping @Stamenov, what's status here? |
Sorry, christmas activities. Will try to push the recommended changes these days. |
https://travis-ci.org/RaRe-Technologies/gensim/jobs/324067238 |
@Stamenov thanks! @menshikh-iv will help out with the build error once he's back from holiday (next week) :) |
Great, 10x for the quick response. |
@Stamenov this isn't a tox problem (pip can't install tox package because the network disappeared in Travis), I re-run job, let's wait for the result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests for this functionality
CC @olavurmortensen, can you review this again?
gensim/models/atmodel.py
Outdated
@@ -882,6 +882,84 @@ def get_document_topics(self, word_id, minimum_probability=None): | |||
|
|||
raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.') | |||
|
|||
def get_new_author_topics(self, corpus, minimum_probability=None): | |||
"""Infers topics for new author. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use numpy-style docstrings: http://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html and https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
gensim/models/atmodel.py
Outdated
@@ -882,6 +882,84 @@ def get_document_topics(self, word_id, minimum_probability=None): | |||
|
|||
raise NotImplementedError('Method "get_document_topics" is not valid for the author-topic model. Use the "get_author_topics" method.') | |||
|
|||
def get_new_author_topics(self, corpus, minimum_probability=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's a reason to minimum_probability=None
here (instead of 1e-8)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@menshikh-iv It is the same in get_document_topics
in LDA (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L987), and get_term_topics
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, I should leave it like this, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Stamenov yes, leave as is
gensim/models/atmodel.py
Outdated
len_input_corpus = sum(1 for _ in corpus) | ||
if len_input_corpus == 0: | ||
logger.warning("AuthorTopicModel.get_new_author_topics() called with an empty corpus") | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why "empty return" needed? Maybe better raise exception explicitly (for all empty return
)
gensim/models/atmodel.py
Outdated
if len_input_corpus == 0: | ||
logger.warning("AuthorTopicModel.get_new_author_topics() called with an empty corpus") | ||
return | ||
if not len_input_corpus < self.chunksize: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for this limitation? I don't see it implemented anywhere else.
gensim/models/atmodel.py
Outdated
return | ||
|
||
new_author_name = "placeholder_name" | ||
corpus_doc_idx = list(range(0, len_input_corpus)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead set corpus_doc_idx
to list(range(self.total_docs, self.total_docs + len_input_corpus))
(see comment below).
# Add new author in author2doc and doc into doc2author. | ||
self.author2doc[new_author_name] = corpus_doc_idx | ||
for new_doc_id in corpus_doc_idx: | ||
self.doc2author[new_doc_id] = [new_author_name] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because of how you set corpus_doc_idx
, you're overwriting self.doc2author[0]
(0 through len_corpus_idx
) here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@Stamenov I made some comments in the code. I'm afraid I've written the |
ping @Stamenov, when you plan to finish this PR? |
I am doing this parallel to my thesis, so mids February. |
@Stamenov what else do you need to complete this PR? |
Hi, I am have done some fixing around the builds. |
gensim/models/atmodel.py
Outdated
return pow(self.offset + 1 + 1, -self.decay) | ||
======= | ||
return pow(self.offset + 1, -self.decay) | ||
>>>>>>> Stashed changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
D'oh!
gensim/models/atmodel.py
Outdated
Topic distribution for the given `corpus`. | ||
|
||
""" | ||
# TODO: how should this function look like for get_new_author_topics? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I mentioned I am iteration over a few versions for the function and will commit the best performing.
It was already discussed here that it is not clear how we should implement this one.
gensim/models/atmodel.py
Outdated
except ValueError as e: | ||
# Something went wrong! Rollback temporary changes in object and log | ||
rollback_new_author_chages() | ||
logging.exception(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is critical to be caught, since if it is not, we won't call rollback_new_author_changes*() and the state of the model will be altered with the temp vars.
gensim/models/atmodel.py
Outdated
except ValueError as e: | ||
# Something went wrong! Rollback temporary changes in object and log | ||
rollback_new_author_chages() | ||
logging.exception(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would even much rather catch any exception, rollback the changes and log the exception.
The warning can be an addition to that.
@@ -450,6 +450,31 @@ def testTermTopics(self): | |||
self.assertTrue(isinstance(topic_no, int)) | |||
self.assertTrue(isinstance(probability, float)) | |||
|
|||
def testNewAuthorTopics(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the sanity check, I tried another test case, where I would infer the topics for a new author, using the exact same documents from the corpus that an author already is assigned to. We can expect that the topic distributions should be very similar, as it is, but not close enough so that we can use np.allclose(..,..). Example:
In [2]: model.get_new_author_topics(corpus=corpus[1:3])
Out[2]: [(0, 0.914887958236441), (1, 0.08511204176355897)]
In [3]: model.get_author_topics(author_name="sally")
Out[3]: [(0, 0.9290420023752036), (1, 0.07095799762479651)]
Of course something li this would work.
sally_topics = model.get_author_topics(author_name="sally")
new_authortopics = model.get_new_author_topics(corpus=corpus[1:3])
self.assertTrue(np.allclose(jillnewauthortopics,jilltopics, atol=1e-1))
@menshikh-iv do you approve of this test?
@menshikh-iv I have pushed the test and exception improvement, also the rho() function which works best for me. Hope it is satisfactory. |
Great work @Stamenov 👍 LGTM for me (only fix PEP8 mistakes please) @olavurmortensen ping, PR looks ready, are you approve? |
@Stamenov @menshikh-iv Sorry for not responding, I was sick for a while. I'll take a look soon. |
gensim/models/atmodel.py
Outdated
raise ValueError("AuthorTopicModel.get_new_author_topics() called with an empty corpus") | ||
|
||
new_author_name = "placeholder_name" | ||
corpus_doc_idx = list(range(self.total_docs, self.total_docs + len_input_corpus)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment explaining this line. For example: "indexes representing the documents in the input corpus".
gensim/models/atmodel.py
Outdated
|
||
# Add the new placeholder author to author2id/id2author dictionaries. | ||
num_new_authors = 1 | ||
author_id = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks confusing. Just say author_id = self.num_authors
, or something like that.
if len_input_corpus == 0: | ||
raise ValueError("AuthorTopicModel.get_new_author_topics() called with an empty corpus") | ||
|
||
new_author_name = "placeholder_name" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for completeness, check that this author isn't already in the dictionary.
self.state.gamma = np.vstack([self.state.gamma, gamma_new]) | ||
|
||
# Should not record the sstats, as we are goint to delete the new author after calculated. | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this try block throw an exception if there is a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it does, for example, when corpus is invalid (list of something else), or if the inference is somehow interrupted, or maybe something else. Must make sure to rollback the temporary changes in any case. If makes sense to show the exception to the user, but leave the model state untouched.
I made some comments in the code. I think it would also nice to see some example use-cases on real data, hopefully with good results, to see if this actually works. |
@olavurmortensen I will post a notebook, where I train and predict authors on the Reuters 50x50 dataset soon! |
gensim.models.AuthorTopicModel
@Stamenov how is going? when you plan to finish PR (looks like this almost done). |
Hi, the PR is ready, as far as code and functionality is concerned. |
@Stamenov better to add notebook in current PR (this is hardly-related changes), we'll merge together Also, don't forget to resolve cosmetic issues based on @olavurmortensen comments. |
@Stamenov great, code looks good to merge 👍, the last thing - please add an example to https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/atmodel_tutorial.ipynb |
Thanks. I am working on it. I believe this weel it should be done. |
Hello @Stamenov
|
Hi @menshikh-iv , |
Thanks for your work @Stamenov 👍 nice feature! Thanks @olavurmortensen for review 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am interested in an appropriate value of rhot
for visualizing the AT model in bmabey/pyLDAvis#161
Topic distribution for the given `corpus`. | ||
|
||
""" | ||
def rho(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Stamenov @olavurmortensen Please document the rationale behind this definition of rho.
Why didn't you write return pow(self.offset + 2 , -self.decay)
?
In an earlier version this was return pow(self.offset + 1, -self.decay)
.
Could you provide any hint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a different place you explain rho as following:
https://github.com/RaRe-Technologies/gensim/blob/75dce4b50174b5a1afb37b163b903f7e2af903b0/gensim/models/atmodel.py#L812-L816
Add function
get_new_author_topics()
to infer topics distribution of a new unseen author, by passing acorpus
- list of documents in "bag of words" format.