Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add inference for new unseen author for gensim.models.AuthorTopicModel #1766

Merged
merged 24 commits into from
Mar 26, 2018
Merged
Changes from 14 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
86 changes: 83 additions & 3 deletions gensim/models/atmodel.py
Original file line number Diff line number Diff line change
Expand Up @@ -919,15 +919,95 @@ def get_document_topics(self, word_id, minimum_probability=None):
'Use the "get_author_topics" method.'
)

def get_new_author_topics(self, corpus, minimum_probability=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's a reason to minimum_probability=None here (instead of 1e-8)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@menshikh-iv It is the same in get_document_topics in LDA (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/ldamodel.py#L987), and get_term_topics as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, I should leave it like this, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stamenov yes, leave as is

"""Infers topics for new author.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Infers a topic distribution for a new author over the passed corpus of docs,
assuming that all documents are from this single new author.

Parameters
----------
corpus : iterable of iterable of (int, int)
Corpus in BoW format.
minimum_probability : float, optional
Ignore topics with probability below this value, if None - 1e-8 is used.

Returns
-------
list of (int, float)
Topic distribution for the given `corpus`.

"""
# TODO: how should this function look like for get_new_author_topics?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's about this TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned I am iteration over a few versions for the function and will commit the best performing.
It was already discussed here that it is not clear how we should implement this one.

def rho():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Stamenov @olavurmortensen Please document the rationale behind this definition of rho.
Why didn't you write return pow(self.offset + 2 , -self.decay)?
In an earlier version this was return pow(self.offset + 1, -self.decay).
Could you provide any hint?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<<<<<<< Updated upstream
return pow(self.offset + 1 + 1, -self.decay)
=======
return pow(self.offset + 1, -self.decay)
>>>>>>> Stashed changes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

D'oh!


# Wrap in fuction to avoid code duplication.
def rollback_new_author_chages():
self.state.gamma = self.state.gamma[0:-1]

del self.author2doc[new_author_name]
a_id = self.author2id[new_author_name]
del self.id2author[a_id]
del self.author2id[new_author_name]

for new_doc_id in corpus_doc_idx:
del self.doc2author[new_doc_id]

try:
len_input_corpus = len(corpus)
except TypeError:
logger.warning("input corpus stream has no len(); counting documents")
len_input_corpus = sum(1 for _ in corpus)
if len_input_corpus == 0:
raise ValueError("AuthorTopicModel.get_new_author_topics() called with an empty corpus")

new_author_name = "placeholder_name"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for completeness, check that this author isn't already in the dictionary.

corpus_doc_idx = list(range(self.total_docs, self.total_docs + len_input_corpus))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining this line. For example: "indexes representing the documents in the input corpus".


# Add the new placeholder author to author2id/id2author dictionaries.
num_new_authors = 1
author_id = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks confusing. Just say author_id = self.num_authors, or something like that.

self.author2id[new_author_name] = author_id + self.num_authors
self.id2author[author_id + self.num_authors] = new_author_name

# Add new author in author2doc and doc into doc2author.
self.author2doc[new_author_name] = corpus_doc_idx
for new_doc_id in corpus_doc_idx:
self.doc2author[new_doc_id] = [new_author_name]
Copy link
Contributor

@olavurmortensen olavurmortensen Jan 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of how you set corpus_doc_idx, you're overwriting self.doc2author[0] (0 through len_corpus_idx) here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed


gamma_new = self.random_state.gamma(100., 1. / 100., (num_new_authors, self.num_topics))
self.state.gamma = np.vstack([self.state.gamma, gamma_new])

# Should not record the sstats, as we are goint to delete the new author after calculated.
try:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this try block throw an exception if there is a problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it does, for example, when corpus is invalid (list of something else), or if the inference is somehow interrupted, or maybe something else. Must make sure to rollback the temporary changes in any case. If makes sense to show the exception to the user, but leave the model state untouched.

gammat, _ = self.inference(
corpus, self.author2doc, self.doc2author, rho(),
collect_sstats=False, chunk_doc_idx=corpus_doc_idx
)
except ValueError as e:
# Something went wrong! Rollback temporary changes in object and log
rollback_new_author_chages()
logging.exception(e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe better raise it (or at least warning.warn)? If user has no configured logging, this will be silently skipped. How much is this exception critical?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is critical to be caught, since if it is not, we won't call rollback_new_author_changes*() and the state of the model will be altered with the temp vars.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would even much rather catch any exception, rollback the changes and log the exception.
The warning can be an addition to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand your point, most critical here - rollback, so, let's move rollback to final section (and this will be done anyway).

Why don't you want to raise an exception (with the condition that rollback anyway happens)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        try:
            gammat, _ = self.inference(
                corpus, self.author2doc, self.doc2author, rho(),
                collect_sstats=False, chunk_doc_idx=corpus_doc_idx
            )
            new_author_topics = self.get_author_topics(new_author_name, minimum_probability)
        finally:
            rollback_new_author_chages()
        return new_author_topics

I would propose this. In this case the exception is raised, but rollback is called.

return

new_author_topics = self.get_author_topics(new_author_name, minimum_probability)
rollback_new_author_chages()
return new_author_topics

def get_author_topics(self, author_name, minimum_probability=None):
"""
Return topic distribution the given author, as a list of
Return topic distribution the given author.

Input as as a list of
(topic_id, topic_probability) 2-tuples.
Ignore topics with very low probability (below `minimum_probability`).

Obtaining topic probabilities of each word, as in LDA (via `per_word_topics`),
is not supported.

"""

author_id = self.author2id[author_name]
Expand Down