Enabling inference on held-out data in the author-topic model #1166
Labels: difficulty medium (requires good gensim understanding and Python skills), feature (describes a new feature)
At the moment, it is not possible to make inference on held-out data in the `AuthorTopicModel`, and as a result it is not possible to evaluate model fit (bound) on new data either.

In LDA, we infer on held-out documents by calling `gammad, _ = self.inference([doc])`, learning the document's topic distribution `gamma` (local parameter) without updating the model (`sstats`, global parameter), by (implicitly) setting `collect_sstats=False`. This allows us to compute the bound on those documents.

It is not 100% clear what inference on held-out data means in the author-topic model. I suggest this definition, analogous to LDA: computing the topic distribution `gamma` for a new author with documents `docs` without updating the model (i.e. no change to `sstats`), then computing the bound on these held-out documents and authors.

The inference algorithm used in the `AuthorTopicModel`, as well as the model class, is very similar to `LdaModel`. Therefore, anyone with experience with `LdaModel` should find it relatively easy to jump into the `AuthorTopicModel`.

A report detailing the theory as well as the implementation is available here. This is my master's thesis.
I will offer my assistance with anything related to the `AuthorTopicModel` to anyone willing to take on this issue, for example on a GitHub PR or on Gitter.