Enabling inference on held-out data in the author-topic model #1166

Closed
olavurmortensen opened this issue Feb 24, 2017 · 2 comments
Labels
difficulty medium Medium issue: required good gensim understanding & python skills feature Issue described a new feature

Comments

@olavurmortensen
Contributor

At the moment, it is not possible to perform inference on held-out data in the AuthorTopicModel. As a result, it is not possible to evaluate model fit (the bound) on new data either.

In LDA, we infer on held-out documents by calling `gammad, _ = self.inference([doc])`. This learns the document's topic distribution `gamma` (a local parameter) without updating the model's `sstats` (the global parameter), because `collect_sstats` is implicitly left at `False`. This allows us to compute the bound on those documents.

It is not entirely clear what inference on held-out data means in the author-topic model. I suggest this definition, analogous to LDA: compute the topic distribution `gamma` for a new author with documents `docs` without updating the model (i.e. no change to `sstats`), then compute the bound on these held-out documents and authors.
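A rough, standard-library-only sketch of the first half of this definition (fitting `gamma` for a new author while the topics stay frozen) could look like the following. It is not gensim's implementation: it assumes each held-out document has the new author as its only author, so the documents can be pooled, and it uses a fixed point estimate `beta` of the topic-word probabilities in place of the variational parameter. All names (`digamma`, `infer_new_author_gamma`, `beta`) are hypothetical.

```python
import math


def digamma(x):
    """Digamma for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def infer_new_author_gamma(docs, beta, alpha=0.1, iterations=100, tol=1e-6):
    """Fit the local parameter gamma for one new author; beta is only read."""
    num_topics = len(beta)
    # Pool word counts over all of the new author's documents.
    counts = {}
    for doc in docs:
        for word_id, n in doc:
            counts[word_id] = counts.get(word_id, 0) + n
    total = sum(counts.values())
    gamma = [alpha + total / num_topics] * num_topics
    for _ in range(iterations):
        psi_sum = digamma(sum(gamma))
        exp_elog_theta = [math.exp(digamma(g) - psi_sum) for g in gamma]
        new_gamma = [alpha] * num_topics
        for word_id, n in counts.items():
            # Responsibility of each topic for this word (phi).
            weights = [exp_elog_theta[k] * beta[k][word_id]
                       for k in range(num_topics)]
            norm = sum(weights)
            for k in range(num_topics):
                new_gamma[k] += n * weights[k] / norm
        change = max(abs(a - b) for a, b in zip(gamma, new_gamma))
        gamma = new_gamma
        if change < tol:
            break
    return gamma


# Two frozen topics over a four-word vocabulary (made-up numbers).
beta = [[0.70, 0.20, 0.05, 0.05],
        [0.05, 0.05, 0.20, 0.70]]
# Two bag-of-words documents by the new author, as (word_id, count) pairs.
new_author_docs = [[(0, 3), (1, 1)], [(0, 2)]]
new_author_gamma = infer_new_author_gamma(new_author_docs, beta)
print(new_author_gamma)
```

Because `beta` is never written to, the global state is untouched, which mirrors the "no change to `sstats`" requirement. Computing the bound for the new author is left out of this sketch.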


The inference algorithm used in the AuthorTopicModel, as well as the model class, is very similar to LdaModel. Therefore, anyone with experience with LdaModel should find it relatively easy to jump into the AuthorTopicModel.

A report detailing the theory as well as the implementation is available here. This is my master's thesis.

I will offer my assistance with anything related to the AuthorTopicModel to anyone willing to take on this issue, for example on a GitHub PR or on Gitter.

@tmylk tmylk added the wishlist Feature request label Feb 25, 2017
@menshikh-iv menshikh-iv added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills and removed wishlist Feature request labels Oct 2, 2017
@nickkimer

nickkimer commented Apr 23, 2018

Has this been implemented anywhere?

@menshikh-iv
Contributor

@nickkimer it is now possible to infer a vector for an unseen author, see #1766 (currently in develop). I think this addition resolves this issue. You can install the current develop branch or wait for the next release, 3.5.0.
