Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Adding sklearn wrapper for LDA code #932

Merged
merged 48 commits into from
Jan 29, 2017

Conversation

AadityaJ
Copy link
Contributor

creating wrapper to make LdaModel of gensim scikit-learn API enabled

@tmylk
Copy link
Contributor

tmylk commented Oct 10, 2016

Please add tests, a changelog and a quick draft of an ipynb on how to use it with sklearn api.

@devashishd12
Copy link
Contributor

Wouldn't it be useful to also add a partial_fit which wraps update?

@tmylk
Copy link
Contributor

tmylk commented Oct 10, 2016

Yes, update is on the road map

self.setattr(parameter, value)
return self

def fit(self,X,y=None):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: Spaces after commas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestions have been incorporated. thanks

"""
if X is None:
raise AttributeError("Corpus defined as none")
self.lda_model = gensim.models.LdaModel(corpus=X,num_topics=self.n_topics, id2word=self.id2word, passes=self.passes,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: we don't use vertical indent in gensim. Use hanging indent.

(plus, spaces & commas again)

corpus = [dictionary.doc2bow(text) for text in texts]


class TestLdaModel:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All Python classes should inherit from object (new-style classes).

self.assertTrue(isinstance(v, float))

if __name__ == '__main__':
unittest.main()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing newline at the end of file.

eta=self.eta,random_state=self.random_state)
return self.lda_model
self.lda_model = gensim.models.LdaModel(
corpus=X, num_topics=self.n_topics, id2word=self.id2word, passes=self.passes,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent too large (should be a single level).

# might need to do more
def get_term_topics(self,wordid,minimum_probability=None):
return self.lda_model.get_document_topics(bow, minimum_probability=minimum_probability,
minimum_phi_value=minimum_phi_value, per_word_topics=per_word_topics)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No vertical indent in gensim; we use hanging indent (see PEP8 for examples).

self.model=base.LdaModel(id2word=dictionary,n_topics=2,passes=100)
self.model.fit(corpus)

def testPrintTopic(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a partial_fit test



class LdaModel(object):
class LdaModel(models.LdaModel,object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename to SklearnWrapperLdaModel

Copy link
Contributor

@tmylk tmylk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a sklearn pipeline example with a classifier


* Added sklearn wrapper for LdaModel (Basic LDA Model) along with relevant test cases and ipynb draft. (@AadityaJ,
[#932](https://github.com/RaRe-Technologies/gensim/pull/932))
* Add online learning feature to word2vec. (@isohyt [#900](https://github.com/RaRe-Technologies/gensim/pull/900))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve merge conflicts. Only one line should be added to changelog. Remove extra 2 lines about other changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge in develop branch to remove merge conflicts

eval_every=10, iterations=50, gamma_threshold=0.001,
minimum_probability=0.01, random_state=None):
"""
sklearn wrapper for LDA model.derived class for gensim.model.LdaModel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Space after "."

from gensim import models


class SklearnWrapperLdaModel(models.LdaModel,object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename the python file to "SklearnWrapperGensimLdaModel"

"metadata": {},
"source": [
"The wrapper available (as of now) are :\n",
"* LdaModel (```gensim.sklearn_integration.base.LdaModel```),which implements gensim's ```LdaModel``` in a scikit-learn interface"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update ipynb with new names of .py file and of the class

dictionary_up = Dictionary(texts_update)
corpus_up = [dictionary_up.doc2bow(text) for text in texts_update]
self.model.partial_fit(corpus_up)
self.testPrintTopic()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test that checks that the values changed after the update

corpus = [dictionary.doc2bow(text) for text in texts]


class TestLdaModel(unittest.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please rename the tests to TestSklearnLDAWrapper

Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start!

I didn't review the notebook but the code is in the right direction.

Will need some code style & language polishing before merging (punctuation, missing/extra whitespace, capitalization...).

@@ -5,6 +5,10 @@ Unreleased:

None

0.13.5, 2016-12-31

* Added sklearn wrapper for LdaModel (Basic LDA Model) along with relevant test cases and ipynb draft. (@AadityaJ,[#932](https://github.com/RaRe-Technologies/gensim/pull/932))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is "Basic LDA Model"?

@@ -0,0 +1,116 @@
#!/usr/bin/env python
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a good filename; please use lower case, with underscores _ to separate expressions where necessary.

from scipy.sparse.csr import csr_matrix


class SklearnWrapperLdaModel(models.LdaModel,object):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: space after comma.

Actually, not relevant at all, because LdaModel already inherits from object naturally.

"""
Base LDA module
"""
def __init__(self, corpus=None, num_topics=100, id2word=None,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code style: no vertical indent.

"""
if self.corpus:
models.LdaModel.__init__(
self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No vertical indent.

from gensim import matutils

texts = [['complier', 'system', 'computer'],
['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation.

self.corpus = corpus
self.num_topics = num_topics
self.id2word = id2word
self.distributed = distributed
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think stuff like distributed would really work in sklearn. Same with training or storing very large models (sklearn makes lots of deep object copies internally).

self.gamma_threshold = gamma_threshold
self.minimum_probability = minimum_probability
self.random_state = random_state
"""
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use normal # code comments (not docstring """ comments).

Warnings: Must for sklearn API. Do not Remove.
"""
if isinstance(X, csr_matrix):
self.corpus = matutils.Sparse2Corpus(X)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about other cases of X? Numpy array? List?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used csr matrix since it is the return type of fit_transform for CountVectorizer as well as other vectorizers (TfidfVectorizer, DictVectorizer)

Return topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.
Warnings: Must for sklearn API. Do not Remove.
"""
return self.get_document_topics(
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right -- transform accepts a corpus (~sequence or array of multiple examples), not a single document (~one example).

@AadityaJ
Copy link
Contributor Author

AadityaJ commented Jan 6, 2017

Thanks for the comments and suggestions. I am making some of the changes today itself. Rest I'll do in a couple of days.

@tmylk
Copy link
Contributor

tmylk commented Jan 8, 2017

@AadityaJ About the RST files. Here is an example of what needs to be added. Here is how to re-generate API ref and make sure that new files are generated.

eval_every=self.eval_every, iterations=self.iterations,
gamma_threshold=self.gamma_threshold, minimum_probability=self.minimum_probability,
random_state=self.random_state)
self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much indentation. Hanging indent would have the line starting one indent more than line 46

eval_every=self.eval_every, iterations=self.iterations,
gamma_threshold=self.gamma_threshold, minimum_probability=self.minimum_probability,
random_state=self.random_state)
self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much indent as above

@@ -98,8 +98,8 @@ def transform(self, bow, minimum_probability=None, minimum_phi_value=None, per_w
Returns the topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples.
"""
return self.get_document_topics(
bow, minimum_probability=minimum_probability,
minimum_phi_value=minimum_phi_value, per_word_topics=per_word_topics)
bow, minimum_probability=minimum_probability,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too much indent as above

@tmylk
Copy link
Contributor

tmylk commented Jan 10, 2017

@AadityaJ Please add the new class to gensim/docs/src/apiref.rst

@piskvorky piskvorky changed the title [WIP]Adding sklearn wrapper for LDA code [WIP] Adding sklearn wrapper for LDA code Jan 10, 2017
@tmylk tmylk merged commit 0e0c082 into piskvorky:develop Jan 29, 2017
@tmylk
Copy link
Contributor

tmylk commented Feb 14, 2017

@AadityaJ Could you please update the tutorial for clf.fit to use the output of LDA? Currently it is using sklearn's CountVectorizer in X as input.

@kris-singh kris-singh mentioned this pull request Mar 17, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants