-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Adding sklearn wrapper for LDA code #932
Conversation
Please add tests, a changelog and a quick draft of an ipynb on how to use it with sklearn api. |
Wouldn't it be useful to also add a |
Yes, update is on the road map |
self.setattr(parameter, value) | ||
return self | ||
|
||
def fit(self,X,y=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: Spaces after commas.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestions have been incorporated. thanks
""" | ||
if X is None: | ||
raise AttributeError("Corpus defined as none") | ||
self.lda_model = gensim.models.LdaModel(corpus=X,num_topics=self.n_topics, id2word=self.id2word, passes=self.passes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style: we don't use vertical indent in gensim. Use hanging indent.
(plus, spaces & commas again)
corpus = [dictionary.doc2bow(text) for text in texts] | ||
|
||
|
||
class TestLdaModel: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All Python classes should inherit from object (new-style classes).
self.assertTrue(isinstance(v, float)) | ||
|
||
if __name__ == '__main__': | ||
unittest.main() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing newline at the end of file.
eta=self.eta,random_state=self.random_state) | ||
return self.lda_model | ||
self.lda_model = gensim.models.LdaModel( | ||
corpus=X, num_topics=self.n_topics, id2word=self.id2word, passes=self.passes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indent too large (should be a single level).
# might need to do more | ||
def get_term_topics(self,wordid,minimum_probability=None): | ||
return self.lda_model.get_document_topics(bow, minimum_probability=minimum_probability, | ||
minimum_phi_value=minimum_phi_value, per_word_topics=per_word_topics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No vertical indent in gensim; we use hanging indent (see PEP8 for examples).
self.model=base.LdaModel(id2word=dictionary,n_topics=2,passes=100) | ||
self.model.fit(corpus) | ||
|
||
def testPrintTopic(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a partial_fit test
|
||
|
||
class LdaModel(object): | ||
class LdaModel(models.LdaModel,object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename to SklearnWrapperLdaModel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a sklearn pipeline example with a classifier
|
||
* Added sklearn wrapper for LdaModel (Basic LDA Model) along with relevant test cases and ipynb draft. (@AadityaJ, | ||
[#932](https://github.com/RaRe-Technologies/gensim/pull/932)) | ||
* Add online learning feature to word2vec. (@isohyt [#900](https://github.com/RaRe-Technologies/gensim/pull/900)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please resolve merge conflicts. Only one line should be added to changelog. Remove extra 2 lines about other changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please merge in develop branch to remove merge conflicts
eval_every=10, iterations=50, gamma_threshold=0.001, | ||
minimum_probability=0.01, random_state=None): | ||
""" | ||
sklearn wrapper for LDA model.derived class for gensim.model.LdaModel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Space after "."
from gensim import models | ||
|
||
|
||
class SklearnWrapperLdaModel(models.LdaModel,object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename the python file to "SklearnWrapperGensimLdaModel"
"metadata": {}, | ||
"source": [ | ||
"The wrapper available (as of now) are :\n", | ||
"* LdaModel (```gensim.sklearn_integration.base.LdaModel```),which implements gensim's ```LdaModel``` in a scikit-learn interface" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update ipynb with new names of .py file and of the class
dictionary_up = Dictionary(texts_update) | ||
corpus_up = [dictionary_up.doc2bow(text) for text in texts_update] | ||
self.model.partial_fit(corpus_up) | ||
self.testPrintTopic() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a test that checks that the values changed after the update
corpus = [dictionary.doc2bow(text) for text in texts] | ||
|
||
|
||
class TestLdaModel(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename the tests to TestSklearnLDAWrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start!
I didn't review the notebook but the code is in the right direction.
Will need some code style & language polishing before merging (punctuation, missing/extra whitespace, capitalization...).
@@ -5,6 +5,10 @@ Unreleased: | |||
|
|||
None | |||
|
|||
0.13.5, 2016-12-31 | |||
|
|||
* Added sklearn wrapper for LdaModel (Basic LDA Model) along with relevant test cases and ipynb draft. (@AadityaJ,[#932](https://github.com/RaRe-Technologies/gensim/pull/932)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is "Basic LDA Model"?
@@ -0,0 +1,116 @@ | |||
#!/usr/bin/env python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a good filename; please use lower case, with underscores _
to separate expressions where necessary.
from scipy.sparse.csr import csr_matrix | ||
|
||
|
||
class SklearnWrapperLdaModel(models.LdaModel,object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: space after comma.
Actually, not relevant at all, because LdaModel
already inherits from object naturally.
""" | ||
Base LDA module | ||
""" | ||
def __init__(self, corpus=None, num_topics=100, id2word=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code style: no vertical indent.
""" | ||
if self.corpus: | ||
models.LdaModel.__init__( | ||
self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No vertical indent.
from gensim import matutils | ||
|
||
texts = [['complier', 'system', 'computer'], | ||
['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect indentation.
self.corpus = corpus | ||
self.num_topics = num_topics | ||
self.id2word = id2word | ||
self.distributed = distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think stuff like distributed
would really work in sklearn. Same with training or storing very large models (sklearn makes lots of deep object copies internally).
self.gamma_threshold = gamma_threshold | ||
self.minimum_probability = minimum_probability | ||
self.random_state = random_state | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use normal #
code comments (not docstring """
comments).
Warnings: Must for sklearn API. Do not Remove. | ||
""" | ||
if isinstance(X, csr_matrix): | ||
self.corpus = matutils.Sparse2Corpus(X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about other cases of X
? Numpy array? List?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used csr matrix since it is the return type of fit_transform for CountVectorizer as well as other vectorizers (TfidfVectorizer, DictVectorizer)
Return topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples. | ||
Warnings: Must for sklearn API. Do not Remove. | ||
""" | ||
return self.get_document_topics( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look right -- transform
accepts a corpus (~sequence or array of multiple examples), not a single document (~one example).
Thanks for the comments and suggestions. I am making some of the changes today itself. Rest I'll do in a couple of days. |
eval_every=self.eval_every, iterations=self.iterations, | ||
gamma_threshold=self.gamma_threshold, minimum_probability=self.minimum_probability, | ||
random_state=self.random_state) | ||
self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too much indentation. Hanging indent would have the line starting one indent more than line 46
eval_every=self.eval_every, iterations=self.iterations, | ||
gamma_threshold=self.gamma_threshold, minimum_probability=self.minimum_probability, | ||
random_state=self.random_state) | ||
self, corpus=self.corpus, num_topics=self.num_topics, id2word=self.id2word, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too much indent as above
@@ -98,8 +98,8 @@ def transform(self, bow, minimum_probability=None, minimum_phi_value=None, per_w | |||
Returns the topic distribution for the given document bow, as a list of (topic_id, topic_probability) 2-tuples. | |||
""" | |||
return self.get_document_topics( | |||
bow, minimum_probability=minimum_probability, | |||
minimum_phi_value=minimum_phi_value, per_word_topics=per_word_topics) | |||
bow, minimum_probability=minimum_probability, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too much indent as above
@AadityaJ Please add the new class to |
@AadityaJ Could you please update the tutorial for |
creating wrapper to make LdaModel of gensim scikit-learn API enabled