-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Pipeline #1213
Fix Pipeline #1213
Conversation
Do we have to add sklearn as dependencies. Ready for review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please provide a pipeline in tests and ipynb where output is lda is used in logistic regression
@@ -67,5 +68,15 @@ def testCSRMatrixConversion(self): | |||
self.assertTrue(isinstance(v, six.string_types)) | |||
self.assertTrue(isinstance(k, int)) | |||
|
|||
def testPipline(self): | |||
model = SklearnWrapperLdaModel(id2word=dictionary, num_topics=2, passes=100, minimum_probability=0, random_state=numpy.random.seed(0)) | |||
text_lda = Pipeline([('model', model)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can a pipeline contain two things? From lda to logistic regression would be good. Also could you please add it to the tutorial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do, you mean to say that we use lda as a feature extractor. And then use it to in the logistic regression. I thought of this and modified the transform function accordingly.
docs/notebooks/sklearn_wrapper.ipynb
Outdated
"source": [] | ||
"source": [ | ||
"def scorer(estimator, X,y=None):\n", | ||
" goodcm = CoherenceModel(model=estimator, texts= texts, dictionary=estimator.id2word, coherence='c_v')\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gridsearch returns exception in the ipynb. Is it possible to have it fixed?
@tmylk Could you have a look at the Travis . I don't understand why is it failing. |
Tests fixed by smart_open update |
@@ -67,5 +84,21 @@ def testCSRMatrixConversion(self): | |||
self.assertTrue(isinstance(v, six.string_types)) | |||
self.assertTrue(isinstance(k, int)) | |||
|
|||
def testPipline(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in name of the function
data = fetch_20newsgroups(subset='train', | ||
categories=cats, | ||
shuffle=True) | ||
text_lda = Pipeline([('features', vec),('model', model)]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add logistic regression to the pipeline to analyse output of the lda
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do that in the ipynb example as you had suggested. Also, I am not getting good accuracy using the features from lda transform around 52% which is meaningless for a binary classification task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add it to the test.
accuracy is not important here. it is about being in compatible format
vec = CountVectorizer(min_df=10, stop_words='english') | ||
rand = numpy.random.mtrand.RandomState(1) # set seed for getting same result | ||
cats = ['rec.sport.baseball', 'sci.crypt'] | ||
data = fetch_20newsgroups(subset='train', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are smaller datasets in test_data folder. downloading a lot of data makes tests run too long
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tmylk i was not able to find a dataset that the labels. If you know can you please tell me which one to use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a tiny 100k subset of newsgroups would be ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the size of the text docs that you are adding?
@tmylk All changes made. Ready for merge. Please let me know if further changes are required. |
@tmylk any other issues that will help with nmf that i could possibly look at. |
@@ -86,19 +92,15 @@ def testCSRMatrixConversion(self): | |||
|
|||
def testPipline(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo in test name
Thanks! The PR looks good. Please put your new Pipeline section instead of it so users can find it faster. |
Changes made. Also the size of the test file is around 300 kb. |
Thanks for the new feature! |
@tmylk this PR has multiple coding style and PEP8 issues. Please do not merge PRs that are not ready for merging. |
"metadata": { | ||
"collapsed": true | ||
}, | ||
"outputs": [], | ||
"source": [ | ||
"from sklearn import linear_model" | ||
"def scorer(estimator, X,y=None):\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: space after comma.
}, | ||
"outputs": [], | ||
"source": [ | ||
"id2word=Dictionary(map(lambda x : x.split(),data.data))\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map
is discouraged -- use comprehensions and generators.
Also, PEP8 -- space after comma, spaces around =
.
"clf=linear_model.LogisticRegression(penalty='l1', C=0.1) #l1 penalty used\n", | ||
"clf.fit(X,data.target)\n", | ||
"print_features(clf,vocab)" | ||
"model=SklearnWrapperLdaModel(num_topics=15,id2word=id2word,iterations=50, random_state=37)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: spaces around assignment operator =
. Other space/formatting/PEP8 issues further down this file, but this is the last comment.
@@ -109,4 +134,4 @@ def partial_fit(self, X): | |||
if sparse.issparse(X): | |||
X = matutils.Sparse2Corpus(X) | |||
|
|||
self.update(corpus=X) | |||
self.update(corpus=X) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8: newline at the end of file.
Solves PR #932