Fix Pipeline #1213

kris-singh · 2017-03-14T02:19:38Z

Solves PR #932

kris-singh · 2017-03-14T03:08:19Z

Do we have to add sklearn as a dependency? Ready for review.

tmylk

Please provide a pipeline in the tests and the ipynb where the output of LDA is used in logistic regression.

tmylk · 2017-03-14T19:45:04Z

gensim/test/test_sklearn_integration.py

@@ -67,5 +68,15 @@ def testCSRMatrixConversion(self):
            self.assertTrue(isinstance(v, six.string_types))
            self.assertTrue(isinstance(k, int))

+    def testPipline(self):
+        model = SklearnWrapperLdaModel(id2word=dictionary, num_topics=2, passes=100, minimum_probability=0, random_state=numpy.random.seed(0))
+        text_lda = Pipeline([('model', model)])


Can a pipeline contain two things? From lda to logistic regression would be good. Also could you please add it to the tutorial.

Do you mean that we use LDA as a feature extractor and then use its output in the logistic regression? I thought of this and modified the transform function accordingly.

tmylk · 2017-03-14T19:45:42Z

docs/notebooks/sklearn_wrapper.ipynb

-   "source": []
+   "source": [
+    "def scorer(estimator, X,y=None):\n",
+    "    goodcm = CoherenceModel(model=estimator, texts= texts, dictionary=estimator.id2word, coherence='c_v')\n",


This grid search raises an exception in the ipynb. Is it possible to have it fixed?
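
For context, scikit-learn's grid search accepts any callable with the signature `scorer(estimator, X, y=None)` that returns a single number, where higher is better. The sketch below is only an illustration under stated assumptions, not the notebook's code: it swaps in scikit-learn's own `LatentDirichletAllocation` and its built-in `score` method in place of the gensim wrapper and `CoherenceModel`, and the toy documents are made up.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Made-up toy corpus, for illustration only.
docs = [
    "pitcher throws the baseball to the catcher",
    "the batter hit a home run in the baseball game",
    "the team won the game in the ninth inning",
    "the cipher encrypts the message with a secret key",
    "public key encryption protects the secret message",
    "the attacker tried to break the cipher key",
]
X = CountVectorizer().fit_transform(docs)


def scorer(estimator, X, y=None):
    # Stand-in for a coherence score (assumption): LDA's approximate
    # log-likelihood on the held-out fold. Must return a single number.
    return estimator.score(X)


grid = GridSearchCV(
    LatentDirichletAllocation(random_state=0),
    param_grid={"n_components": [2, 3]},
    scoring=scorer,
    cv=2,
)
grid.fit(X)  # no labels needed for an unsupervised estimator
best_num_topics = grid.best_params_["n_components"]
```

A scorer that returns something other than a plain number, or that raises internally, is one common way for a grid search like this to fail; whether that is the cause in the notebook is not stated here.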

kris-singh · 2017-03-16T20:29:44Z

@tmylk Could you have a look at Travis? I don't understand why it is failing.

tmylk · 2017-03-17T00:38:43Z

Tests fixed by smart_open update

tmylk · 2017-03-17T00:57:06Z

gensim/test/test_sklearn_integration.py

@@ -67,5 +84,21 @@ def testCSRMatrixConversion(self):
            self.assertTrue(isinstance(v, six.string_types))
            self.assertTrue(isinstance(k, int))

+    def testPipline(self):


typo in name of the function

tmylk · 2017-03-17T00:57:35Z

gensim/test/test_sklearn_integration.py

+        data = fetch_20newsgroups(subset='train',
+                                  categories=cats,
+                                  shuffle=True)
+        text_lda = Pipeline([('features', vec),('model', model)])


please add logistic regression to the pipeline to analyse output of the lda

I do that in the ipynb example, as you had suggested. Also, I am not getting good accuracy using the features from the LDA transform: around 52%, which is meaningless for a binary classification task.

Please add it to the test.
Accuracy is not important here. It is about being in a compatible format.
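
The format compatibility discussed above can be illustrated with a minimal sketch. It is not the PR's test: it uses scikit-learn's own `LatentDirichletAllocation` as a stand-in for `SklearnWrapperLdaModel`, and the documents and labels are invented. The point is only that the topic step's `transform` returns a 2D array of shape `(n_documents, n_topics)`, which is what a downstream classifier in a `Pipeline` expects.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data: 0 = baseball, 1 = cryptography.
docs = [
    "pitcher throws the baseball to the catcher",
    "the batter hit a home run in the baseball game",
    "the team won the game in the ninth inning",
    "the cipher encrypts the message with a secret key",
    "public key encryption protects the secret message",
    "the attacker tried to break the cipher key",
]
labels = [0, 0, 0, 1, 1, 1]

pipe = Pipeline([
    ("features", CountVectorizer()),  # raw text -> term counts
    ("lda", LatentDirichletAllocation(n_components=2, random_state=0)),
    ("clf", LogisticRegression()),    # topic weights -> class label
])
pipe.fit(docs, labels)

# Inspect the intermediate topic features that feed the classifier.
counts = pipe.named_steps["features"].transform(docs)
topic_features = pipe.named_steps["lda"].transform(counts)

predictions = pipe.predict(docs)
```

As noted in the thread, accuracy on such a tiny example is not meaningful; the sketch only checks that the steps chain together.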

tmylk · 2017-03-17T00:58:00Z

gensim/test/test_sklearn_integration.py

+        vec = CountVectorizer(min_df=10, stop_words='english')
+        rand = numpy.random.mtrand.RandomState(1) # set seed for getting same result
+        cats = ['rec.sport.baseball', 'sci.crypt']
+        data = fetch_20newsgroups(subset='train',


there are smaller datasets in test_data folder. downloading a lot of data makes tests run too long

@tmylk I was not able to find a dataset that has labels. If you know of one, can you please tell me which one to use?

a tiny 100k subset of newsgroups would be ok.

what is the size of the text docs that you are adding?

kris-singh · 2017-03-19T07:00:00Z

@tmylk All changes made. Ready for merge. Please let me know if further changes are required.

kris-singh · 2017-03-20T02:13:46Z

@tmylk Are there any other issues that would help with NMF that I could look at?

tmylk · 2017-03-20T18:49:49Z

gensim/test/test_sklearn_integration.py

@@ -86,19 +92,15 @@ def testCSRMatrixConversion(self):

    def testPipline(self):


typo in test name

tmylk · 2017-03-20T18:57:26Z

Thanks! The PR looks good.
For completeness, could you please remove the section inappropriately called "Using together with Scikit learn's Logistic Regression". That section doesn't use gensim at all so shouldn't be in the notebook. It's an omission by the original author.

Please put your new Pipeline section instead of it so users can find it faster.

kris-singh · 2017-03-21T13:33:41Z

Changes made. Also, the size of the test file is around 300 KB.

tmylk · 2017-03-21T18:50:45Z

Thanks for the new feature!

piskvorky · 2017-04-09T07:05:13Z

@tmylk this PR has multiple coding style and PEP8 issues. Please do not merge PRs that are not ready for merging.

piskvorky · 2017-04-09T07:13:09Z

docs/notebooks/sklearn_wrapper.ipynb

   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
-    "from sklearn import linear_model"
+    "def scorer(estimator, X,y=None):\n",


PEP8: space after comma.

piskvorky · 2017-04-09T07:14:23Z

docs/notebooks/sklearn_wrapper.ipynb

+   },
+   "outputs": [],
+   "source": [
+    "id2word=Dictionary(map(lambda x : x.split(),data.data))\n",


map is discouraged -- use comprehensions and generators.

Also, PEP8 -- space after comma, spaces around =.
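
A small sketch of the style point, using made-up documents rather than the notebook's `data.data`. All three forms tokenise the same way; the comprehension and the generator are the forms the comment recommends.

```python
# Made-up documents standing in for data.data (assumption).
documents = ["the quick brown fox", "jumps over the lazy dog"]

# Style used in the notebook: map with a lambda.
tokens_with_map = list(map(lambda x: x.split(), documents))

# Suggested style: a list comprehension, with PEP8 spacing.
tokens_with_comprehension = [doc.split() for doc in documents]

# A generator expression does the same work lazily, one document at a time.
tokens_lazy = (doc.split() for doc in documents)

# With gensim installed, the notebook line could then read (not run here):
#   id2word = Dictionary(doc.split() for doc in data.data)
```

The generator form avoids building the whole token list in memory, which matters more for large corpora than for this toy input.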

piskvorky · 2017-04-09T07:15:09Z

docs/notebooks/sklearn_wrapper.ipynb

-    "clf=linear_model.LogisticRegression(penalty='l1', C=0.1) #l1 penalty used\n",
-    "clf.fit(X,data.target)\n",
-    "print_features(clf,vocab)"
+    "model=SklearnWrapperLdaModel(num_topics=15,id2word=id2word,iterations=50, random_state=37)\n",


PEP8: spaces around assignment operator =. Other space/formatting/PEP8 issues further down this file, but this is the last comment.

piskvorky · 2017-04-09T07:16:12Z

gensim/sklearn_integration/sklearn_wrapper_gensim_ldamodel.py

@@ -109,4 +134,4 @@ def partial_fit(self, X):
        if sparse.issparse(X):
            X = matutils.Sparse2Corpus(X)

-        self.update(corpus=X)
+        self.update(corpus=X)


PEP8: newline at the end of file.

Fix Pipeline

8098e56

sklearn dependency

40ffca0

tmylk suggested changes Mar 14, 2017

[email protected] added 5 commits March 16, 2017 04:48

Changes Added

36b8a81

Changes Made

7391fcc

minor fix

efe96e6

.travis

a133b49

try

970df21

Fix for >3.5

768e39b

tmylk reviewed Mar 17, 2017

[email protected] added 3 commits March 19, 2017 10:46

Changes Made

3bccc20

Compressed Data

b709026

add data

6d15ae7

tmylk reviewed Mar 20, 2017

Typo Fixed

de600ba

tmylk merged commit 97cd64f into piskvorky:develop Mar 21, 2017

piskvorky reviewed Apr 9, 2017
