[WIP] Adding sklearn wrapper for LDA code #932
Merged

Commits (48 total; the diff below reflects changes from 38 of them). All commits are by AadityaJ:
- 08f417c adding basic sklearn wrapper for LDA code
- 61a6f8c updating changelog
- 66be324 adding test case,adding id2word,deleting showtopics
- cffa95b adding relevant ipynb
- 10badc6 adding transfrom and other get methods and modifying print_topics
- 62a4d2f stylizing code to follow conventions
- b7eff2d removing redundant default argumen values
- 2a193fd adding partial_fit
- a32f8dc adding a line in test_sklearn_integration
- a048ddc using LDAModel as Parent Class
- ac1d28e adding docs, modifying getparam
- 0d6cc0a changing class name.Adding comments
- 5d8c1a6 adding test case for update and transform
- 894784c adding init
- 7a5ca4b updating changes,fixed typo and changing file name
- b35baba deleted base.py
- 13a136d adding better testPartialFit method and minor changes due to change i…
- 682f045 change name of test class
- 9fda951 adding changes in classname to ipynb
- 380ea5f Merge branch 'develop' into sklearn_lda
- e2485d4 Updating CHANGELOG.md
- 3015896 Updated Main Model. Added fit_predict to class for example
- a76eda4 added sklearn countvectorizer example to ipynb
- 97c1530 adding logistic regression example
- 20a63ac adding if condition for csr_matrix to ldamodel
- c0b2c5c adding check for fit csrmatrix also stylizing code
- bd656a8 Merge branch 'develop' into sklearn_lda
- d749ba0 minor bug.solved, fit should convert X to corpus
- 21119c5 removing fit_predict.adding csr_matrix check for update
- 14f984b minor updates in ipynb
- a3895b5 adding rst file
- f832737 removed "basic" , added rst update to log
- bc352a0 changing indentation in texts
- 7cc39da added file preamble, removed unnecessary space
- 0ba233c following more pep8 conventions
- e23a8a4 removing unnecessary comments
- 041a32e changing isinstance csr_matrix to issparse
- e7120f0 changed to hanging indentation
- 8a0950d changing main filename
- bd8bced changing module name in test
- bb5872b updating ipynb with main filename
- 777576e changed class name
- e50c3f9 changed file name
- e521269 fixing filename typo
- 51931fa adding html file
- 7ba30d6 deleting html file
- 82d1fdc vertical indentation fixes
- 4f3441e adding file to apiref.rst
Files changed

New file (Jupyter notebook tutorial, +325 lines), rendered below:
## Using wrappers for Scikit learn API

This tutorial shows how to use gensim models as part of your scikit-learn workflow, with the help of the wrappers found in `gensim.sklearn_integration.SklearnWrapperGensimLdaModel`.
The wrappers available (as of now) are:

* LdaModel (`gensim.sklearn_integration.SklearnWrapperGensimLdaModel.SklearnWrapperLdaModel`), which implements gensim's `LdaModel` in a scikit-learn interface
### LdaModel

To use LdaModel, begin by importing the LdaModel wrapper:
```python
from gensim.sklearn_integration.SklearnWrapperGensimLdaModel import SklearnWrapperLdaModel
```
Next we will create a dummy set of texts and convert it into a corpus:
```python
from gensim.corpora import Dictionary

texts = [['complier', 'system', 'computer'],
         ['eulerian', 'node', 'cycle', 'graph', 'tree', 'path'],
         ['graph', 'flow', 'network', 'graph'],
         ['loading', 'computer', 'system'],
         ['user', 'server', 'system'],
         ['tree', 'hamiltonian'],
         ['graph', 'trees'],
         ['computer', 'kernel', 'malfunction', 'computer'],
         ['server', 'system', 'computer']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
```
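Each entry of `corpus` is a bag of words: a list of `(token_id, count)` pairs. For instance (the exact ids depend on how the Dictionary assigned them):

```python
print corpus[0]  # e.g. [(0, 1), (1, 1), (2, 1)]: three tokens, each once
```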
Then run the LdaModel on it:
```python
model = SklearnWrapperLdaModel(num_topics=2, id2word=dictionary, iterations=20, random_state=1)
model.fit(corpus)
model.print_topics(2)
```

Output:

```
WARNING:gensim.models.ldamodel:too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy
[(0,
  u'0.164*"computer" + 0.117*"system" + 0.105*"graph" + 0.061*"server" + 0.057*"tree" + 0.046*"malfunction" + 0.045*"kernel" + 0.045*"complier" + 0.043*"loading" + 0.039*"hamiltonian"'),
 (1,
  u'0.102*"graph" + 0.083*"system" + 0.072*"tree" + 0.064*"server" + 0.059*"user" + 0.059*"computer" + 0.057*"trees" + 0.056*"eulerian" + 0.055*"node" + 0.052*"flow"')]
```
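The commit history for this PR also adds `transform` and `partial_fit` to the wrapper. A hedged sketch of how they might be called; the method names come from the commits ("adding transfrom and other get methods", "adding partial_fit"), but the exact signatures and return shapes are assumptions, not confirmed by this diff:

```python
# Assumed usage; method names are taken from the commit messages, while
# the signatures and return shapes here are guesses, not confirmed API.
unseen = [dictionary.doc2bow(['graph', 'tree', 'server'])]
print model.transform(unseen)  # assumed: per-document topic distribution
model.partial_fit(corpus)      # assumed: incremental update, sklearn-style
```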
### Integration with Sklearn

To provide a better example of how the wrapper can be used with sklearn, let's use sklearn's CountVectorizer. For this example we will use the [20 Newsgroups data set](http://qwone.com/~jason/20Newsgroups/), restricted to the categories rec.sport.baseball and sci.crypt, and generate topics from it.
```python
import numpy as np
from gensim import matutils
from gensim.models.ldamodel import LdaModel
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from gensim.sklearn_integration.SklearnWrapperGensimLdaModel import SklearnWrapperLdaModel
```
```python
rand = np.random.mtrand.RandomState(1)  # set seed for getting the same result
cats = ['rec.sport.baseball', 'sci.crypt']
data = fetch_20newsgroups(subset='train',
                          categories=cats,
                          shuffle=True)
```
Next, we use CountVectorizer to convert the collection of text documents into a matrix of token counts:
```python
vec = CountVectorizer(min_df=10, stop_words='english')

X = vec.fit_transform(data.data)
vocab = vec.get_feature_names()  # vocab to be converted to id2word

id2word = dict([(i, s) for i, s in enumerate(vocab)])
```
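As an aside (not part of the notebook), gensim models consume bag-of-words corpora rather than scipy sparse matrices, and gensim ships a converter for exactly this case. The equivalent manual conversion of `X` would be:

```python
# matutils.Sparse2Corpus treats columns as documents by default, so pass
# documents_columns=False here because the rows of X are the documents.
corpus_from_X = matutils.Sparse2Corpus(X, documents_columns=False)
```

The wrapper's `fit` appears to perform this conversion internally (see the commit "minor bug.solved, fit should convert X to corpus"), which is why `X` can be passed to it directly below.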
Next, we just need to fit `X` and `id2word` to our LDA wrapper:
```python
obj = SklearnWrapperLdaModel(id2word=id2word, num_topics=5, passes=20)
lda = obj.fit(X)
lda.print_topics()
```

Output:

```
[(0,
  u'0.018*"cryptography" + 0.018*"face" + 0.017*"fierkelab" + 0.008*"abuse" + 0.007*"constitutional" + 0.007*"collection" + 0.007*"finish" + 0.007*"150" + 0.007*"fast" + 0.006*"difference"'),
 (1,
  u'0.022*"corporate" + 0.022*"accurate" + 0.012*"chance" + 0.008*"decipher" + 0.008*"example" + 0.008*"basically" + 0.008*"dawson" + 0.008*"cases" + 0.008*"consideration" + 0.008*"follow"'),
 (2,
  u'0.034*"argue" + 0.031*"456" + 0.031*"arithmetic" + 0.024*"courtesy" + 0.020*"beastmaster" + 0.019*"bitnet" + 0.015*"false" + 0.015*"classified" + 0.014*"cubs" + 0.014*"digex"'),
 (3,
  u'0.108*"abroad" + 0.089*"asking" + 0.060*"cryptography" + 0.035*"certain" + 0.030*"ciphertext" + 0.030*"book" + 0.028*"69" + 0.028*"demand" + 0.028*"87" + 0.027*"cracking"'),
 (4,
  u'0.022*"clark" + 0.019*"authentication" + 0.017*"candidates" + 0.016*"decryption" + 0.015*"attempt" + 0.013*"creation" + 0.013*"1993apr5" + 0.013*"acceptable" + 0.013*"algorithms" + 0.013*"employer"')]
```
#### Using together with Scikit learn's Logistic Regression

Now let's try sklearn's logistic classifier to classify the documents into the two given categories. Ideally we should see positive weights when cryptography is talked about and negative weights when baseball is talked about.
```python
from sklearn import linear_model
```
```python
def print_features(clf, vocab, n=10):
    ''' Better printing for a sorted list of features '''
    coef = clf.coef_[0]
    print 'Positive features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[::-1][:n] if coef[j] > 0]))
    print 'Negative features: %s' % (' '.join(['%s:%.2f' % (vocab[j], coef[j]) for j in np.argsort(coef)[:n] if coef[j] < 0]))
```
```python
clf = linear_model.LogisticRegression(penalty='l1', C=0.1)  # l1 penalty used
clf.fit(X, data.target)
print_features(clf, vocab)
```

Output:

```
Positive features: clipper:1.50 code:1.24 key:1.04 encryption:0.95 chip:0.37 nsa:0.37 government:0.36 uk:0.36 org:0.23 cryptography:0.23
Negative features: baseball:-1.32 game:-0.71 year:-0.61 team:-0.38 edu:-0.27 games:-0.26 players:-0.23 ball:-0.17 season:-0.14 phillies:-0.11
```
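Because the wrapper follows scikit-learn's estimator API, it should also compose with sklearn's `Pipeline`, so that LDA topic weights rather than raw counts become the classifier features. A hedged sketch, not part of the notebook, assuming the wrapper's `transform` returns an array-like of per-document topic weights:

```python
from sklearn import linear_model
from sklearn.pipeline import Pipeline

# Hypothetical composition; assumes SklearnWrapperLdaModel.transform
# yields per-document topic weights usable as downstream features.
pipe = Pipeline([
    ('lda', SklearnWrapperLdaModel(id2word=id2word, num_topics=5, passes=20)),
    ('clf', linear_model.LogisticRegression(penalty='l1', C=0.1)),
])
pipe.fit(X, data.target)
print pipe.score(X, data.target)  # mean accuracy on the training split
```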
(The notebook declares a Python 2.7 kernel, which is why the `print` statements above use Python 2 syntax.)
docs/src/sklearn_integration/SklearnWrapperGensimLdaModel.rst (new file, 9 additions, 0 deletions):
```rst
:mod:`sklearn_integration.SklearnWrapperGensimLdaModel.SklearnWrapperLdaModel` -- Scikit learn wrapper for Latent Dirichlet Allocation
=======================================================================================================================================

.. automodule:: gensim.sklearn_integration.SklearnWrapperGensimLdaModel.SklearnWrapperLdaModel
    :synopsis: Scikit learn wrapper for LDA model
    :members:
    :inherited-members:
    :undoc-members:
    :show-inheritance:
```
Review comment: Please add the examples to the ipynb from https://gist.github.com/AadityaJ/c98da3d01f76f068242c17b5e1593973. Remove the `Dummy` code from the gist and add that conversion code to your wrapper. You don't have to use pipeline syntax.
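For reference, a minimal sketch of the conversion code the reviewer asks to live inside the wrapper, pieced together from the commit messages ("fit should convert X to corpus", "changing isinstance csr_matrix to issparse"); the actual implementation in this PR may differ:

```python
from scipy.sparse import issparse
from gensim import matutils

def _as_corpus(X):
    # Accept either a scipy sparse document-term matrix (rows = documents)
    # or an iterable of gensim bag-of-words lists, and return the latter.
    if issparse(X):
        return matutils.Sparse2Corpus(X, documents_columns=False)
    return X
```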