[WIP] LDA tutorial, tips and tricks #779
Conversation
Could you link to it in
Will do @tmylk. Does that mean you think it's ok as it is? In that case I'll just clean it up one of these days to prepare it for merging.
@olavurmortensen When do you think this would be finished?
@tmylk Well, I thought you would have some comments. If you do not, then I think I can finish it tomorrow.
"\n",
"> **Note:**\n",
">\n",
"> This tutorial uses the scikit-learn and nltk libraries, although you can replace them with others if you want. Python 3 is used, although Python 2.7 can be used as well.\n",
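As a rough illustration of the kind of pre-processing the note refers to, here is a minimal, library-agnostic sketch using only the Python standard library; the nltk or scikit-learn tokenizers mentioned in the note would slot in at the same point. The stopword list here is a tiny made-up sample, not a real one:

```python
import re

# Illustrative stopword list only; use a real one (e.g. from nltk) in practice.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

def preprocess(doc):
    """Lowercase, tokenize on alphabetic runs, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

docs = ["The quick brown fox", "A survey of user interfaces"]
texts = [preprocess(d) for d in docs]
# texts == [['quick', 'brown', 'fox'], ['survey', 'user', 'interfaces']]
```

The resulting lists of tokens are the shape of input that gensim's `Dictionary.doc2bow` expects.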
Where is sklearn used? I cannot find it.
Briefly looked through it this morning; all I can add right now would be an explanation as to why 5 models trained with exactly the same input would have different output, e.g., perplexity, and perhaps an aside on how to achieve 1:1 training with the random state parameter. It seems like a common question.
@cscorley The 5 models are different because of random initialization (specifically, random initialization of some hyperparameters, e.g. gamma). But you bring up an important point that this should be explained in the tutorial, and maybe even set the random state just to make it more explicit.
I have updated the tutorial according to the comments, thanks @piskvorky and @cscorley. Also added a link in
I also changed the name because, bizarrely, someone posted a tutorial with the same name I was using just a week ago.
@tmylk and @piskvorky Will the tutorial appear on the RaRe blog?
@tmylk @piskvorky Ready for merging. The conflict is because I changed the
When it's merged I'll submit a blog post on WordPress for review as well.
The title should be changed to 'Pre-processing and training LDA'. The value of this tutorial is in explaining the pre-processing steps and the meaning of the LDA parameters. The model selection is not really covered here as deeply as in Topic Coherence or the 'America's Next Topic Model' blog post.
"source": [
"# LDA: training tips\n",
"\n",
"LDA is a probabilistic hierarchical Bayesian model that is a mixture model as well as a mixed membership model... but we won't be getting into any of that.\n",
The first sentence should say what this tutorial is. Here it is done in the Nth sentence - please move it to the very first line. It is ok to discuss what this tutorial is not, but it should be later
True. The first sentence is also a tad snarky. I just removed the first sentence. Does anything else need to be changed in that regard?
"In this tutorial I will show how to pre-process text and train LDA on it"
Better now?
"cell_type": "markdown",
"metadata": {},
"source": [
"We select the model with the lowest perplexity."
Why do you do that? You talk about topic coherence later, so it is confusing.
Yes, I agree this is not a good way of selecting a model at all.
I removed the sections about model selection.
Thanks
"cell_type": "markdown",
"metadata": {},
"source": [
"[pyLDAvis](https://pyldavis.readthedocs.io/en/latest/index.html) can be fun and useful. Include the code below in your notebook to visualize your topics with pyLDAvis.\n",
Please add actual pyLDAvis output to the notebook.
Rendering pyLDAvis output in the notebook completely messes up the scale of the notebook, so I'd rather not include it.
Either include the actual picture, or remove the code and link to a pyLDAvis tutorial. Code alone serves no purpose.
I removed the text about it. Come to think of it, since it's mentioned in other RaRe blogs, there isn't much need for it in this one.
…ntence, removed model selection based on perplexity.
… stuff about pyLDAvis.
@tmylk @piskvorky A tutorial on LDA sharing some of my experience, as requested.
@tmylk I'm sure you have some comments on it. Thought it would be easiest with a PR. It's still a work in progress, as reflected by the "TODO" list at the start of the tutorial.
Not exactly sure what you would like the tutorial to be, but I tried to explain what the goal of it was in the introduction.