Ideas Scrapyard
This page is kept only as a "history artifact"; many of the ideas it contains are no longer relevant.
Note: Consider integration with existing Python sLDA
Background: Supervised Latent Dirichlet Allocation (sLDA) [1] is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA) [2]. It can be used, for example, to predict the number of "Likes" a post will receive or the number of stars in a movie review.
In vanilla LDA, we treat the topic proportions for a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In Supervised Latent Dirichlet Allocation (sLDA), we add a target variable to the LDA model, for example the number of stars assigned in a movie review or the number of "Likes" of a post.
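The generative story above can be sketched in a few lines of NumPy. This is only an illustration, not part of any existing implementation: the function name, the parameter names (`alpha`, `eta`, `sigma`) and the toy sizes are assumptions, and the response is modelled as a Gaussian around a linear function of the document's empirical topic frequencies, which is one common choice for real-valued targets.

```python
import numpy as np

def generate_slda_document(rng, topics, alpha, eta, sigma, n_words):
    """Sample one document and its response from an sLDA-style generative process."""
    n_topics, vocab_size = topics.shape
    # Topic proportions for the document: a draw from a Dirichlet distribution.
    theta = rng.dirichlet(alpha)
    # One topic assignment per word position, drawn from those proportions.
    z = rng.choice(n_topics, size=n_words, p=theta)
    # Each word is drawn from the topic it was assigned to.
    words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
    # Empirical topic frequencies of this document.
    z_bar = np.bincount(z, minlength=n_topics) / n_words
    # Target variable (assumed Gaussian here), e.g. a star rating.
    y = rng.normal(eta @ z_bar, sigma)
    return words, z, y

rng = np.random.default_rng(0)
topics = rng.dirichlet(np.ones(6), size=3)  # 3 toy topics over a 6-word vocabulary
words, z, y = generate_slda_document(
    rng, topics, alpha=np.full(3, 0.5), eta=np.array([1.0, 3.0, 5.0]),
    sigma=0.1, n_words=20,
)
```

Fitting sLDA means inverting this process: given the words and the response, infer the topics and the regression weights.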
While academic implementations of sLDA exist in C++ and R [3, 4], there is no Python implementation available. You will contribute a scalable implementation of sLDA to the Python data science world. A quality implementation will be widely used in industry.
RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].
Goals
- Demonstrate understanding of topic modeling theory and practice by describing, implementing and evaluating sLDA.
- Implement a streamed sLDA that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high performance computing. Optionally implement a version that can use multiple cores on the same machine.
- Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).
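The constant-memory requirement in the goals above comes down to never materializing the full training set. A minimal sketch of mini-batch iteration over a stream follows; `iter_minibatches` and `labelled_documents` are hypothetical names used only for illustration, and the model update itself is left as a comment.

```python
from itertools import islice

def iter_minibatches(stream, batch_size):
    """Yield lists of at most `batch_size` items from any iterable.

    Only one batch is held in memory at a time, regardless of stream length.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def labelled_documents(n_docs):
    """Hypothetical stand-in for (bag-of-words, target) pairs streamed from disk."""
    for i in range(n_docs):
        yield [(i % 5, 1.0)], float(i % 3)

batch_sizes = []
for batch in iter_minibatches(labelled_documents(10), batch_size=4):
    # An online model would perform one incremental update per batch here.
    batch_sizes.append(len(batch))
```

Because the generator is consumed lazily, peak memory depends on `batch_size` rather than on the number of documents.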
Deliverables
- Code: a pull request against gensim [5, 6] on GitHub [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested and well-documented industry-strength implementation, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.
- Report: timings, memory use and accuracy of your sLDA implementation on the Cornell Movie Review Corpus [8], following the same methodology as in [1]. Include a summary of insights into parameter selection and tuning of sLDA.
Resources:
[3] sLDA implementation in C++
[4] Implementation of sLDA in R
[7] Gensim on github
[8] Movie Review Dataset from Cornell NLP group
Background: gensim implements fast routines for similarity retrieval ("give me documents similar to this one, using Latent Semantic Analysis"). The routines can make use of multiple cores (using BLAS), but not multiple machines. For large datasets, it is desirable to store shards in a distributed manner, across a cluster of computers. During querying, collect and merge results from all shards.
To do: Extend the sharding already present in gensim, so that different shards can reside on different computers. Design an API to make the concept of "shards" flexible, so that similarity classes that use different implementations (see k-NN above) can plug into it easily.
The network communication must use a fast protocol (Pyro? ØMQ?), so as to not increase query latency too much.
Resources: gensim mailing list.
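The "collect and merge" step at query time can be sketched independently of the transport layer. In the sketch below each shard is assumed to return its own best hits as `(document_id, similarity)` pairs; the function name and the result format are illustrative assumptions, not gensim's API.

```python
import heapq
from itertools import chain

def merge_shard_results(shard_results, topn):
    """Merge per-shard (doc_id, similarity) lists into the global top-n hits."""
    all_hits = chain.from_iterable(shard_results)
    return heapq.nlargest(topn, all_hits, key=lambda hit: hit[1])

# Toy results as they might arrive from two shards on different machines.
shard_a = [("a1", 0.91), ("a2", 0.40)]
shard_b = [("b1", 0.87), ("b2", 0.95)]
top = merge_shard_results([shard_a, shard_b], topn=3)
```

Each shard only needs to send back its local top-n, so the amount of data crossing the network stays small even when shards are large.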
Background: Dynamic topic models are generative models that can be used to analyze the evolution of (unobserved) topics of a collection of documents over time. This family of models was proposed by David Blei and John Lafferty and is an extension to Latent Dirichlet Allocation (LDA) that can handle sequential documents. DTM has already been implemented in Gensim by Google Summer of Code student @bhargavvader.
To do: Implement a distributed cluster version of DTM, and a version that can use multiple cores on the same machine. Implement DIM (the Document Influence Model) in gensim and evaluate it.
Implementation must accept data in stream format (sequence of document vectors). It can use NumPy/SciPy as building blocks, pushing as much number crunching into low-level (ideally, BLAS) routines as possible.
We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.
Gensim doesn't include any support for "timed streams", or time tags, at the moment. So part of this project will be engineering a clean API for this new functionality.
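One possible shape for such a "timed stream" is a sequence of `(timestamp, document_vector)` pairs sorted by time, from which time slices are derived on the fly. The sketch below is purely hypothetical: the pair format, `slice_counts` and the `slice_of` callback are assumptions made for illustration.

```python
from itertools import groupby

def slice_counts(timed_docs, slice_of):
    """Count documents per time slice in a stream already sorted by timestamp.

    `slice_of` maps a timestamp to a slice label (a year, a week number, ...).
    """
    grouped = groupby(timed_docs, key=lambda timed_doc: slice_of(timed_doc[0]))
    return [sum(1 for _ in group) for _, group in grouped]

# Toy stream: (year, bag-of-words vector) pairs in chronological order.
timed_docs = [
    (2001, [(0, 1.0)]),
    (2001, [(1, 2.0)]),
    (2002, [(0, 1.0), (2, 1.0)]),
    (2004, [(3, 1.0)]),
]
counts = slice_counts(timed_docs, slice_of=lambda year: year)
```

Keeping the slicing rule as a callback lets the same stream be cut by year, month or any other granularity without touching the documents themselves.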
Resources: Dynamic Topic Models.
Original Blei & Lafferty article PDF.
Wang, Blei & Heckerman article on Continuous Time Dynamic Topic Model PDF.
Wang & McCallum: "Topics over time" PDF.
Academic implementation of DTM on David Blei's page.
Gensim implementation of DTM.
Background: Paisley, Wang, Blei and Jordan recently developed a stochastic variational version of nested HDP. It reportedly performs better than HDP etc. (of course!).
To do: Implement this model (probably extending / replacing the existing online HDP implementation in gensim) and evaluate it. Optionally also implement a distributed cluster version, or a version that can use multiple cores on the same machine.
Implementation must accept data in stream format (sequence of document vectors), to allow large inputs.
We aim for robust, industry-strength implementations in gensim, not flimsy academic code. Check corner cases, summarize insights into documentation tips and examples.
Resources: "Nested Hierarchical Dirichlet Processes" by John Paisley, Chong Wang, David M. Blei and Michael I. Jordan PDF.
Background: Li and McCallum developed a hierarchical LDA-like model for document classification. They report 2-5% accuracy improvements over an LDA model on a test corpus. (http://people.cs.umass.edu/~mccallum/papers/pam-icml06.pdf)
An implementation of this model may provide additional alternatives in choice of model, which in some cases may be helpful.
An implementation must be heavily unit tested and production-ready. It would use many of the same classes and methods as LDA, which is a bonus in terms of a first pass at implementation.
Resources: Blei, D., Griffiths, T., Jordan, M., & Tenenbaum, J. (2004). Hierarchical topic models and the nested Chinese restaurant process. NIPS.
Blei, D., Ng, A., & Jordan, M. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Diggle, P. J., & Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society B, 46, 193–227.
Li, W., Blei, D., & McCallum, A. (2007). Nonparametric Bayes pachinko allocation.
Li, W., & McCallum, A. (2006). Pachinko allocation: DAG-structured mixture models of topic correlations. ICML.
Minka, T. (2000). Estimating a Dirichlet distribution.
Rosen-Zvi, M., Griffiths, T., Steyvers, M., & Smyth, P. (2004). The author-topic model for authors and documents. UAI.
Wallach, H. M. (2006). Topic modeling: beyond bag-of-words. ICML.
Integrate, or rewrite in an optimized way, the GloVe word-embedding code by Maciej Kula (https://github.com/maciejkula/glove-python). The next step would be adding support for the Swivel algorithm.
Much better performance than the current variational inference approach to fitting LDA.
Either implement in Python or find a way to load the model trained on Spark.
Shows how good your word2vec model is on specific syntactic and semantic tasks. A wrapper around this code: https://github.com/ytsvetko/qvec
https://research.googleblog.com/2016/08/text-summarization-with-tensorflow.html
Translate from R into Python using existing Gensim code. Medium difficulty.
From gensim issue suggestion: "Hi, it seems that wordspace model is very useful (http://infomap-nlp.sourceforge.net/doc/algorithm.html and https://cran.r-project.org/web/packages/wordspace/index.html). It is similar to the lsa model except that wordspace decomposes a co-occurrence matrix instead of term-document matrix."
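The difference from LSA described in the quote can be illustrated with plain NumPy: build a word-by-word co-occurrence matrix from a sliding window, then factorize it with an SVD. This is a toy sketch under simple assumptions (symmetric window, raw counts, dense matrix), not the wordspace or Infomap algorithm itself; the function names are made up for the example.

```python
import numpy as np

def cooccurrence_matrix(sentences, window):
    """Count how often each pair of words appears within `window` positions."""
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[sentence[j]]] += 1
    return vocab, counts

def embed(counts, dim):
    """Project each word onto the top `dim` singular directions of the matrix."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return u[:, :dim] * s[:dim]

sentences = [["cat", "sat", "mat"], ["dog", "sat", "mat"]]
vocab, counts = cooccurrence_matrix(sentences, window=1)
vectors = embed(counts, dim=2)
```

In this toy corpus "cat" and "dog" occur in identical contexts, so their rows in the co-occurrence matrix, and therefore their vectors, coincide. A real implementation would use sparse matrices and a streamed, truncated decomposition.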
A sense embedding is able to learn multiple representations per word capturing different word meanings.
Integrate one of the existing word sense embeddings into gensim. Adagram is the best one currently.
Low priority, as it rarely appears in production.
Consider:
Change HashDictionary to use cuckoo hashing.
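For readers unfamiliar with the technique, here is a minimal, self-contained sketch of cuckoo hashing: every key has one candidate slot in each of two tables, lookups check only those two slots, and an insert into an occupied slot evicts the occupant to its slot in the other table. The class below is illustrative only and says nothing about how HashDictionary is or should be structured; the choice of `zlib.crc32` and `zlib.adler32` as the two hash functions is an assumption made to keep the example deterministic.

```python
import zlib

class CuckooTable:
    """Toy cuckoo hash table mapping string keys to values."""

    def __init__(self, capacity=8, max_kicks=32):
        self.capacity = capacity
        self.max_kicks = max_kicks
        self.slots = [[None] * capacity, [None] * capacity]

    def _index(self, which, key):
        data = key.encode("utf8")
        digest = zlib.crc32(data) if which == 0 else zlib.adler32(data)
        return digest % self.capacity

    def get(self, key, default=None):
        # A key can only live in one of its two candidate slots.
        for which in (0, 1):
            entry = self.slots[which][self._index(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return default

    def put(self, key, value):
        # Overwrite the value if the key is already stored.
        for which in (0, 1):
            i = self._index(which, key)
            entry = self.slots[which][i]
            if entry is not None and entry[0] == key:
                self.slots[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._index(which, entry[0])
            evicted = self.slots[which][i]
            self.slots[which][i] = entry
            if evicted is None:
                return
            # The evicted entry moves to its slot in the other table.
            entry, which = evicted, 1 - which
        # Too many evictions: rebuild with twice the capacity.
        self._grow(entry)

    def _grow(self, pending):
        entries = [e for table in self.slots for e in table if e is not None]
        entries.append(pending)
        self.capacity *= 2
        self.slots = [[None] * self.capacity, [None] * self.capacity]
        for key, value in entries:
            self.put(key, value)

table = CuckooTable()
for i in range(100):
    table.put("token%d" % i, i)
```

The appeal for a dictionary-like structure is the worst-case lookup cost of two probes; the price is occasional rebuilds when an insert cannot find a free slot.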
Hat-tip to A. Mueller
Bidirectional LSTM Recurrent Neural Network" paper
See paper
Slight modification of word2vec for the purpose of sponsored advertising. See this paper "Joint Embedding of Query and Ad by Leveraging Implicit Feedback"
See https://github.com/ruiEnca/ohDoclus
See code and paper at https://github.com/slanglab/phrasemachine
See how it works, compare to existing techniques, maybe get in touch for inclusion / robust reimplementation in gensim.
Code from 2014: https://github.com/cangermueller/vbmfa
Implement in TensorFlow. Reproduce this paper or try another GAN model.
See discussion in https://github.com/RaRe-Technologies/gensim/issues/1135#issuecomment-277529491