##The DIY Guide to Gensim
Please make Pull Requests for good resources, or create Issues for any feedback! Thanks!
Gensim is a high-performance Python library for NLP projects. It is arguably the most popular library for Word2Vec and Doc2Vec, and it also provides other NLP tools such as LDA, LSI and Random Projection.
Install it with `pip install gensim` (or, on older setups, `easy_install gensim`).
###Hello World
Just a simple code-based intro; theory is covered in the next section.
#####Text to Vectors
- We first need to transform text to vectors
- String to vectors tutorial
- Create a dictionary first that maps words to ids
- Transform each tokenized document into a sparse bag-of-words vector with `dictionary.doc2bow(text)`
- Corpus streaming tutorial (for very large corpora)
#####Models and Transformation
- Models (e.g. LsiModel, Word2Vec) are built / trained from a corpus
- Transformation interface tutorial
#####TF-IDF (Model)
- Docs, Source
- By default, tf-idf vectors are normalized to unit length (the sum of squared scores in each document is 1)
#####Phrases (Model)
- Detects words that belong in a phrase and joins them into a single token, useful for models like Word2Vec ("new", "york" -> "new_york" with the default delimiter)
- Docs, Source (uses bigram detectors underneath)
- Phrases example on How I Met Your Mother
#####LSI (Model)
- Docs, Source (very standard LSI implementation)
- How to interpret negative LSI values
- Random Projection (used as an option to speed up LSI)
#####LDA (Model)
#####Word2Vec (Model)
- Docs, Source (very simple interface)
- Simple word2vec tutorial (examples of `most_similar`, `similarity`, `doesnt_match`)
#####Doc2Vec (Model)
- Docs, Source (Docs are not very good)
- Doc2Vec requires a non-standard corpus: each document needs at least one tag (a unique ID, or a label such as its sentiment)
- Great illustration of corpus preparation, Code (Alternative, Alternative 2)
- Doc2Vec on customer review (example)
###Theory
#####TFIDF
#####LSI
#####LDA
#####Word2Vec
#####Doc2Vec
- Paper
- [contribution needed: good resources that explain Doc2Vec]
###Advanced Features
#####Query Similarities
- Tool to get the most similar documents for LDA, LSI
- Similarity queries tutorial
#####Distributed Computing
- Run LSI and LDA on many computers
- Distributed computing tutorial
#####Similarity Server
- Elasticsearch-like server for document similarity, computed with LSI and LDA
- Similarity server tutorial
###Super Short Feedback Survey (Pretty please!)