Additional documentation fixes #2121

Merged (1 commit) on Jul 31, 2018

100 changes: 100 additions & 0 deletions docs/src/_index.rst.unused
@@ -0,0 +1,100 @@

:github_url: https://github.com/RaRe-Technologies/gensim

Gensim documentation
====================

============
Introduction
============

Gensim is a free Python library designed to automatically extract semantic
topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim is designed to process raw, unstructured digital texts ("plain text").

The algorithms in Gensim, such as **Word2Vec**, **FastText**, **Latent Semantic Analysis**, **Latent Dirichlet Allocation** and **Random Projections**, discover the semantic structure of documents by examining statistical co-occurrence patterns within a corpus of training documents. These algorithms are **unsupervised**, which means no human input is necessary -- you only need a corpus of plain text documents.

Once these statistical patterns are found, any plain text documents can be succinctly
expressed in the new, semantic representation and queried for topical similarity
against other documents, words or phrases.
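
For example, once a dictionary and a TF-IDF model have been built, a similarity query
takes only a few lines. A minimal illustrative sketch, using the toy ``common_texts``
corpus from ``gensim.test.utils``:

>>> from gensim import corpora, models, similarities
>>> from gensim.test.utils import common_texts
>>>
>>> dictionary = corpora.Dictionary(common_texts)  # map each token to an integer id
>>> bow_corpus = [dictionary.doc2bow(text) for text in common_texts]  # bag-of-words vectors
>>> tfidf = models.TfidfModel(bow_corpus)  # fit a TF-IDF weighting on the corpus
>>> index = similarities.MatrixSimilarity(tfidf[bow_corpus], num_features=len(dictionary))
>>>
>>> query = dictionary.doc2bow("human computer interaction".split())
>>> sims = index[tfidf[query]]  # cosine similarity of the query against every document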

.. note::
If the previous paragraphs left you confused, you can read more about the `Vector
Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
document analysis <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_ on Wikipedia.


.. _design:

Features
--------

* **Memory independence** -- there is no need for the whole training corpus to
reside fully in RAM at any one time (can process large, web-scale corpora).
* **Memory sharing** -- trained models can be persisted to disk and loaded back via mmap. Multiple processes can share the same data, cutting down RAM footprint.
* Efficient implementations of several popular vector space algorithms,
  including Word2Vec, Doc2Vec, FastText, TF-IDF, Latent Semantic Analysis (LSI/LSA),
  Latent Dirichlet Allocation (LDA) and Random Projections.
* I/O wrappers and readers for several popular data formats.
* Fast similarity queries for documents in their semantic representation.

The **principal design objectives** behind Gensim are:

1. Straightforward interfaces and low API learning curve for developers. Good for prototyping.
2. Memory independence with respect to the size of the input corpus; all intermediate
   steps and algorithms operate in a streaming fashion, accessing one document
   at a time (see the sketch below).
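
As an illustration of the streaming design, a corpus can be any object that yields one
document at a time, so even data too large for RAM can be read from disk on the fly.
A minimal sketch (the file ``mycorpus.txt``, one document per line, is hypothetical):

>>> from gensim import corpora
>>>
>>> class StreamedCorpus(object):
...     """Yield one bag-of-words document at a time; the corpus never sits in RAM."""
...     def __init__(self, path, dictionary):
...         self.path = path
...         self.dictionary = dictionary
...     def __iter__(self):
...         with open(self.path) as infile:
...             for line in infile:  # one document per line
...                 yield self.dictionary.doc2bow(line.lower().split())
>>>
>>> corpus = StreamedCorpus('mycorpus.txt', dictionary)  # 'dictionary' is a corpora.Dictionary built beforehand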

.. seealso::

    We built a high-performance server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai.
    ScaleText is a commercial product, available both on-prem and as SaaS.
    Reach out at [email protected] if you need an industry-grade tool with professional support.

.. _availability:

Availability
------------

Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license <http://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html>`_ and can be downloaded either from its `GitHub repository <https://github.com/piskvorky/gensim/>`_ or from the `Python Package Index <http://pypi.python.org/pypi/gensim>`_.

.. seealso::

See the :doc:`install <install>` page for more info on Gensim deployment.


.. toctree::
:glob:
:maxdepth: 1
:caption: Getting started

install
intro
support
about
license
citing


.. toctree::
:maxdepth: 1
:caption: Tutorials

tutorial
tut1
tut2
tut3


.. toctree::
:maxdepth: 1
:caption: API Reference

apiref

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
26 changes: 26 additions & 0 deletions docs/src/_license.rst.unused
@@ -0,0 +1,26 @@
:orphan:

.. _license:

Licensing
---------

Gensim is licensed under the OSI-approved `GNU LGPLv2.1 license <http://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html>`_.

This means that it's free for both personal and commercial use, but if you make any
modifications to Gensim and distribute them to other people, you have to disclose
the source code of those modifications.

Apart from that, you are free to redistribute Gensim in any way you like, though you're
not allowed to modify its license (doh!).

My intent here is to **get more help and community involvement** with the development of Gensim.
The legalese is therefore less important to me than your input and contributions.

`Contact me <mailto:[email protected]>`_ if the LGPL doesn't fit the bill and you'd like its restrictions lifted.

.. seealso::

    We built a high-performance server for NLP, document analysis, indexing, search and clustering: https://scaletext.ai.
    ScaleText is a commercial product, available both on-prem and as SaaS.
    Reach out at [email protected] if you need an industry-grade tool with professional support.
14 changes: 7 additions & 7 deletions gensim/models/doc2vec.py
@@ -20,21 +20,21 @@
<https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-lee.ipynb>`_.

**Make sure you have a C compiler before installing Gensim, to use the optimized doc2vec routines** (70x speedup
-compared to plain NumPy implementation <https://rare-technologies.com/parallelizing-word2vec-in-python/>`_).
+compared to plain NumPy implementation, https://rare-technologies.com/parallelizing-word2vec-in-python/).


-Examples
---------
+Usage examples
+==============

-Initialize & train a model
+Initialize & train a model:

>>> from gensim.test.utils import common_texts
>>> from gensim.models.doc2vec import Doc2Vec, TaggedDocument
>>>
>>> documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(common_texts)]
>>> model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

-Persist a model to disk
+Persist a model to disk:

>>> from gensim.test.utils import get_tmpfile
>>>
@@ -43,11 +43,11 @@
>>> model.save(fname)
>>> model = Doc2Vec.load(fname) # you can continue training with the loaded model!

-If you're finished training a model (=no more updates, only querying, reduce memory usage), you can do
+If you're finished training a model (=no more updates, only querying, reduce memory usage), you can do:

>>> model.delete_temporary_training_data(keep_doctags_vectors=True, keep_inference=True)

-Infer vector for new document
+Infer vector for a new document:

>>> vector = model.infer_vector(["system", "response"])
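
The inferred vector can then be compared against the document vectors learned during
training. A sketch, not part of this diff (``most_similar`` here accepts raw vectors
in its ``positive`` list):

>>> similar_docs = model.docvecs.most_similar(positive=[vector], topn=3)  # (tag, cosine similarity) pairs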

13 changes: 7 additions & 6 deletions gensim/models/fasttext.py
@@ -13,6 +13,7 @@

This module contains a fast native C implementation of Fasttext with Python interfaces. It is **not** only a wrapper
around Facebook's implementation.
+
For a tutorial see `this notebook
<https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb>`_.

@@ -22,14 +23,14 @@
Usage examples
--------------

-Initialize and train a model
+Initialize and train a model:

>>> from gensim.test.utils import common_texts
>>> from gensim.models import FastText
>>>
>>> model = FastText(common_texts, size=4, window=3, min_count=1, iter=10)

-Persist a model to disk with
+Persist a model to disk with:

>>> from gensim.test.utils import get_tmpfile
>>>
@@ -38,7 +39,7 @@
>>> model.save(fname)
>>> model = FastText.load(fname) # you can continue training with the loaded model!

-Retrieve word-vector for vocab and out-of-vocab word
+Retrieve word-vector for vocab and out-of-vocab word:

>>> existent_word = "computer"
>>> existent_word in model.wv.vocab
@@ -50,7 +51,7 @@
False
>>> oov_vec = model.wv[oov_word] # numpy vector for OOV word
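
Because FastText builds word vectors from character n-grams, the OOV vector above is
assembled from the word's subwords, so it can be compared with in-vocabulary words like
any other vector. A sketch, not part of this diff:

>>> sim_to_known = model.wv.similarity(existent_word, oov_word)  # cosine similarity via subword n-grams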

-You can perform various NLP word tasks with the model, some of them are already built-in
+You can perform various NLP word tasks with the model; some of them are already built-in:

>>> similarities = model.wv.most_similar(positive=['computer', 'human'], negative=['interface'])
>>> most_similar = similarities[0]
@@ -62,13 +63,13 @@
>>>
>>> sim_score = model.wv.similarity('computer', 'human')

-Correlation with human opinion on word similarity
+Correlation with human opinion on word similarity:

>>> from gensim.test.utils import datapath
>>>
>>> similarities = model.wv.evaluate_word_pairs(datapath('wordsim353.tsv'))

-And on word analogies
+And on word analogies:

>>> analogies_result = model.wv.accuracy(datapath('questions-words.txt'))

13 changes: 7 additions & 6 deletions gensim/models/word2vec.py
@@ -27,12 +27,12 @@
visit https://rare-technologies.com/word2vec-tutorial/.

**Make sure you have a C compiler before installing Gensim, to use the optimized word2vec routines**
-(70x speedup compared to plain NumPy implementation, https://rare-technologies.com/parallelizing-word2vec-in-python/.
+(70x speedup compared to plain NumPy implementation, https://rare-technologies.com/parallelizing-word2vec-in-python/).

Usage examples
==============

-Initialize a model with e.g.
+Initialize a model with e.g.:

>>> from gensim.test.utils import common_texts, get_tmpfile
>>> from gensim.models import Word2Vec
@@ -45,13 +45,13 @@
The training is streamed, meaning `sentences` can be a generator, reading input data
from disk on-the-fly, without loading the entire corpus into RAM.
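
For instance, ``sentences`` may be a class whose ``__iter__`` re-reads a tokenized file
on each training pass (a plain generator would be exhausted after the first pass).
A sketch, not part of this diff; the path ``my_corpus.txt`` is hypothetical:

>>> class MySentences(object):
...     def __iter__(self):
...         for line in open('my_corpus.txt'):  # one whitespace-tokenized sentence per line
...             yield line.lower().split()
>>>
>>> streamed_model = Word2Vec(MySentences(), size=100, window=5, min_count=5, workers=4)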

-It also means you can continue training the model later
+It also means you can continue training the model later:

>>> model = Word2Vec.load("word2vec.model")
>>> model.train([["hello", "world"]], total_examples=1, epochs=1)
(0, 2)

-The trained word vectors are stored in a :class:`~gensim.models.KeyedVectors` instance in `model.wv`:
+The trained word vectors are stored in a :class:`~gensim.models.keyedvectors.KeyedVectors` instance in `model.wv`:

>>> vector = model.wv['computer'] # numpy vector of a word

@@ -68,7 +68,8 @@
>>> wv = KeyedVectors.load("model.wv", mmap='r')
>>> vector = wv['computer'] # numpy vector of a word

-Gensim can also load word vectors in the "word2vec C format", as this :class:`~gensim.models.KeyedVectors` instance::
+Gensim can also load word vectors in the "word2vec C format", as a
+:class:`~gensim.models.keyedvectors.KeyedVectors` instance::

>>> from gensim.test.utils import datapath
>>>
@@ -84,7 +85,7 @@
are already built-in - you can see it in :mod:`gensim.models.keyedvectors`.

If you're finished training a model (i.e. no more updates, only querying),
-you can switch to the :class:`~gensim.models.KeyedVectors` instance
+you can switch to the :class:`~gensim.models.keyedvectors.KeyedVectors` instance:

>>> word_vectors = model.wv
>>> del model