big trailing whitespace cleanup #6

Merged · 1 commit · Mar 1, 2011
2 changes: 1 addition & 1 deletion CHANGELOG.txt
@@ -52,7 +52,7 @@ Changes
0.6.0

* added option for online LSI training (yay!). the transformation can now be
used after any amount of training, and training can be continued at any time
with more data.
* optimized the tf-idf transformation, so that it is a strictly one-pass algorithm in all cases (thx to Brian Merrell).
* fixed Windows-specific bug in handling binary files (thx to Sutee Sudprasert)
14 changes: 7 additions & 7 deletions MANIFEST.in
@@ -1,7 +1,7 @@
recursive-include docs *
recursive-include src/gensim/test testcorpus*
recursive-include src *.sh
prune docs/src*
include COPYING
include COPYING.LESSER
include ez_setup.py
12 changes: 6 additions & 6 deletions README.txt
@@ -4,7 +4,7 @@ gensim -- Python Framework for Topic Modelling



Gensim is a Python library for *Vector Space Modelling* with very large corpora.
Target audience is the *Natural Language Processing* (NLP) community.


@@ -17,14 +17,14 @@ Features
* easy to plug in your own input corpus/datastream (trivial streaming API)
* easy to extend with other Vector Space algorithms (trivial transformation API)

* Efficient implementations of popular algorithms, such as online **Latent Semantic Analysis**,
**Latent Dirichlet Allocation** or **Random Projections**
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers.
* Extensive `HTML documentation and tutorials <http://nlp.fi.muni.cz/projekty/gensim/>`_.


If this feature list left you scratching your head, you can first read more about the `Vector
Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
document analysis <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_ on Wikipedia.

Installation
@@ -37,14 +37,14 @@ The simple way to install `gensim` is::

sudo easy_install gensim

Or, if you have instead downloaded and unzipped the `source tar.gz <http://pypi.python.org/pypi/gensim>`_ package,
you'll need to run::

python setup.py test
sudo python setup.py install


For alternative modes of installation (without root privileges, development
installation, optional install features), see the `documentation <http://nlp.fi.muni.cz/projekty/gensim/install.html>`_.

This version has been tested under Python 2.5 and 2.6, but should run on any 2.5 <= Python < 3.0.
1 change: 0 additions & 1 deletion docs/_sources/apiref.txt
@@ -28,4 +28,3 @@ Modules:
models/lda_worker
similarities/docsim


12 changes: 6 additions & 6 deletions docs/_sources/dist_lda.txt
@@ -12,25 +12,25 @@ Setting up the cluster
_______________________

See the tutorial on :doc:`dist_lsi`; setting up a cluster for LDA is completely
analogous, except you want to run `lda_worker` and `lda_dispatcher` scripts instead
of `lsi_worker` and `lsi_dispatcher`.

Running LDA
____________

Run LDA like you normally would, but turn on the `distributed=True` constructor
parameter::

>>> # extract 100 LDA topics, using default parameters
>>> lda = LdaModel(corpus=mm, id2word=id2word, numTopics=100, distributed=True)
using distributed version with 4 workers
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3199665 documets, updating model once every 40000 documents
..
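
Put together, a minimal end-to-end session might look like the sketch below. It assumes
you already have the `wiki_en_wordids.txt` and `wiki_en_tfidf.mm` files produced by the
:doc:`wiki` tutorial (those file names are assumptions carried over from that tutorial,
not something this page creates)::

    >>> import logging, gensim
    >>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    >>>
    >>> # load the id->word mapping and the corpus iterator prepared in the wiki tutorial
    >>> id2word = gensim.corpora.wikicorpus.WikiCorpus.loadDictionary('wiki_en_wordids.txt')
    >>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
    >>>
    >>> # extract 100 LDA topics, farming the work out to all discovered workers
    >>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, numTopics=100, distributed=True)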


In serial mode (no distribution), creating this online LDA :doc:`model of Wikipedia <wiki>`
takes 10h56m on my laptop (OS X, C2D 2.53GHz, 4GB RAM with `vecLib`).
In distributed mode with four workers (Linux, 2GHz Xeons, 4GB RAM
with `ATLAS <http://math-atlas.sourceforge.net/>`_), the wallclock time taken drops to 3h20m.

To run standard batch LDA (no online updates of mini-batches) instead, you would similarly
@@ -73,7 +73,7 @@ and then, some two days later::
topic #17: 0.027*book + 0.021*published + 0.020*books + 0.014*isbn + 0.010*author + 0.010*magazine + 0.009*press + 0.009*novel + 0.009*writers + 0.008*story
topic #18: 0.027*football + 0.024*players + 0.023*cup + 0.019*club + 0.017*fc + 0.017*footballers + 0.017*league + 0.011*season + 0.007*teams + 0.007*goals
topic #19: 0.032*band + 0.024*album + 0.014*albums + 0.013*guitar + 0.013*rock + 0.011*records + 0.011*vocals + 0.009*live + 0.008*bass + 0.008*track



If you used the distributed LDA implementation in `gensim`, please let me know (my
44 changes: 22 additions & 22 deletions docs/_sources/dist_lsi.txt
@@ -11,7 +11,7 @@ Distributed Latent Semantic Analysis
Setting up the cluster
_______________________

We will show how to run distributed Latent Semantic Analysis by means of an example.
Let's say we have 5 computers at our disposal, all in the same broadcast domain.
To start with, install `gensim` and `Pyro` on each one of them with::

@@ -21,41 +21,41 @@ and run Pyro's name server on exactly *one* of the machines (doesn't matter whic

$ python -m Pyro.naming &

Let's say our example cluster consists of dual-core computers with loads of
memory. We will therefore run **two** worker scripts on four of the physical machines,
creating **eight** logical worker nodes::

$ python -m gensim.models.lsi_worker &

This will execute `gensim`'s `lsi_worker.py` script (to be run twice on each of the
four computers).
This lets `gensim` know that it can run two jobs on each of the four computers in
parallel, so that the computation will be done faster, while also taking up twice
as much memory on each machine.

Next, pick one computer that will be a job scheduler in charge of worker
synchronization, and on it, run `LSA dispatcher`. In our example, we will use the
fifth computer to act as the dispatcher and from there run::

$ python -m gensim.models.lsi_dispatcher &

In general, the dispatcher can be run on the same machine as one of the worker nodes, or it
can be another, distinct computer within the same broadcast domain. The dispatcher
won't be doing much with CPU most of the time, but pick a computer with ample memory.

And that's it! The cluster is set up and running, ready to accept jobs. To remove
a worker later on, simply terminate its `lsi_worker` process. To add another worker, run another
`lsi_worker` (this will not affect a computation that is already running). If you terminate
`lsi_dispatcher`, you won't be able to run computations until you run it again
(surviving workers can be re-used though).


Running LSA
____________

So let's test our setup and run one computation of distributed LSA. Open a Python
shell on one of the five machines (again, this can be done on any computer
in the same `broadcast domain <http://en.wikipedia.org/wiki/Broadcast_domain>`_,
our choice is incidental) and try::

>>> from gensim import corpora, models, utils
@@ -81,13 +81,13 @@ To check the LSA results, let's print the first two latent topics::
topic #1(2.542): -0.623*"graph" + -0.490*"trees" + -0.451*"minors" + -0.274*"survey" + 0.167*"system"

Success! But a corpus of nine documents is no challenge for our powerful cluster...
In fact, we had to lower the job size (`chunks` parameter above) to a single document
at a time, otherwise all documents would be processed by a single worker all at once.

So let's run LSA on **one million documents** instead::

>>> # inflate the corpus to 1M documents, by repeating its documents over&over
>>> corpus1m = utils.RepeatCorpus(corpus, 1000000)
>>> # run distributed LSA on 1 million documents
>>> lsi1m = models.LsiModel(corpus1m, id2word=id2word, numTopics=200, chunks=10000, distributed=True)
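
For intuition, `utils.RepeatCorpus` is just a streamed wrapper that re-yields the base
corpus until the requested number of documents has been emitted; conceptually it behaves
like this simplified sketch (illustrative only, not the actual implementation)::

    class RepeatCorpus(object):
        """Repeat `corpus` over and over, `reps` documents in total."""
        def __init__(self, corpus, reps):
            self.corpus = corpus
            self.reps = reps

        def __iter__(self):
            emitted = 0
            while emitted < self.reps:  # keep cycling through the base corpus
                for doc in self.corpus:
                    if emitted == self.reps:
                        return
                    yield doc
                    emitted += 1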

@@ -115,12 +115,12 @@ Latent Semantic Analysis on the English Wikipedia.
Distributed LSA on Wikipedia
++++++++++++++++++++++++++++++

First, download and prepare the Wikipedia corpus as per :doc:`wiki`, then load
the corpus iterator with::

>>> import logging, gensim, bz2
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary)
>>> id2word = gensim.corpora.wikicorpus.WikiCorpus.loadDictionary('wiki_en_wordids.txt')
>>> # load corpus iterator
@@ -134,7 +134,7 @@ Now we're ready to run distributed LSA on the English Wikipedia::

>>> # extract 400 LSI topics, using a cluster of nodes
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, numTopics=400, chunks=20000, distributed=True)

>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.printTopics(10)
2010-11-03 16:08:27,602 : INFO : topic #0(200.990): -0.475*"delete" + -0.383*"deletion" + -0.275*"debate" + -0.223*"comments" + -0.220*"edits" + -0.213*"modify" + -0.208*"appropriate" + -0.194*"subsequent" + -0.155*"wp" + -0.117*"notability"
Expand All @@ -148,10 +148,10 @@ Now we're ready to run distributed LSA on the English Wikipedia::
2010-11-03 16:08:27,807 : INFO : topic #8(78.981): 0.588*"film" + 0.460*"films" + -0.130*"album" + -0.127*"station" + 0.121*"television" + 0.115*"poster" + 0.112*"directed" + 0.110*"actors" + -0.096*"railway" + 0.086*"movie"
2010-11-03 16:08:27,834 : INFO : topic #9(78.620): 0.502*"kategori" + 0.282*"categoria" + 0.248*"kategorija" + 0.234*"kategorie" + 0.172*"категория" + 0.165*"categoría" + 0.161*"kategoria" + 0.148*"categorie" + 0.126*"kategória" + 0.121*"catégorie"

In serial mode, creating the LSI model of Wikipedia with this **one-pass algorithm**
takes about 5.25h on my laptop (OS X, C2D 2.53GHz, 4GB RAM with `vecLib`).
In distributed mode with four workers (Linux, dual-core 2GHz Xeons, 4GB RAM
with `ATLAS`), the wallclock time taken drops to 1 hour and 41 minutes. You can
read more about various internal settings and experiments in my `research
paper <http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_.

38 changes: 19 additions & 19 deletions docs/_sources/distributed.txt
@@ -7,8 +7,8 @@ Why distributed computing?
---------------------------

Need to build a semantic representation of a corpus that is millions of documents large and it's
taking forever? Have several idle machines at your disposal that you could use?
`Distributed computing <http://en.wikipedia.org/wiki/Distributed_computing>`_ tries
to accelerate computations by splitting a given task into several smaller subtasks,
passing them on to several computing nodes in parallel.

@@ -22,15 +22,15 @@ much communication going on), so the network is allowed to be of relatively high
most of the time consuming stuff is done inside low-level routines for linear algebra, inside
NumPy, independent of any `gensim` code.
**Installing a fast** `BLAS (Basic Linear Algebra) <http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms>`_ **library
for NumPy can improve performance up to 15 times!** So before you start buying those extra computers,
consider installing a fast, threaded BLAS that is optimized for your particular machine
(as opposed to a generic, binary-distributed library).
Options include your vendor's BLAS library (Intel's MKL,
AMD's ACML, OS X's vecLib, Sun's Sunperf, ...) or some open-source alternative (GotoBLAS, ATLAS).

To see what BLAS and LAPACK you are using, type into your shell::
    python -c 'import scipy; scipy.show_config()'
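
To get a rough feel for what your current BLAS is worth, you can time one large matrix
product in NumPy (an illustrative micro-benchmark; the matrix size is arbitrary and
nothing here is `gensim`-specific)::

    >>> import time
    >>> import numpy
    >>> a = numpy.random.rand(1000, 1000)
    >>> b = numpy.random.rand(1000, 1000)
    >>> start = time.time()
    >>> c = numpy.dot(a, b)  # the product runs inside your BLAS's dgemm routine
    >>> print time.time() - start  # seconds; an optimized, threaded BLAS finishes many times faster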

Prerequisites
-----------------
@@ -61,33 +61,33 @@ inside `gensim` will automatically try to look for and enslave all available wor
If at least one worker is found, things will run in distributed mode; if not, in serial mode.

.. glossary::

Node
A logical working unit. Can correspond to a single physical machine, but you
can also run multiple workers on one machine, resulting in multiple
logical nodes.

Cluster
Several nodes which communicate over TCP/IP. Currently, network broadcasting
is used to discover and connect all communicating nodes, so the nodes must lie
within the same `broadcast domain <http://en.wikipedia.org/wiki/Broadcast_domain>`_.

Worker
A process which is created on each node. To remove a node from your cluster,
simply kill its worker process.

Dispatcher
The dispatcher will be in charge of negotiating all computations, queueing and
distributing ("dispatching") individual jobs to the workers. Computations never
"talk" to worker nodes directly, only through this dispatcher. Unlike workers,
there can only be one active dispatcher at a time in the cluster.


Available distributed algorithms
---------------------------------

.. toctree::
:maxdepth: 1

dist_lsi
dist_lda
10 changes: 5 additions & 5 deletions docs/_sources/index.txt
@@ -9,10 +9,10 @@ Gensim -- Python Framework for Vector Space Modelling
.. admonition:: What's new in version |version|?

* faster and leaner **Latent Semantic Indexing (LSI)** and **Latent Dirichlet Allocation (LDA)**:

* :doc:`Processing the English Wikipedia <wiki>`, 3.2 million documents (`NIPS workshop paper <http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_)
* :doc:`dist_lsi` & :doc:`dist_lda`

* Input corpus iterators can come from a compressed file (**bzip2**, **gzip**, ...), to save disk space when dealing with
very large corpora.
* `gensim` code now resides on `github <https://github.com/piskvorky/gensim/>`_.
@@ -23,7 +23,7 @@ For **installation** and **troubleshooting**, see the :doc:`installation <instal

For **examples** on how to use it, try the :doc:`tutorials <tutorial>`.

When **citing** `gensim` in academic papers, please use
`this BibTeX entry <http://nlp.fi.muni.cz/projekty/gensim/bibtex_gensim.bib>`_.


@@ -40,7 +40,7 @@ Quick Reference Example
>>>
>>> # convert another corpus to the latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[another_corpus])
>>>
>>> # perform similarity query of a query in LSI space against the whole corpus
>>> sims = index[query]
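
For context, a self-contained toy version of the same pipeline could look like this
(the two-document corpus and the query below are made-up placeholders, standing in for
the `corpus`, `another_corpus` and `query` objects that the excerpt above assumes)::

    >>> from gensim import models, similarities
    >>>
    >>> # a toy corpus of two documents in sparse bag-of-words format: (word_id, weight)
    >>> corpus = [[(0, 1.0), (1, 1.0)], [(1, 1.0), (2, 1.0)]]
    >>>
    >>> # train a 2-dimensional LSI model on it
    >>> lsi = models.LsiModel(corpus, numTopics=2)
    >>>
    >>> # index the corpus in the latent space
    >>> index = similarities.MatrixSimilarity(lsi[corpus])
    >>>
    >>> # fold a new document into the same space and rank the indexed documents by similarity to it
    >>> query = [(0, 1.0), (2, 1.0)]
    >>> sims = index[lsi[query]]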

@@ -49,7 +49,7 @@ Quick Reference Example
.. toctree::
:hidden:
:maxdepth: 1

intro
install
tutorial