big trailing whitespace cleanup #6

Merged · 1 commit · Mar 1, 2011
2 changes: 1 addition & 1 deletion CHANGELOG.txt
@@ -52,7 +52,7 @@ Changes
0.6.0

* added option for online LSI training (yay!). the transformation can now be
used after any amount of training, and training can be continued at any time
with more data.
* optimized the tf-idf transformation, so that it is a strictly one-pass algorithm in all cases (thx to Brian Merrell).
* fixed Windows-specific bug in handling binary files (thx to Sutee Sudprasert)
14 changes: 7 additions & 7 deletions MANIFEST.in
@@ -1,7 +1,7 @@
recursive-include docs *
recursive-include src/gensim/test testcorpus*
recursive-include src *.sh
prune docs/src*
include COPYING
include COPYING.LESSER
include ez_setup.py
12 changes: 6 additions & 6 deletions README.txt
@@ -4,7 +4,7 @@ gensim -- Python Framework for Topic Modelling



Gensim is a Python library for *Vector Space Modelling* with very large corpora.
Target audience is the *Natural Language Processing* (NLP) community.


@@ -17,14 +17,14 @@ Features
* easy to plug in your own input corpus/datastream (trivial streaming API)
* easy to extend with other Vector Space algorithms (trivial transformation API)

* Efficient implementations of popular algorithms, such as online **Latent Semantic Analysis**,
**Latent Dirichlet Allocation** or **Random Projections**
* **Distributed computing**: can run *Latent Semantic Analysis* and *Latent Dirichlet Allocation* on a cluster of computers.
* Extensive `HTML documentation and tutorials <http://nlp.fi.muni.cz/projekty/gensim/>`_.


If this feature list left you scratching your head, you can first read more about the `Vector
Space Model <http://en.wikipedia.org/wiki/Vector_space_model>`_ and `unsupervised
document analysis <http://en.wikipedia.org/wiki/Latent_semantic_indexing>`_ on Wikipedia.

Installation
@@ -37,14 +37,14 @@ The simple way to install `gensim` is::

sudo easy_install gensim

Or, if you have instead downloaded and unzipped the `source tar.gz <http://pypi.python.org/pypi/gensim>`_ package,
you'll need to run::

python setup.py test
sudo python setup.py install


For alternative modes of installation (without root privileges, development
installation, optional install features), see the `documentation <http://nlp.fi.muni.cz/projekty/gensim/install.html>`_.

This version has been tested under Python 2.5 and 2.6, but should run on any 2.5 <= Python < 3.0.
1 change: 0 additions & 1 deletion docs/_sources/apiref.txt
@@ -28,4 +28,3 @@ Modules:
models/lda_worker
similarities/docsim


12 changes: 6 additions & 6 deletions docs/_sources/dist_lda.txt
@@ -12,25 +12,25 @@ Setting up the cluster
_______________________

See the tutorial on :doc:`dist_lsi`; setting up a cluster for LDA is completely
analogous, except you want to run `lda_worker` and `lda_dispatcher` scripts instead
of `lsi_worker` and `lsi_dispatcher`.

Running LDA
____________

Run LDA like you normally would, but turn on the `distributed=True` constructor
parameter::

>>> # extract 100 LDA topics, using default parameters
>>> lda = LdaModel(corpus=mm, id2word=id2word, numTopics=100, distributed=True)
using distributed version with 4 workers
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3199665 documets, updating model once every 40000 documents
..
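
Put together, a minimal end-to-end session might look like the sketch below. It assumes
you already have the `wiki_en_wordids.txt` and `wiki_en_tfidf.mm` files produced by the
:doc:`wiki` tutorial (those file names are assumptions carried over from that tutorial,
not something this page creates)::

    >>> import logging, gensim
    >>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
    >>>
    >>> # load the id->word mapping and the corpus iterator prepared in the wiki tutorial
    >>> id2word = gensim.corpora.wikicorpus.WikiCorpus.loadDictionary('wiki_en_wordids.txt')
    >>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
    >>>
    >>> # extract 100 LDA topics, farming the work out to all discovered workers
    >>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, numTopics=100, distributed=True)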


In serial mode (no distribution), creating this online LDA :doc:`model of Wikipedia <wiki>`
takes 10h56m on my laptop (OS X, C2D 2.53GHz, 4GB RAM with `vecLib`).
In distributed mode with four workers (Linux, 2GHz Xeons, 4GB RAM
with `ATLAS <http://math-atlas.sourceforge.net/>`_), the wallclock time taken drops to 3h20m.

To run standard batch LDA (no online updates of mini-batches) instead, you would similarly
@@ -73,7 +73,7 @@ and then, some two days later::
topic #17: 0.027*book + 0.021*published + 0.020*books + 0.014*isbn + 0.010*author + 0.010*magazine + 0.009*press + 0.009*novel + 0.009*writers + 0.008*story
topic #18: 0.027*football + 0.024*players + 0.023*cup + 0.019*club + 0.017*fc + 0.017*footballers + 0.017*league + 0.011*season + 0.007*teams + 0.007*goals
topic #19: 0.032*band + 0.024*album + 0.014*albums + 0.013*guitar + 0.013*rock + 0.011*records + 0.011*vocals + 0.009*live + 0.008*bass + 0.008*track



If you used the distributed LDA implementation in `gensim`, please let me know (my
44 changes: 22 additions & 22 deletions docs/_sources/dist_lsi.txt
@@ -11,7 +11,7 @@ Distributed Latent Semantic Analysis
Setting up the cluster
_______________________

We will show how to run distributed Latent Semantic Analysis by means of an example.
Let's say we have 5 computers at our disposal, all in the same broadcast domain.
To start with, install `gensim` and `Pyro` on each one of them with::

@@ -21,41 +21,41 @@ and run Pyro's name server on exactly *one* of the machines (doesn't matter whic

$ python -m Pyro.naming &

Let's say our example cluster consists of dual-core computers with loads of
memory. We will therefore run **two** worker scripts on four of the physical machines,
creating **eight** logical worker nodes::

$ python -m gensim.models.lsi_worker &

This will execute `gensim`'s `lsi_worker.py` script (to be run twice on each of the
four computers).
This lets `gensim` know that it can run two jobs on each of the four computers in
parallel, so that the computation will be done faster, while also taking up twice
as much memory on each machine.

Next, pick one computer that will be a job scheduler in charge of worker
synchronization, and on it, run `LSA dispatcher`. In our example, we will use the
fifth computer to act as the dispatcher and from there run::

$ python -m gensim.models.lsi_dispatcher &

In general, the dispatcher can be run on the same machine as one of the worker nodes, or it
can be another, distinct computer within the same broadcast domain. The dispatcher
won't be doing much with CPU most of the time, but pick a computer with ample memory.

And that's it! The cluster is set up and running, ready to accept jobs. To remove
a worker later on, simply terminate its `lsi_worker` process. To add another worker, run another
`lsi_worker` (this will not affect a computation that is already running). If you terminate
`lsi_dispatcher`, you won't be able to run computations until you run it again
(surviving workers can be re-used though).


Running LSA
____________

So let's test our setup and run one computation of distributed LSA. Open a Python
shell on one of the five machines (again, this can be done on any computer
in the same `broadcast domain <http://en.wikipedia.org/wiki/Broadcast_domain>`_,
our choice is incidental) and try::

>>> from gensim import corpora, models, utils
@@ -81,13 +81,13 @@ To check the LSA results, let's print the first two latent topics::
topic #1(2.542): -0.623*"graph" + -0.490*"trees" + -0.451*"minors" + -0.274*"survey" + 0.167*"system"

Success! But a corpus of nine documents is no challenge for our powerful cluster...
In fact, we had to lower the job size (`chunks` parameter above) to a single document
at a time, otherwise all documents would be processed by a single worker all at once.

So let's run LSA on **one million documents** instead::

>>> # inflate the corpus to 1M documents, by repeating its documents over&over
>>> corpus1m = utils.RepeatCorpus(corpus, 1000000)
>>> # run distributed LSA on 1 million documents
>>> lsi1m = models.LsiModel(corpus1m, id2word=id2word, numTopics=200, chunks=10000, distributed=True)
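
For intuition, `utils.RepeatCorpus` is just a streamed wrapper that re-yields the base
corpus until the requested number of documents has been emitted; conceptually it behaves
like this simplified sketch (illustrative only, not the actual implementation)::

    class RepeatCorpus(object):
        """Repeat `corpus` over and over, `reps` documents in total."""
        def __init__(self, corpus, reps):
            self.corpus = corpus
            self.reps = reps

        def __iter__(self):
            emitted = 0
            while emitted < self.reps:  # keep cycling through the base corpus
                for doc in self.corpus:
                    if emitted == self.reps:
                        return
                    yield doc
                    emitted += 1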

@@ -115,12 +115,12 @@ Latent Semantic Analysis on the English Wikipedia.
Distributed LSA on Wikipedia
++++++++++++++++++++++++++++++

First, download and prepare the Wikipedia corpus as per :doc:`wiki`, then load
the corpus iterator with::

>>> import logging, gensim, bz2
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary)
>>> id2word = gensim.corpora.wikicorpus.WikiCorpus.loadDictionary('wiki_en_wordids.txt')
>>> # load corpus iterator
@@ -134,7 +134,7 @@ Now we're ready to run distributed LSA on the English Wikipedia::

>>> # extract 400 LSI topics, using a cluster of nodes
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, numTopics=400, chunks=20000, distributed=True)

>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.printTopics(10)
2010-11-03 16:08:27,602 : INFO : topic #0(200.990): -0.475*"delete" + -0.383*"deletion" + -0.275*"debate" + -0.223*"comments" + -0.220*"edits" + -0.213*"modify" + -0.208*"appropriate" + -0.194*"subsequent" + -0.155*"wp" + -0.117*"notability"
Expand All @@ -148,10 +148,10 @@ Now we're ready to run distributed LSA on the English Wikipedia::
2010-11-03 16:08:27,807 : INFO : topic #8(78.981): 0.588*"film" + 0.460*"films" + -0.130*"album" + -0.127*"station" + 0.121*"television" + 0.115*"poster" + 0.112*"directed" + 0.110*"actors" + -0.096*"railway" + 0.086*"movie"
2010-11-03 16:08:27,834 : INFO : topic #9(78.620): 0.502*"kategori" + 0.282*"categoria" + 0.248*"kategorija" + 0.234*"kategorie" + 0.172*"категория" + 0.165*"categoría" + 0.161*"kategoria" + 0.148*"categorie" + 0.126*"kategória" + 0.121*"catégorie"

In serial mode, creating the LSI model of Wikipedia with this **one-pass algorithm**
takes about 5.25h on my laptop (OS X, C2D 2.53GHz, 4GB RAM with `vecLib`).
In distributed mode with four workers (Linux, dual-core 2GHz Xeons, 4GB RAM
with `ATLAS`), the wallclock time taken drops to 1 hour and 41 minutes. You can
read more about various internal settings and experiments in my `research
paper <http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_.

38 changes: 19 additions & 19 deletions docs/_sources/distributed.txt
@@ -7,8 +7,8 @@ Why distributed computing?
---------------------------

Need to build a semantic representation of a corpus that is millions of documents large and it's
taking forever? Have several idle machines at your disposal that you could use?
`Distributed computing <http://en.wikipedia.org/wiki/Distributed_computing>`_ tries
to accelerate computations by splitting a given task into several smaller subtasks,
passing them on to several computing nodes in parallel.

@@ -22,15 +22,15 @@ much communication going on), so the network is allowed to be of relatively high
most of the time consuming stuff is done inside low-level routines for linear algebra, inside
NumPy, independent of any `gensim` code.
**Installing a fast** `BLAS (Basic Linear Algebra) <http://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms>`_ **library
for NumPy can improve performance up to 15 times!** So before you start buying those extra computers,
consider installing a fast, threaded BLAS that is optimized for your particular machine
(as opposed to a generic, binary-distributed library).
Options include your vendor's BLAS library (Intel's MKL,
AMD's ACML, OS X's vecLib, Sun's Sunperf, ...) or some open-source alternative (GotoBLAS, ATLAS).

To see what BLAS and LAPACK you are using, type into your shell::
    python -c 'import scipy; scipy.show_config()'
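
To get a rough feel for what your current BLAS is worth, you can time one large matrix
product in NumPy (an illustrative micro-benchmark; the matrix size is arbitrary and
nothing here is `gensim`-specific)::

    >>> import time
    >>> import numpy
    >>> a = numpy.random.rand(1000, 1000)
    >>> b = numpy.random.rand(1000, 1000)
    >>> start = time.time()
    >>> c = numpy.dot(a, b)  # the product runs inside your BLAS's dgemm routine
    >>> print time.time() - start  # seconds; an optimized, threaded BLAS finishes many times faster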

Prerequisites
-----------------
@@ -61,33 +61,33 @@ inside `gensim` will automatically try to look for and enslave all available wor
If at least one worker is found, things will run in distributed mode; if not, in serial mode.

.. glossary::

Node
A logical working unit. Can correspond to a single physical machine, but you
can also run multiple workers on one machine, resulting in multiple
logical nodes.

Cluster
Several nodes which communicate over TCP/IP. Currently, network broadcasting
is used to discover and connect all communicating nodes, so the nodes must lie
within the same `broadcast domain <http://en.wikipedia.org/wiki/Broadcast_domain>`_.

Worker
A process which is created on each node. To remove a node from your cluster,
simply kill its worker process.

Dispatcher
The dispatcher will be in charge of negotiating all computations, queueing and
distributing ("dispatching") individual jobs to the workers. Computations never
"talk" to worker nodes directly, only through this dispatcher. Unlike workers,
there can only be one active dispatcher at a time in the cluster.


Available distributed algorithms
---------------------------------

.. toctree::
:maxdepth: 1

dist_lsi
dist_lda
10 changes: 5 additions & 5 deletions docs/_sources/index.txt
@@ -9,10 +9,10 @@ Gensim -- Python Framework for Vector Space Modelling
.. admonition:: What's new in version |version|?

* faster and leaner **Latent Semantic Indexing (LSI)** and **Latent Dirichlet Allocation (LDA)**:

* :doc:`Processing the English Wikipedia <wiki>`, 3.2 million documents (`NIPS workshop paper <http://nlp.fi.muni.cz/~xrehurek/nips/rehurek_nips.pdf>`_)
* :doc:`dist_lsi` & :doc:`dist_lda`

* Input corpus iterators can come from a compressed file (**bzip2**, **gzip**, ...), to save disk space when dealing with
very large corpora.
* `gensim` code now resides on `github <https://github.com/piskvorky/gensim/>`_.
@@ -23,7 +23,7 @@ For **installation** and **troubleshooting**, see the :doc:`installation <instal

For **examples** on how to use it, try the :doc:`tutorials <tutorial>`.

When **citing** `gensim` in academic papers, please use
`this BibTeX entry <http://nlp.fi.muni.cz/projekty/gensim/bibtex_gensim.bib>`_.


@@ -40,7 +40,7 @@ Quick Reference Example
>>>
>>> # convert another corpus to the latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[another_corpus])
>>>
>>> # perform similarity query of a query in LSI space against the whole corpus
>>> sims = index[query]
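
For context, a self-contained toy version of the same pipeline could look like this
(the two-document corpus and the query below are made-up placeholders, standing in for
the `corpus`, `another_corpus` and `query` objects that the excerpt above assumes)::

    >>> from gensim import models, similarities
    >>>
    >>> # a toy corpus of two documents in sparse bag-of-words format: (word_id, weight)
    >>> corpus = [[(0, 1.0), (1, 1.0)], [(1, 1.0), (2, 1.0)]]
    >>>
    >>> # train a 2-dimensional LSI model on it
    >>> lsi = models.LsiModel(corpus, numTopics=2)
    >>>
    >>> # index the corpus in the latent space
    >>> index = similarities.MatrixSimilarity(lsi[corpus])
    >>>
    >>> # fold a new document into the same space and rank the indexed documents by similarity to it
    >>> query = [(0, 1.0), (2, 1.0)]
    >>> sims = index[lsi[query]]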

@@ -49,7 +49,7 @@ Quick Reference Example
.. toctree::
:hidden:
:maxdepth: 1

intro
install
tutorial