Skip to content

Commit

Permalink
Merge branch 'master' into develop
Browse files Browse the repository at this point in the history
  • Loading branch information
menshikh-iv committed Jul 6, 2018
2 parents 398a00d + 006a6ee commit 19e725d
Show file tree
Hide file tree
Showing 5 changed files with 3,273 additions and 2,752 deletions.
202 changes: 202 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,207 @@
Changes
===========
## 3.5.0, 2018-07-06

This release comprises a glorious 38 pull requests from 28 contributors. Most of the effort went into improving the documentation—hence the release code name "Docs 💬"!

Apart from the **massive overhaul of all Gensim documentation** (including docstring style and examples—[you asked for it](https://rare-technologies.com/gensim-survey-2018/)), we also managed to sneak in some new functionality and a number of bug fixes. As usual, see the notes below for a complete list, with links to pull requests for more details.

**Huge thanks to all contributors!** Nobody loves working on documentation. 3.5.0 is a result of several months of laborious, unglamorous, and sometimes invisible work. Enjoy!


### :books: Documentation improvements

* Overhaul documentation for `*2vec` models (__[@steremma](https://github.com/steremma)__ & __[@piskvorky](https://github.com/piskvorky)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1944](https://github.com/RaRe-Technologies/gensim/pull/1944), [#2087](https://github.com/RaRe-Technologies/gensim/pull/2087))
* Fix documentation for LDA-related models (__[@steremma](https://github.com/steremma)__ & __[@piskvorky](https://github.com/piskvorky)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2026](https://github.com/RaRe-Technologies/gensim/pull/2026))
* Fix documentation for utils, corpora, inferfaces (__[@piskvorky](https://github.com/piskvorky)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2096](https://github.com/RaRe-Technologies/gensim/pull/2096))
* Update non-API docs (about, intro, license etc) (__[@piskvorky](https://github.com/piskvorky)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2101](https://github.com/RaRe-Technologies/gensim/pull/2101))
* Refactor documentation for `gensim.models.phrases` (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1950](https://github.com/RaRe-Technologies/gensim/pull/1950))
* Fix HashDictionary documentation (__[@piskvorky](https://github.com/piskvorky)__, [#2073](https://github.com/RaRe-Technologies/gensim/pull/2073))
* Fix docstrings for `gensim.models.AuthorTopicModel` (__[@souravsingh](https://github.com/souravsingh)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1907](https://github.com/RaRe-Technologies/gensim/pull/1907))
* Fix docstrings for HdpModel, lda_worker & lda_dispatcher (__[@gyanesh-m](https://github.com/gyanesh-m)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1912](https://github.com/RaRe-Technologies/gensim/pull/1912))
* Fix format & links for `gensim.similarities.docsim` (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#2030](https://github.com/RaRe-Technologies/gensim/pull/2030))
* Remove duplication of class documentation for `IndexedCorpus` (__[@darindf](https://github.com/darindf)__, [#2033](https://github.com/RaRe-Technologies/gensim/pull/2033))
* Refactor documentation for `gensim.models.coherencemodel` (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1933](https://github.com/RaRe-Technologies/gensim/pull/1933))
* Fix docstrings for `gensim.sklearn_api` (__[@steremma](https://github.com/steremma)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1895](https://github.com/RaRe-Technologies/gensim/pull/1895))
* Disable google-style docstring support (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#2106](https://github.com/RaRe-Technologies/gensim/pull/2106))
* Fix docstring of `gensim.models.KeyedVectors.similarity_matrix` (__[@Witiko](https://github.com/Witiko)__, [#1971](https://github.com/RaRe-Technologies/gensim/pull/1971))
* Consistently use `smart_open()` instead of `open()` in notebooks (__[@sharanry](https://github.com/sharanry)__, [#1812](https://github.com/RaRe-Technologies/gensim/pull/1812))


### :star2: New features:

* Add `add_entity` method to `KeyedVectors` to allow adding word vectors manually (__[@persiyanov](https://github.com/persiyanov)__, [#1957](https://github.com/RaRe-Technologies/gensim/pull/1957))
* Add inference for new unseen author to `AuthorTopicModel` (__[@Stamenov](https://github.com/Stamenov)__, [#1766](https://github.com/RaRe-Technologies/gensim/pull/1766))
* Add `evaluate_word_analogies` (will replace `accuracy`) method to `KeyedVectors` (__[@akutuzov](https://github.com/akutuzov)__, [#1935](https://github.com/RaRe-Technologies/gensim/pull/1935))
* Add Pivot Normalization to `TfidfModel` (__[@markroxor](https://github.com/markroxor)__, [#1780](https://github.com/RaRe-Technologies/gensim/pull/1780))



### :+1: Improvements

* Allow initialization with `max_final_vocab` in lieu of `min_count` in `Word2Vec`(__[@aneesh-joshi](https://github.com/aneesh-joshi)__, [#1915](https://github.com/RaRe-Technologies/gensim/pull/1915))
* Add `dtype` argument for `chunkize_serial` in `LdaModel` (__[@darindf](https://github.com/darindf)__, [#2027](https://github.com/RaRe-Technologies/gensim/pull/2027))
* Increase performance in `Phrases.analyze_sentence` (__[@JonathanHourany](https://github.com/JonathanHourany)__, [#2070](https://github.com/RaRe-Technologies/gensim/pull/2070))
* Add `ns_exponent` parameter to control the negative sampling distribution for `*2vec` models (__[@fernandocamargoti](https://github.com/fernandocamargoti)__, [#2093](https://github.com/RaRe-Technologies/gensim/pull/2093))


### :red_circle: Bug fixes:


* Fix `Doc2Vec.infer_vector` + notebook cleanup (__[@gojomo](https://github.com/gojomo)__, [#2103](https://github.com/RaRe-Technologies/gensim/pull/2103))
* Fix linear decay for learning rate in `Doc2Vec.infer_vector` (__[@umangv](https://github.com/umangv)__, [#2063](https://github.com/RaRe-Technologies/gensim/pull/2063))
* Fix negative sampling floating-point error for `gensim.models.Poincare (__[@jayantj](https://github.com/jayantj)__, [#1959](https://github.com/RaRe-Technologies/gensim/pull/1959))
* Fix loading `word2vec` and `doc2vec` models saved using old Gensim versions (__[@manneshiva](https://github.com/manneshiva)__, [#2012](https://github.com/RaRe-Technologies/gensim/pull/2012))
* Fix `SoftCosineSimilarity.get_similarities` on corpora ssues/1955) (__[@Witiko](https://github.com/Witiko)__, [#1972](https://github.com/RaRe-Technologies/gensim/pull/1972))
* Fix return dtype for `matutils.unitvec` according to input dtype (__[@o-P-o](https://github.com/o-P-o)__, [#1992](https://github.com/RaRe-Technologies/gensim/pull/1992))
* Fix passing empty dictionary to `gensim.corpora.WikiCorpus` (__[@steremma](https://github.com/steremma)__, [#2042](https://github.com/RaRe-Technologies/gensim/pull/2042))
* Fix bug in `Similarity.query_shards` in multiprocessing case (__[@bohea](https://github.com/bohea)__, [#2044](https://github.com/RaRe-Technologies/gensim/pull/2044))
* Fix SMART from TfidfModel for case when `df == "n"` (__[@PeteBleackley](https://github.com/PeteBleackley)__, [#2021](https://github.com/RaRe-Technologies/gensim/pull/2021))
* Fix OverflowError when loading a large term-document matrix in compiled MatrixMarket format (__[@arlenk](https://github.com/arlenk)__, [#2001](https://github.com/RaRe-Technologies/gensim/pull/2001))
* Update rules for removing table markup from Wikipedia dumps (__[@chaitaliSaini](https://github.com/chaitaliSaini)__, [#1954](https://github.com/RaRe-Technologies/gensim/pull/1954))
* Fix `_is_single` from `Phrases` for case when corpus is a NumPy array (__[@rmalouf](https://github.com/rmalouf)__, [#1987](https://github.com/RaRe-Technologies/gensim/pull/1987))
* Fix tests for `EuclideanKeyedVectors.similarity_matrix` (__[@Witiko](https://github.com/Witiko)__, [#1984](https://github.com/RaRe-Technologies/gensim/pull/1984))
* Fix deprecated parameters in `D2VTransformer` and `W2VTransformer`(__[@MritunjayMohitesh](https://github.com/MritunjayMohitesh)__, [#1945](https://github.com/RaRe-Technologies/gensim/pull/1945))
* Fix `Doc2Vec.infer_vector` after loading old `Doc2Vec` (`gensim<=3.2`)(__[@manneshiva](https://github.com/manneshiva)__, [#1974](https://github.com/RaRe-Technologies/gensim/pull/1974))
* Fix inheritance chain for `load_word2vec_format` (__[@DennisChen0307](https://github.com/DennisChen0307)__, [#1968](https://github.com/RaRe-Technologies/gensim/pull/1968))
* Update Keras version (avoid bug from `keras==2.1.5`) (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1963](https://github.com/RaRe-Technologies/gensim/pull/1963))



### :warning: Deprecations (will be removed in the next major release)
* Remove
- `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
- `gensim.examples`
- `gensim.nosy`
- `gensim.scripts.word2vec_standalone`
- `gensim.scripts.make_wiki_lemma`
- `gensim.scripts.make_wiki_online`
- `gensim.scripts.make_wiki_online_lemma`
- `gensim.scripts.make_wiki_online_nodebug`
- `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation)
- "deprecated" functions and attributes

* Move
- `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
- `gensim.summarization` ➡ `gensim.models.summarization`
- `gensim.topic_coherence` ➡ `gensim.models._coherence`
- `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
- `gensim.parsing.*` ➡ `gensim.utils.text_utils`


## 3.4.0, 2018-03-01

### :star2: New features:
* Massive optimizations of `gensim.models.LdaModel`: much faster training, using Cython. (__[@arlenk](https://github.com/arlenk)__, [#1767](https://github.com/RaRe-Technologies/gensim/pull/1767))
- Training benchmark :boom:

| dataset | old LDA [sec] | optimized LDA [sec] | speed up |
|---------|---------------|---------------------|---------|
| nytimes | 3473 | **1975** | **1.76x** |
| enron | 774 | **437** | **1.77x** |

- This change **affects all models that depend on `LdaModel`**, such as `LdaMulticore`, `LdaSeqModel`, `AuthorTopicModel`.
* Huge speed-ups to corpus I/O with `MmCorpus` (Cython) (__[@arlenk](https://github.com/arlenk)__, [#1825](https://github.com/RaRe-Technologies/gensim/pull/1825))
- File reading benchmark

| dataset | file compressed? | old MmReader [sec] | optimized MmReader [sec] | speed up |
|---------------|:-----------:|:------------:|:------------------:|:-------------:|
| enron | no | 22.3 | **2.6** | **8.7x** |
| | yes | 37.3 | **14.4** | **2.6x** |
| nytimes | no | 419.3 | **49.2** | **8.5x** |
| | yes | 686.2 | **275.1** | **2.5x** |
| text8 | no | 25.4 | **2.5** | **10.1x** |
| | yes | 41.9 | **17.0** | **2.5x** |

- Overall, a **2.5x** speedup for compressed `.mm.gz` input and **8.5x** :fire::fire::fire: for uncompressed plaintext `.mm`.

* Performance and memory optimization to `gensim.models.FastText` :rocket: (__[@jbaiter](https://github.com/jbaiter)__, [#1916](https://github.com/RaRe-Technologies/gensim/pull/1916))
- Benchmark (first 500,000 articles from English Wikipedia)

| Metric | old FastText | optimized FastText | improvement |
| -----------------------| -----------------| -------------------|-------------|
| Training time (1 epoch) | 4823.4s (80.38 minutes) | **1873.6s (31.22 minutes)** | **2.57x** |
| Training time (full) | 1h 26min 13s | **36min 43s** | **2.35x** |
| Training words/sec | 72,781 | **187,366** | **2.57x** |
| Training peak memory | 5.2 GB | **3.7 GB** | **1.4x** |

- Overall, a **2.5x** speedup & memory usage reduced by **30%**.

* Implemented [Soft Cosine Measure](https://en.wikipedia.org/wiki/Cosine_similarity#Soft_cosine_measure) (__[@Witiko](https://github.com/Witiko)__, [#1827](https://github.com/RaRe-Technologies/gensim/pull/1827))
- New method for assessing document similarity, a nice faster alternative to [WMD, Word Mover's Distance](http://proceedings.mlr.press/v37/kusnerb15.pdf)
- Benchmark

| Technique | MAP score | Duration |
|-----------|-----------|--------------|
| softcossim| **45.99** | **1.24 sec** |
| wmd-relax | 44.48 | 12.22 sec |
| cossim | 44.22 | 4.39 sec |
| wmd-gensim| 44.08 | 98.29 sec |

- [Soft Cosine notebook with detailed description, examples & benchmarks](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb)
- Related papers:
- [Soft Similarity and Soft Cosine Measure: Similarity of Features in Vector Space Model](http://www.scielo.org.mx/pdf/cys/v18n3/v18n3a7.pdf)
- [SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering](http://www.aclweb.org/anthology/S17-2051)
- [Vector Space Representations in IR](https://github.com/witiko-masters-thesis/thesis/blob/master/main.pdf)


### :+1: Improvements:
* New method to show the Gensim installation parameters: `python -m gensim.scripts.package_info --info`. Use this when reporting problems, for easier debugging. Fix #1902 (__[@sharanry](https://github.com/sharanry)__, [#1903](https://github.com/RaRe-Technologies/gensim/pull/1903))
* Added a flag to optionally skip network-related tests, to help maintainers avoid network issues with CI services (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1930](https://github.com/RaRe-Technologies/gensim/pull/1930))
* Added `license` field to `setup.py`, allowing the use of tools like `pip-licenses` (__[@nils-werner](https://github.com/nils-werner)__, [#1909](https://github.com/RaRe-Technologies/gensim/pull/1909))

### :red_circle: Bug fixes:
* Fix Python 3 compatibility for `gensim.corpora.UciCorpus.save_corpus` (__[@darindf](https://github.com/darindf)__, [#1875](https://github.com/RaRe-Technologies/gensim/pull/1875))
* Add `wv` property to KeyedVectors for backward compatibility. Fix #1882 (__[@manneshiva](https://github.com/manneshiva)__, [#1884](https://github.com/RaRe-Technologies/gensim/pull/1884))
* Fix deprecation warning from `inspect.getargspec`. Fix #1878 (__[@aneesh-joshi](https://github.com/aneesh-joshi)__, [#1887](https://github.com/RaRe-Technologies/gensim/pull/1887))
* Add `LabeledSentence` to `gensim.models.doc2vec` for backward compatibility. Fix #1886 (__[@manneshiva](https://github.com/manneshiva)__, [#1891](https://github.com/RaRe-Technologies/gensim/pull/1891))
* Fix empty output bug in `Phrases` (when using `model[tokens]` twice). Fix #1401 (__[@sj29-innovate](https://github.com/sj29-innovate)__, [#1853](https://github.com/RaRe-Technologies/gensim/pull/1853))
* Fix type problems for `D2VTransformer.fit_transform`. Fix #1834 (__[@Utkarsh-Mishra-CIC](https://github.com/Utkarsh-Mishra-CIC)__, [#1845](https://github.com/RaRe-Technologies/gensim/pull/1845))
* Fix `datatype` parameter for `KeyedVectors.load_word2vec_format`. Fix #1682 (__[@pushpankar](https://github.com/pushpankar)__, [#1819](https://github.com/RaRe-Technologies/gensim/pull/1819))
* Fix deprecated parameters in `doc2vec-lee` notebook (__[@TheFlash10](https://github.com/TheFlash10)__, [#1918](https://github.com/RaRe-Technologies/gensim/pull/1918))
* Fix file-like closing bug in `gensim.corpora.MmCorpus`. Fix #1869 (__[@sj29-innovate](https://github.com/sj29-innovate)__, [#1911](https://github.com/RaRe-Technologies/gensim/pull/1911))
* Fix precision problem in `test_similarities.py`, no more FP fails. (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1928](https://github.com/RaRe-Technologies/gensim/pull/1928))
* Fix encoding in Lee corpus reader. (__[@menshikh-iv](https://github.com/menshikh-iv)__, [#1931](https://github.com/RaRe-Technologies/gensim/pull/1931))
* Fix OOV pairs counter in `WordEmbeddingsKeyedVectors.evaluate_word_pairs`. (__[@akutuzov](https://github.com/akutuzov)__, [#1934](https://github.com/RaRe-Technologies/gensim/pull/1934))


### :books: Tutorial and doc improvements:
* Fix example block for `gensim.models.Word2Vec` (__[@nzw0301](https://github.com/nzw0301)__, [#1870](https://github.com/RaRe-Technologies/gensim/pull/1876))
* Fix `doc2vec-lee` notebook (__[@numericlee](https://github.com/numericlee)__, [#1870](https://github.com/RaRe-Technologies/gensim/pull/1870))
* Store images from `README.md` directly in repository. Fix #1849 (__[@ibrahimsharaf](https://github.com/ibrahimsharaf)__, [#1861](https://github.com/RaRe-Technologies/gensim/pull/1861))
* Add windows venv activate command to `CONTRIBUTING.md` (__[@aneesh-joshi](https://github.com/aneesh-joshi)__, [#1880](https://github.com/RaRe-Technologies/gensim/pull/1880))
* Add anaconda-cloud badge. Partial fix #1901 (__[@sharanry](https://github.com/sharanry)__, [#1905](https://github.com/RaRe-Technologies/gensim/pull/1905))
* Fix docstrings for lsi-related code (__[@steremma](https://github.com/steremma)__, [#1892](https://github.com/RaRe-Technologies/gensim/pull/1892))
* Fix parameter description of `sg` parameter for `gensim.models.word2vec` (__[@mdcclv](https://github.com/mdcclv)__, [#1919](https://github.com/RaRe-Technologies/gensim/pull/1919))
* Refactor documentation for `gensim.similarities.docsim` and `MmCorpus-related`. (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1910](https://github.com/RaRe-Technologies/gensim/pull/1910))
* Fix docstrings for `gensim.test.utils` (__[@yurkai](https://github.com/yurkai)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1904](https://github.com/RaRe-Technologies/gensim/pull/1904))
* Refactor docstrings for `gensim.scripts`. Partial fix #1665 (__[@yurkai](https://github.com/yurkai)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1792](https://github.com/RaRe-Technologies/gensim/pull/1792))
* Refactor API reference `gensim.corpora`. Partial fix #1671 (__[@CLearERR](https://github.com/CLearERR)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1835](https://github.com/RaRe-Technologies/gensim/pull/1835))
* Fix documentation for `gensim.models.wrappers` (__[@kakshay21](https://github.com/kakshay21)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1859](https://github.com/RaRe-Technologies/gensim/pull/1859))
* Fix docstrings for `gensim.interfaces` (__[@yurkai](https://github.com/yurkai)__ & __[@menshikh-iv](https://github.com/menshikh-iv)__, [#1913](https://github.com/RaRe-Technologies/gensim/pull/1913))


### :warning: Deprecations (will be removed in the next major release)
* Remove
- `gensim.models.wrappers.fasttext` (obsoleted by the new native `gensim.models.fasttext` implementation)
- `gensim.examples`
- `gensim.nosy`
- `gensim.scripts.word2vec_standalone`
- `gensim.scripts.make_wiki_lemma`
- `gensim.scripts.make_wiki_online`
- `gensim.scripts.make_wiki_online_lemma`
- `gensim.scripts.make_wiki_online_nodebug`
- `gensim.scripts.make_wiki` (all of these obsoleted by the new native `gensim.scripts.segment_wiki` implementation)
- "deprecated" functions and attributes

* Move
- `gensim.scripts.make_wikicorpus` ➡ `gensim.scripts.make_wiki.py`
- `gensim.summarization` ➡ `gensim.models.summarization`
- `gensim.topic_coherence` ➡ `gensim.models._coherence`
- `gensim.utils` ➡ `gensim.utils.utils` (old imports will continue to work)
- `gensim.parsing.*` ➡ `gensim.utils.text_utils`


## 3.3.0, 2018-01-02

:star2: New features:
Expand Down
4 changes: 2 additions & 2 deletions docs/src/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,9 +55,9 @@
# built documents.
#
# The short X.Y version.
version = '3.4'
version = '3.5'
# The full version, including alpha/beta/rc tags.
release = '3.4.0'
release = '3.5.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
Expand Down
Loading

0 comments on commit 19e725d

Please sign in to comment.