EnsembleLda #2282

sezanzeb · 2018-12-03T11:06:59Z

Tests

~~We have some tests that we can also commit when this PR is of interest, I will polish them and embed them into the unittest module in that case.~~

They are committed.

Materials

As requested by @piskvorky, here are the materials that describe our module:
eLDA_algo_overview.pdf
eLDA_motivation.pdf
eLDA_when_to_use.pdf

~~Code Review~~

~~There are two places in "gensim/models/ensemblelda.py" for which I especially would like to have review:~~

~~lines 611 and 784: The error handling of multiprocessing~~
line 1124 and following: The way I provide functions crucial to gensim's ldamodel API. Those functions just forward the calls to an internal object. Are they too redundant, or do you think they make things easier? Is there a better way to do this?
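To make the forwarding question concrete, here is a minimal sketch of the two usual options: explicit one-line forwarding methods versus a generic `__getattr__` fallback. The class names `InnerModel` and `Wrapper` are hypothetical stand-ins and are not taken from ensemblelda.py.

```python
class InnerModel:
    """Hypothetical stand-in for the wrapped model object."""

    def __init__(self, num_topics):
        self.num_topics = num_topics

    def get_topics(self):
        # Uniform dummy topic, just so there is something to return.
        return [[1.0 / self.num_topics] * self.num_topics]


class Wrapper:
    """Hypothetical wrapper that exposes the inner model's API."""

    def __init__(self, inner):
        self._inner = inner

    # Option 1: explicit forwarding. Redundant, but the method shows up
    # in the docs, in dir() and in IDE completion.
    def get_topics(self):
        return self._inner.get_topics()

    # Option 2: generic fallback for everything not defined on the wrapper.
    def __getattr__(self, name):
        # Only called when normal attribute lookup fails. The guard avoids
        # infinite recursion if _inner itself is not set yet (e.g. while
        # unpickling).
        if name == "_inner":
            raise AttributeError(name)
        return getattr(self._inner, name)


wrapper = Wrapper(InnerModel(num_topics=4))
print(wrapper.get_topics())  # explicit forwarding
print(wrapper.num_topics)    # resolved through __getattr__
```

Explicit forwarding costs a few lines per method but keeps docstrings discoverable; the `__getattr__` route is shorter but hides which parts of the inner API are actually supported.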
~~And another one for the "tox -e flake8" test, which does not pass here:~~

gensim/corpora/opinosiscorpus.py:78:22: W605 invalid escape sequence '\w'
                    processed = [
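For reference, W605 is usually resolved by writing the pattern as a raw string. The sketch below shows the idea with a hypothetical `tokenize` helper; it is not the actual code from opinosiscorpus.py.

```python
import re

# Raw string: the backslash reaches the regex engine untouched, so flake8
# no longer reports "W605 invalid escape sequence '\w'".
WORD_PATTERN = re.compile(r"\w+")


def tokenize(text):
    """Return lowercase word tokens (hypothetical helper for illustration)."""
    return [token.lower() for token in WORD_PATTERN.findall(text)]


print(tokenize("Hello, World!"))
```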

~~Would you prefer the implementation not to use pandas, since it has not been a dependency of gensim before?~~

menshikh-iv · 2019-01-17T01:41:17Z

CC @piskvorky, wdyt about this feature (not PR itself), is it a good idea to add it to gensim?

menshikh-iv · 2019-01-29T03:55:56Z

ping @piskvorky

piskvorky · 2019-01-29T13:06:38Z

Definitely! Getting more robust LDA topics sounds like an awesome feature (and the accompanying descriptions/motivations are exemplary – we can point to this in future PRs).

I didn't have time to review the PR code, but yes, we prefer to avoid pandas (complex dependency, unclear memory semantics, encourages bad engineering patterns). Can you use normal list/dict comprehensions instead? Cheers.
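As an illustration of what replacing a pandas group-by with plain comprehensions can look like, here is a minimal sketch of "group rows by label and take the mean". The function name `mean_by_label` and the sample data are assumptions for the example, not code from the PR.

```python
from collections import defaultdict


def mean_by_label(rows, labels):
    """Average the rows that share a label, without pandas.

    rows   -- list of equally long lists of floats
    labels -- one label per row
    """
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    # zip(*members) iterates column-wise over all rows of one group.
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }


rows = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
labels = [0, 0, 1]
print(mean_by_label(rows, labels))
```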

SophiaGoldberg · 2019-03-18T14:58:35Z

Minor fix: when running gensim/docs/notebooks/Opinosis.ipynb we require that gensim/gensim/corpora/__init__.py contains the line of code: from .opinosiscorpus import OpinosisCorpus.

sezanzeb · 2019-04-06T09:07:24Z

> Definitely! Getting more robust LDA topics sounds like an awesome feature (and the accompanying descriptions/motivations are exemplary – we can point to this in future PRs).
>
> I didn't have time to review the PR code, but yes, we prefer to avoid pandas (complex dependency, unclear memory semantics, encourages bad engineering patterns). Can you use normal list/dict comprehensions instead? Cheers.

I'm working on it and will let you know when I'm done.

sezanzeb · 2019-04-13T10:08:21Z

EDIT: avoided the problem in the tests.

The test now takes the distance matrix from the reference model, continues working with it, and then checks whether the results are as expected. The distance matrix itself is only checked roughly, to avoid failing on (rounding?) problems with the cosine distance (maybe because the vectors are very long?).
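A rough check of this kind can be sketched with `numpy.testing.assert_allclose`. The helper `cosine_distance_matrix`, the sample vectors and the tolerance are assumptions chosen for the example, not the values used in test_ensemblelda.py.

```python
import numpy as np


def cosine_distance_matrix(vectors, dtype=np.float64):
    """Pairwise cosine distances between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=dtype)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Pretend the reference was stored in float64 and the test recomputes the
# matrix in lower precision (or on a different architecture).
reference = cosine_distance_matrix(vectors, dtype=np.float64)
recomputed = cosine_distance_matrix(vectors, dtype=np.float32)

# Compare with an absolute tolerance instead of exact equality.
np.testing.assert_allclose(recomputed, reference, rtol=0, atol=1e-5)
```

An exact comparison such as `np.array_equal` would be sensitive to the last few bits of each entry, which is the kind of failure the rough check is meant to avoid.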

The pandas dependency is completely removed btw.

@piskvorky

The tests I have supplied work locally in the following environments:

4.19.32-1-MANJARO

python 2.7.16
python 3.7.2

Docker container with Ubuntu 18.04

python 2.7.15rc1
python 3.6.7

Docker container with travisci/ci-garnet:packer-1515445631-7dfb2e1

Python 2.7.6

~~Online it fails in 2.7, 3.5 and 3.6, but passes in 3.7~~

I don't see anything in the CBDBSCAN class that would be non-deterministic or that could behave differently in the online testing environment. Other than that, it might be floating point precision, as the distance matrix is sorted prior to clustering. The clustering results also suggest that the sorting order is most likely different. So would it be better to avoid this kind of test, i.e. storing matrices like that in the test_data directory to check whether they are still the same after modifying the code, when the result might change depending on the architecture? I also observed some significant differences in float calculations between 2.7 and 3.6, so this seems a somewhat likely cause of the problem.
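The suspected effect can be reproduced in isolation: two distances that differ only in the last bit swap places when sorted. One possible mitigation, sketched below under the assumption that near-ties may be treated as ties, is to round before a stable sort. The helper `stable_order` and the choice of 8 decimals are illustrative and not what the PR does.

```python
import numpy as np


def stable_order(distances, decimals=8):
    """Indices that sort `distances`, treating near-equal values as ties."""
    rounded = np.round(np.asarray(distances, dtype=np.float64), decimals)
    # kind="stable" keeps the original index order among equal values.
    return np.argsort(rounded, kind="stable")


# 0.1 + 0.2 is 0.30000000000000004, one bit above 0.3.
distances = [0.1 + 0.2, 0.3, 0.05]

print(np.argsort(distances, kind="stable"))  # order decided by the last bit
print(stable_order(distances))               # near-tie resolved by index
```

Rounding only moves the problem to values that sit close to a rounding boundary, so this reduces the sensitivity rather than removing it.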

~~(see line 1188 and below)~~ https://github.com/RaRe-Technologies/gensim/blob/6dc60014b96dadc2109a805f4d1f1c5c836675c1/gensim/models/ensemblelda.py#L1188

~~Here is the test that compares training results to a pretrained model, which is included in the package and which fails in travisci (line 61 to be specific):~~ https://github.com/RaRe-Technologies/gensim/blob/3ec31e7b5a57addca9945d3ed787a42801167f54/gensim/test/test_ensemblelda.py#L37

sezanzeb · 2020-01-23T22:05:32Z

@piskvorky

@aloosley and I think this is ready to merge (if the tests pass)

mpenkov

This is looking very good.

Left you some comments. Please have a look and let me know when you're ready for another review.

In general:

Tighter scope for context managers and try-catch
PEP8
gensim-specific formatting nitpicks (hanging indents, avoid .format, etc)
Remove code comments that add little value.

See the comments for details.
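As a generic illustration of the first and third points, the sketch below keeps the `try` block and the context manager as narrow as possible and passes logging arguments lazily instead of calling `.format`. The function `load_counts` is hypothetical, and reading "avoid .format" as "use %-style logging arguments" is an assumption about the reviewer's intent.

```python
import logging

logger = logging.getLogger(__name__)


def load_counts(path):
    """Count how often each line occurs in a text file (hypothetical helper)."""
    # The try block contains only the call whose error is handled.
    try:
        handle = open(path, encoding="utf-8")
    except FileNotFoundError:
        # %-style arguments are only formatted if the message is emitted.
        logger.warning("no file at %s, returning empty counts", path)
        return {}

    # The context manager covers only the code that needs the open file.
    with handle:
        lines = handle.read().splitlines()

    counts = {}
    for line in lines:
        counts[line] = counts.get(line, 0) + 1

    logger.info(
        "loaded %d distinct lines from %s",
        len(counts), path,
    )
    return counts
```

With the narrow `try`, an unrelated `FileNotFoundError` raised later in the function would not be silently swallowed by the same handler.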

I made it around halfway through ensemblelda.py before I ran out of time - I'll check the rest next time.

gensim/corpora/opinosiscorpus.py

gensim/models/ensemblelda.py

sezanzeb · 2020-02-06T22:01:47Z

Thanks for your efforts in reviewing our stuff!

Haven't checked whether opinosis still works after the changes, so don't merge yet. Gonna check the following weekend or so.

sezanzeb · 2020-02-09T20:28:52Z

looks good. @mpenkov it's ready for another review

sezanzeb · 2020-04-13T13:26:34Z

@piskvorky @mpenkov any updates?

piskvorky · 2020-04-13T13:42:24Z

Sorry, the beginning of the year was super busy. We have a bunch of PRs lined up, we'll get to this with @mpenkov shortly! Thanks.

sezanzeb · 2020-04-13T19:39:09Z

> Sorry, the beginning of the year was super busy. We have a bunch of PRs lined up, we'll get to this with @mpenkov shortly! Thanks.

Ah I see, don't worry at all!

sezanzeb · 2020-10-14T20:50:05Z

I'm going to reopen this PR from my own fork, because I don't get access to the repo anymore, possibly due to organizational changes within that company.

sezanzeb added 2 commits December 1, 2018 19:27

added EnsembleLda

7b73db9

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

241c17e

…into EnsembleLda

sezanzeb changed the title ~~Ensemble lda~~ EnsembleLda Dec 3, 2018

menshikh-iv added the interesting PR ⭐ Interesting PR topic, but not ready (need much work to finish) label Jan 11, 2019

Merge branch 'master' of https://github.com/rare-technologies/gensim …

51945e4

…into EnsembleLda

sezanzeb added 2 commits April 5, 2019 02:31

improvements to add_model, various small changes to comments and code

a67d5db

pandas -> numpy: group by label and mean

e27be0a

sezanzeb added 16 commits April 7, 2019 00:58

pandas -> numpy: generate_stable_topics

83de2dd

pandas -> numpy: distance matrix creation

2af1658

pandas -> numpy: CBDBSCAN

100bbf0

fixes for automated checks

aff3287

improvements on logs, comments and variable naming. Changed save func…

a545ddf

…tion to simply pickling the whole thing.

minor fix in log message format

d1a6854

added tests

3650895

fixed test

00a06e9

removed some dead leftover pandas code from test

f5f1c9c

removed pathlib from test

c32ddad

tests work in python2 locally now

dab067f

Merge branch 'master' of https://github.com/rare-technologies/gensim …

dcc77ef

…into EnsembleLda

updated ensemble test reference model

eb9ea27

passing tox8

6b0dc77

improved determinism of methods

6dc6001

improved order of assertions

3ec31e7

trying to achieve higher precision with float64 to avoid some sorting…

7afd192

… differences across architectures

sezanzeb and others added 3 commits January 23, 2020 23:12

Merge remote-tracking branch 'remotes/original/develop' into EnsembleLda

f63eb03

missing opinosiscorpus.rst file committed

2aabe8f

Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…

f0600dd

…leLda

mpenkov requested changes Jan 30, 2020

aloosley and others added 12 commits February 6, 2020 20:35

p names refactored to be descriptive, now using append for appending …

9c25cf5

…singletons to list

Changing to hanging indents where they were not used before

4079c27

Adding :meth: and styling for RST

f5379ff

a bunch of reviews

591cf77

merge

1fbafcb

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

7366495

…into EnsembleLda

* Changed ensemblelda default to use ldamulticore instead of old lda …

f1aba3e

…model. * Added more :meth: and `` `` styling for RST, fixed a few typos

Merge branch 'EnsembleLda' of github.com:DataReply/gensim into Ensemb…

9833e44

…leLda

More docstring polish

a9428de

removing some camel-case vars for pep8 compliance.

3bdeaf2

a bunch of reviews

806952e

merge

e1344bd

fixed linter

3d24f62

mpenkov self-assigned this Jun 10, 2020

sezanzeb mentioned this pull request Sep 29, 2020

added test for sync_state #2959

Merged

sezanzeb mentioned this pull request Oct 14, 2020

Added EnsembleLda for stable LDA topics #2980

Merged

sezanzeb closed this Oct 14, 2020
