Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EnsembleLda #2282

Closed
wants to merge 133 commits into from
Closed

EnsembleLda #2282

wants to merge 133 commits into from

Conversation

sezanzeb
Copy link
Contributor

@sezanzeb sezanzeb commented Dec 3, 2018

Tests

We have some tests that we can also commit when this PR is of interest, I will polish them and embed them into the unittest module in that case.

They are commited.

Materials

As requested by @piskvorky, here are the materials that describe our module:
eLDA_algo_overview.pdf
eLDA_motivation.pdf
eLDA_when_to_use.pdf

Code Review

There are two places in "gensim/models/ensemblelda.py" for which I especially would like to have review:

  • lines 611 and 784: The error handling of multiprocessing
  • line 1124 and following: The way I provide functions crucial to gensims ldamodel api. Those functions just forward the calls to an internal object. So are they too redundant or do you think they make things easier? Is there a better way to do so?
  • And another one for the "tox -e flake8" test, which does not pass here:
gensim/corpora/opinosiscorpus.py:78:22: W605 invalid escape sequence '\w'
                    processed = [
  • would you prefer the implemention not to use pandas, since it has not been a dependency of gensim before?

@sezanzeb sezanzeb changed the title Ensemble lda EnsembleLda Dec 3, 2018
@menshikh-iv menshikh-iv added the interesting PR ⭐ Interesting PR topic, but not ready (need much work to finish) label Jan 11, 2019
@menshikh-iv
Copy link
Contributor

CC @piskvorky, wdyt about this feature (not PR itself), is it a good idea to add it to gensim?

@menshikh-iv
Copy link
Contributor

ping @piskvorky

@piskvorky
Copy link
Owner

piskvorky commented Jan 29, 2019

Definitely! Getting more robust LDA topics sounds like an awesome feature (and the accompanying descriptions/motivations are exemplary – we can point to this in future PRs).

I didn't have time to review the PR code, but yes, we prefer to avoid pandas (complex dependency, unclear memory semantics, encourages bad engineering patterns). Can you use normal list/dict comprehensions instead? Cheers.

@SophiaGoldberg
Copy link

Minor fix: when running gensim/docs/notebooks/Opinosis.ipynb we require that gensim/gensim/corpora/__init__.py contains the line of code: from .opinosiscorpus import OpinosisCorpus.

@sezanzeb
Copy link
Contributor Author

sezanzeb commented Apr 6, 2019

Definitely! Getting more robust LDA topics sounds like an awesome feature (and the accompanying descriptions/motivations are exemplary – we can point to this in future PRs).

I didn't have time to review the PR code, but yes, we prefer to avoid pandas (complex dependency, unclear memory semantics, encourages bad engineering patterns). Can you use normal list/dict comprehensions instead? Cheers.

I'm working on it and will let you know when I'm done.

@sezanzeb
Copy link
Contributor Author

sezanzeb commented Apr 13, 2019

EDIT: avoided the problem in the tests.

Taking the distance matrix from the reference and continuing to work with that, then checking if the results are as expected. Checking the distance matrix only roughly to avoid failuring on (rounding?) problems with the cosine distance (maybe because it's due to the vectors being very long?).

The pandas dependency is completely removed btw.


@piskvorky

The tests I have supplied work locally in the following environments:

4.19.32-1-MANJARO

  • python 2.7.16
  • python 3.7.2

Docker container with Ubuntu 18.04

  • python 2.7.15rc1
  • python 3.6.7

Docker container with travisci/ci-garnet:packer-1515445631-7dfb2e1

  • Python 2.7.6

Online it fails in 2.7, 3.5 and 3.6, but passes in 3.7

I don't see anything in the CBDBSCAN class that would not be deterministic and that could be different in the online testing environment. Other than that, it might be floating point precision, as sorting on the distance matrix is done prior to clustering. The clustering results also show that the sorting is different most likely. So would it be better to avoid this kind of test? To store matrixes like that into test_data directory in order check if they are still the same after modifying the code, when it might change depending on the architecture? I also observed quite some significant differences in float calculations between 2.7 and 3.6, so this seems somewhat likely to cause the problem.

(see line 1188 and below) https://github.com/RaRe-Technologies/gensim/blob/6dc60014b96dadc2109a805f4d1f1c5c836675c1/gensim/models/ensemblelda.py#L1188

Here is the test that compares training results to a pretrained model, which is included in the package and which fails in travisci (line 61 to be specific): https://github.com/RaRe-Technologies/gensim/blob/3ec31e7b5a57addca9945d3ed787a42801167f54/gensim/test/test_ensemblelda.py#L37

@sezanzeb
Copy link
Contributor Author

sezanzeb commented Jan 23, 2020

@piskvorky

@aloosley and me think this is ready to merge (if the tests pass)

Copy link
Collaborator

@mpenkov mpenkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking very good.

Left you some comments. Please have a look and let me know when you're ready for another review.

In general:

  • Tighter scope for context managers and try-catch
  • PEP8
  • gensim-specific formatting nitpicks (hanging indents, avoid .format, etc)
  • Remove code comments that add little value.

See the comments for details.

I made it around halfway through ensemblelda.py before I ran out of time - I'll check the rest next time.

gensim/corpora/opinosiscorpus.py Outdated Show resolved Hide resolved
gensim/corpora/opinosiscorpus.py Outdated Show resolved Hide resolved
gensim/corpora/opinosiscorpus.py Outdated Show resolved Hide resolved
gensim/corpora/opinosiscorpus.py Outdated Show resolved Hide resolved
gensim/corpora/opinosiscorpus.py Outdated Show resolved Hide resolved
gensim/models/ensemblelda.py Outdated Show resolved Hide resolved
gensim/models/ensemblelda.py Outdated Show resolved Hide resolved
gensim/models/ensemblelda.py Outdated Show resolved Hide resolved
gensim/models/ensemblelda.py Outdated Show resolved Hide resolved
gensim/models/ensemblelda.py Outdated Show resolved Hide resolved
@sezanzeb
Copy link
Contributor Author

sezanzeb commented Feb 6, 2020

Thanks for your efforts in reviewing our stuff!

haven't checked if opinosis still works after the changes, don't merge yet. Gonna check the following weekend or something

@sezanzeb
Copy link
Contributor Author

sezanzeb commented Feb 9, 2020

looks good. @mpenkov it's ready for another review

@sezanzeb
Copy link
Contributor Author

@piskvorky @mpenkov any updates?

@piskvorky
Copy link
Owner

Sorry, the beginning of the year was super busy. We have a bunch of PRs lined up, we'll get to this with @mpenkov shortly! Thanks.

@sezanzeb
Copy link
Contributor Author

Sorry, the beginning of the year was super busy. We have a bunch of PRs lined up, we'll get to this with @mpenkov shortly! Thanks.

Ah I see, don't worry at all!

@mpenkov mpenkov self-assigned this Jun 10, 2020
@sezanzeb
Copy link
Contributor Author

I'm going to reopen this PR from my own fork, because I don't get access to the repo anymore, possibly due to organizational changes within that company.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interesting PR ⭐ Interesting PR topic, but not ready (need much work to finish)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants