Added EnsembleLda for stable LDA topics #2980

sezanzeb · 2020-10-14T20:58:24Z

Reopened #2282 because I lost access to the repo there.

I'll merge the up-to-date develop branch now. I'm not sure whether the required documentation was provided, so I'll check that (#2673).

sezanzeb · 2021-07-03T19:41:07Z

It didn't work out on Friday and probably won't today either, but we will be back on it shortly and finish up what is still open.

sezanzeb · 2021-07-04T20:50:02Z

Tests should pass now.

We definitely think that dataclasses and typing would create a much cleaner codebase even though they were not directly asked for in the review, but doing it correctly would result in somewhat large changes. Therefore we decided to leave out such changes for a future PR.

We are looking forward to contributing this to version 4.1 and are waiting for your feedback now.

sezanzeb · 2021-07-05T21:41:41Z

@aloosley added two simple dataclasses, Topic and Cluster, now. It is already a lot better and wasn't much work after all. I think this is all right.

Tests are failing because:

Reading package lists...
E: Failed to fetch https://dl.bintray.com/sbt/debian/InRelease  403  Forbidden [IP: 52.88.131.165 443]
E: The repository 'https://dl.bintray.com/sbt/debian  InRelease' is not signed.

Whatever https://dl.bintray.com/sbt/debian is, it is not accessible. Maybe this is a temporary issue.

mpenkov · 2021-07-17T13:02:01Z

Note to self: need to work around bintray being sunset.

https://status.bintray.com/

Decoupled multiprocessing code from EnsembleLda class.

This reduces the length of the class by several hundred lines, making it slightly easier to understand.

- Added _generate_topic_models_worker function to clarify distinction between single-process and multi-process code
- Fixed flake8 problem (l is an ambiguous variable name)
- Adjusted _teardown function (removed i parameter, it's only for logs)
- Moved _MAX_RANDOM_STATE to module level

mpenkov · 2021-07-18T13:51:11Z

@aloosley @sezanzeb Finally got a chance to sit down and have a final look at this.

Everything looks good, with the exception that the EnsembleLda class was still a bit too busy: on top of the actual model functionality, it's dealing with a ton of multiprocessing stuff. This makes it difficult to understand what's going on. I mentioned this earlier but it may have been lost in the rest of the comments.

Anyway, to help things move along, I made the changes myself. I hope you guys don't mind. All the EnsembleLda tests pass locally (they're still running in CI as I write this) so I don't think I broke anything, but just in case I did, can you please have a glance at the changes here: 71b33dd?

aloosley · 2021-07-18T21:59:46Z

Taking a look at the commit, 90% of it is the extraction of logic from the EnsembleLda class into module-level functions. This all looks fine to me, and tests pass for me locally as well. @sezanzeb, any objections?

mpenkov · 2021-07-22T12:33:11Z

Alright, finally merged this. @sezanzeb @aloosley Thank you for this awesome contribution and your patience!

sezanzeb · 2021-07-22T21:15:05Z

Thank you so much @mpenkov and @piskvorky for your continuous interest in this! I hope the community will find good use for it :)

aloosley · 2021-08-01T22:46:30Z

Many thanks @mpenkov and @piskvorky for believing in this stable topic modeling idea (which turned into @sezanzeb's thesis project) and working with us to get it out there for the world to use more easily.

sezanzeb and others added 30 commits December 1, 2018 19:27

added EnsembleLda

7b73db9

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

241c17e

…into EnsembleLda

Merge branch 'master' of https://github.com/rare-technologies/gensim …

51945e4

…into EnsembleLda

improvements to add_model, various small changes to comments and code

a67d5db

pandas -> numpy: group by label and mean

e27be0a

pandas -> numpy: generate_stable_topics

83de2dd

pandas -> numpy: distance matrix creation

2af1658

pandas -> numpy: CBDBSCAN

100bbf0

fixes for automated checks

aff3287

improvements on logs, comments and variable naming. Changed save func…

a545ddf

…tion to simply pickling the whole thing.

minor fix in log message format

d1a6854

added tests

3650895

fixed test

00a06e9

removed some dead leftover pandas code from test

f5f1c9c

removed pathlib from test

c32ddad

tests work in python2 locally now

dab067f

Merge branch 'master' of https://github.com/rare-technologies/gensim …

dcc77ef

…into EnsembleLda

updated ensemble test reference model

eb9ea27

passing tox8

6b0dc77

improved determinism of methods

6dc6001

improved order of assertions

3ec31e7

trying to achieve higher precision with float64 to avoid some sorting…

7afd192

… differences across architectures

better approach for comparing with pretrained model

16d0357

potentially fixing the tests on windows

01b68e4

potentially fixing the tests on windows

9314cb4

Merge branch 'develop' of https://github.com/RaRe-Technologies/gensim …

b282393

…into EnsembleLda

changed citation of opinosis

0b7febc

tox8 test passing after small change on opinosis comments/citation

60a717d

Moving max_random_state inside the model as a private variable.

2ff60ca

removed whitespace

d36fe43

aloosley and others added 6 commits June 30, 2021 22:33

Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…

c4a46e9

…eLda

simplified some function calls to use attributes instead of parameters

531ac6a

Merge branch 'EnsembleLda' of https://github.com/sezanzeb/gensim into…

59270dc

… EnsembleLda

sort key function

0311d94

Merge branch 'EnsembleLda' of github.com:sezanzeb/gensim into Ensembl…

3e00f19

…eLda

more efficient tests with better case names

9617e8f

sezanzeb added 3 commits July 4, 2021 22:18

new reference model

07f5148

updated opinosis example

6dc4bcb

tox

1e5108c

using dataclasses

9ac3439

sezanzeb added 2 commits July 5, 2021 23:43

updated type syntax for docstring

773ce17

unused import

4d674f9

mpenkov added 3 commits July 18, 2021 20:49

update sbt install step

c35fb01

roll back change to docs/src/Makefile

444c190

re-raise caught exception instead of raising a new one

f00aca8

we don't want to hide the details of the problem

add docstring

cac6819
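The "re-raise caught exception instead of raising a new one" commit in the list above follows a general Python pattern that can be sketched as below. This is only an illustration, not the code from the commit, and the function names `parse_count` and `parse_count_hiding` are hypothetical.

```python
def parse_count(text):
    """Re-raise the caught exception so the caller sees the original problem."""
    try:
        return int(text)
    except ValueError:
        # Logging or cleanup could happen here. A bare `raise` re-raises the
        # same exception object, keeping its type, message, and traceback.
        raise


def parse_count_hiding(text):
    """Counter-example: replace the exception and hide the original details."""
    try:
        return int(text)
    except ValueError:
        # `from None` suppresses the original exception in the traceback,
        # so the caller no longer sees which input caused the failure.
        raise RuntimeError("could not parse count") from None
```

A caller of `parse_count("abc")` still gets a `ValueError` whose message names the offending input, while `parse_count_hiding("abc")` reports only the generic `RuntimeError`.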

mpenkov merged commit 76579b3 into piskvorky:develop Jul 22, 2021

piskvorky changed the title ~~EnsembleLda~~ Added EnsembleLda for stable LDA topics Jul 22, 2021

sezanzeb mentioned this pull request Oct 2, 2022

Giving missing credit in EnsembleLDA to Alex in docs #3393

Merged
