Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added EnsembleLda for stable LDA topics #2980

Merged
merged 211 commits into from
Jul 22, 2021
Merged

Conversation

sezanzeb
Copy link
Contributor

@sezanzeb sezanzeb commented Oct 14, 2020

Reopened #2282 because I lost access to the repo there.

I'll merge the up to date develop now and I'm not sure if the required documentation was provided, I'll check that (#2673).

sezanzeb and others added 30 commits December 1, 2018 19:27
@sezanzeb
Copy link
Contributor Author

sezanzeb commented Jul 3, 2021

Didn't work out on Friday and also probably not today, but we will be on it shortly again and finish up what is still open now

@sezanzeb
Copy link
Contributor Author

sezanzeb commented Jul 4, 2021

Tests should pass now.

We definitely think that dataclasses and typing would create a much cleaner codebase even though they were not directly asked for in the review, but doing it correctly would result in somewhat large changes. Therefore we decided to leave out such changes for a future PR.

We are looking forward to contributing this to version 4.1 and are waiting for your feedback now.

@sezanzeb
Copy link
Contributor Author

sezanzeb commented Jul 5, 2021

@aloosley added two simple dataclasses Topic and Cluster now, it is already a lot better and wasn't much work after all. I think this is allright

tests are failing because

Reading package lists...
E: Failed to fetch https://dl.bintray.com/sbt/debian/InRelease  403  Forbidden [IP: 52.88.131.165 443]
E: The repository 'https://dl.bintray.com/sbt/debian  InRelease' is not signed.

whatever https://dl.bintray.com/sbt/debian is, it is not accessible. Maybe this is a temporary issue

@mpenkov
Copy link
Collaborator

mpenkov commented Jul 17, 2021

Note to self: need to work around bintray being sunset.

https://status.bintray.com/

Decoupled multiprocessing code from EnsembleLda class.  This reduces the
length of the class by several hundred lines, making it slightly easier
to understand.

Added _generate_topic_models_worker function to clarify distinction
between single-process and multi-process code.

Fixed flake8 problem (l is an ambiguous variable name)

Adjusted _teardown function (removed i parameter, it's only for logs)

Moved _MAX_RANDOM_STATE to module level
@mpenkov
Copy link
Collaborator

mpenkov commented Jul 18, 2021

@aloosley @sezanzeb Finally got a chance to sit down and have a final look at this.

Everything looks good, with the exception that the EnsembleLda class was still a bit too busy: on top of the actual model functionality, it's dealing with ton of multiprocessing stuff. This makes it difficult to understand what's going on. I mentioned this earlier but it may have been lost in the rest of the comments.

Anyway, to help things move along, I made the changes myself. I hope you guys don't mind. All the EnsembleLda tests pass locally (they're still running in CI as I write this) so I don't think I broke anything, but just in case I did, can you please have a glance at the changes here: 71b33dd?

we don't want to hide the details of the problem
@aloosley
Copy link
Contributor

@aloosley @sezanzeb Finally got a chance to sit down and have a final look at this.

Everything looks good, with the exception that the EnsembleLda class was still a bit too busy: on top of the actual model functionality, it's dealing with ton of multiprocessing stuff. This makes it difficult to understand what's going on. I mentioned this earlier but it may have been lost in the rest of the comments.

Anyway, to help things move along, I made the changes myself. I hope you guys don't mind. All the EnsembleLda tests pass locally (they're still running in CI as I write this) so I don't think I broke anything, but just in case I did, can you please have a glance at the changes here: 71b33dd?

Taking a look at the commit, 90% is the excavation of logic from the EnsembleLda class into module level functions. This all looks fine to me and tests pass for me locally as well. @sezanzeb, any objections?

@mpenkov mpenkov merged commit 76579b3 into piskvorky:develop Jul 22, 2021
@mpenkov
Copy link
Collaborator

mpenkov commented Jul 22, 2021

Alright, finally merged this. @sezanzeb @aloosley Thank you for this awesome contribution and your patience!

@sezanzeb
Copy link
Contributor Author

sezanzeb commented Jul 22, 2021

Thank you so much @mpenkov and @piskvorky for your continuous interest in this! I hope the community will find good use for it :)

@piskvorky piskvorky changed the title EnsembleLda Added EnsembleLda for stable LDA topics Jul 22, 2021
@aloosley
Copy link
Contributor

aloosley commented Aug 1, 2021

Alright, finally merged this. @sezanzeb @aloosley Thank you for this awesome contribution and your patience!

Many thanks @mpenkov and @piskvorky for believing in this stable topic modeling idea (turn @sezanzeb's thesis project) and working with us to get it out there for the world to use more easily.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
interesting PR ⭐ Interesting PR topic, but not ready (need much work to finish)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants