[SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor LDA for multiple LDA algorithms (EM+Gibbs) #4807

Closed
wants to merge 4 commits

Conversation

EntilZha
Contributor

JIRA: https://issues.apache.org/jira/browse/SPARK-5556

As discussed in that issue, it would be great to have multiple LDA algorithm options, principally EM (implemented already in #4047) and Gibbs.

Goals of PR:

  1. Refactor LDA to allow multiple algorithm options (done)
  2. Refactor Gibbs code here to this interface (mostly done): https://github.com/EntilZha/spark/tree/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling
  3. Run the same performance tests run for the EM PR for comparison (todo, initial smaller tests have been run)

At the moment, I am looking for feedback on the refactoring while working on putting the Gibbs code in.

Summary of Changes:

These traits were created to encapsulate everything about a specific implementation while still interfacing with the entry point LDA.run and with DistributedLDAModel.

private[clustering] trait LearningState {
  def next(): LearningState
  def topicsMatrix: Matrix
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
  def logLikelihood: Double
  def logPrior: Double
  def topicDistributions: RDD[(Long, Vector)]
  def globalTopicTotals: LDA.TopicCounts
  def k: Int
  def vocabSize: Int
  def docConcentration: Double
  def topicConcentration: Double
  def deleteAllCheckpoints(): Unit
}

private[clustering] trait LearningStateInitializer {
  def initialState(
      docs: RDD[(Long, Vector)],
      k: Int,
      docConcentration: Double,
      topicConcentration: Double,
      randomSeed: Long,
      checkpointInterval: Int): LearningState
}

The entirety of an LDA implementation can be captured by an object and a class which extend these traits: the LearningStateInitializer provides the method that returns the LearningState, which maintains state across iterations.

Lastly, the algorithm can be set via an enum which is pattern matched to instantiate the corresponding implementation (sketched below). My thought is that the default algorithm should be whichever performs better.
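
To make the wiring concrete, here is a rough sketch of how an implementation would plug in and how LDA.run would dispatch on the enum. Names such as GibbsLearningState, GibbsLearningStateInitializer, and LDAAlgorithm are illustrative, not code from this PR:

private[clustering] object GibbsLearningStateInitializer extends LearningStateInitializer {
  override def initialState(
      docs: RDD[(Long, Vector)],
      k: Int,
      docConcentration: Double,
      topicConcentration: Double,
      randomSeed: Long,
      checkpointInterval: Int): LearningState = {
    // Build initial topic assignments and counts, then wrap them in the state class.
    new GibbsLearningState(/* initialized structures */)
  }
}

// Inside LDA.run, the algorithm enum selects the initializer and the iteration loop is generic:
val initializer = algorithm match {
  case LDAAlgorithm.EM    => EMLearningStateInitializer
  case LDAAlgorithm.Gibbs => GibbsLearningStateInitializer
}
var state = initializer.initialState(
  docs, k, docConcentration, topicConcentration, randomSeed, checkpointInterval)
var iter = 0
while (iter < maxIterations) {
  state = state.next()
  iter += 1
}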

Gibbs Implementation

Old design doc is here:
Primary Gibbs algorithm from here (mostly notation/math, GraphX based, not table based): http://www.cs.ucsb.edu/~mingjia/cs240/doc/273811.pdf
Implements FastLDA from here: http://www.ics.uci.edu/~newman/pubs/fastlda.pdf

Specific Points for Feedback

  1. Naming is hard, and I'm not sure if the traits are named appropriately.
  2. Similarly, I am reasonably familiar with the Scala type system, but perhaps there are some ninja tricks I don't know about that would be helpful.
  3. General interface/cleanliness.
  4. Should the LearningState traits, etc., go within LDA? I think so; thoughts?
  5. Anything else; I'm also learning here.

@mengxr
Contributor

mengxr commented Feb 27, 2015

add to whitelist

@mengxr
Contributor

mengxr commented Feb 27, 2015

ok to test

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28051 has finished for PR 4807 at commit 34d5853.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LearningStateInitializer
    • class EMLearningState(optimizer: EMOptimizer) extends LearningState

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28052 has finished for PR 4807 at commit e895f38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo
Contributor

witgo commented Mar 10, 2015

@EntilZha @mengxr Can this branch be merged into master?
I want to merge this PR into LightLDA and lda_Gibbs.

@EntilZha
Contributor Author

As is, it can be merged (as far as work on refactoring goes). I had actually considered having separate PRs for the Refactor and Gibbs anyway.

@witgo
Contributor

witgo commented Mar 11, 2015

@EntilZha thx.
@mengxr what do you think?

private[clustering] trait LearningState {
def next(): LearningState
def topicsMatrix: Matrix
def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
Contributor

This is not necessary, right? Should be removed from the LearningState class.

Contributor Author

Why is it not necessary? The LDASuite, which contains the Distributed/Local models, calls it. How they are created is up to the specific implementation of LDA. Could you be more specific about why it's not necessary?

Contributor

Different implementations can be hidden by topicsMatrix, right?

Contributor Author

The reason for a separate method is twofold. First, although you could calculate it from topicsMatrix in theory, topicsMatrix could be very large (too large to fit in driver memory, as the docs warn). describeTopics is intended to give the implementation an interface for extracting only the top maxTermsPerTopic terms per topic. That is less likely to run the driver out of memory, and it keeps the computation of the top n terms distributed.
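
To illustrate (my own sketch, not code from this PR or MLlib): the top terms per topic can be selected with a bounded per-topic candidate list on the executors, so only about k * maxTermsPerTopic entries ever reach the driver. The termWeights layout below is a hypothetical simplification of whatever per-term state an implementation keeps.

import org.apache.spark.rdd.RDD

// termWeights: for each term index, its weight in each of the k topics.
def topTermsPerTopic(
    termWeights: RDD[(Int, Array[Double])],
    k: Int,
    maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])] = {
  val zero = Array.fill(k)(List.empty[(Int, Double)])
  val top = termWeights.aggregate(zero)(
    (acc, tw) => {
      val (term, weights) = tw
      var t = 0
      while (t < k) {
        // Keep only the best maxTermsPerTopic candidates per topic within each partition.
        acc(t) = ((term, weights(t)) :: acc(t)).sortBy(-_._2).take(maxTermsPerTopic)
        t += 1
      }
      acc
    },
    (a, b) => Array.tabulate(k)(t => (a(t) ++ b(t)).sortBy(-_._2).take(maxTermsPerTopic))
  )
  top.map { termList =>
    (termList.map(_._1).toArray, termList.map(_._2).toArray)
  }
}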

Contributor

How about def topicsMatrix: Matrix => def termTopicDistributions: RDD[(Long, Vector)]?

def topicDistributions: RDD[(Long, Vector)] => def docTopicDistributions: RDD[(Long,Vector)]?

Contributor Author

Probably @jkbradley can weigh in here. I think both changes seem reasonable; the Matrix could then be computed from the RDD. If there is agreement, I can make the change on the PR.
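
For what it's worth, a minimal sketch of "compute the Matrix from the RDD" under the proposed renaming (method and parameter names are hypothetical; it assumes the full vocabSize x k matrix fits in driver memory):

import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix, Vector}
import org.apache.spark.rdd.RDD

def toTopicsMatrix(
    termTopicDistributions: RDD[(Long, Vector)],
    vocabSize: Int,
    k: Int): Matrix = {
  // Collect the per-term topic distributions and pack them into a local,
  // column-major vocabSize x k matrix with entry (term, topic).
  val values = new Array[Double](vocabSize * k)
  termTopicDistributions.collect().foreach { case (term, dist) =>
    val row = dist.toArray
    var topic = 0
    while (topic < k) {
      values(topic * vocabSize + term.toInt) = row(topic)
      topic += 1
    }
  }
  new DenseMatrix(vocabSize, k, values)
}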

@witgo
Contributor

witgo commented Mar 12, 2015

@jkbradley, @mengxr Do you have time to take a look at this?

@jkbradley
Member

I apologize for the lack of response! I'm going to try to make a pass ASAP.

@EntilZha
Contributor Author

EntilZha commented Apr 1, 2015

Sounds good. I think it's reasonable that this PR only includes the refactoring, not Gibbs. Then we can evaluate LightLDA vs. FastLDA and choose which one makes sense. If the changes look good, I will go ahead and make the changes proposed by @witgo.

@jkbradley
Member

@EntilZha I made a quick pass to get a sense of the structure. My main comment is that these changes seem to mix the concepts of algorithm and model together. I think that's why @witgo was confused about putting describeTopics within the learning state; that confused me a lot too. You are right that we need some changes to the code to support other algorithms, but I'd like to keep the separation between algorithms and models. Here's what I'd propose:

Models:

  • abstract class LDAModel
  • class LocalLDAModel extends LDAModel
    • This should be the de facto local LDAModel. If algorithms store extra info with a local model, they can extend this class.
  • trait WithTrainingResults
    • This trait indicates that a model stores results for the whole training corpus. This is important to store, of course, since recomputing it will not necessarily get the same results. But it will also be useful to allow users to throw out this extra info if they don't want to store it.
  • class EMDistributedLDAModel extends LDAModel with WithTrainingResults
    • This (and other extensions of LDAModel) can be converted to a LocalLDAModel.
  • class GibbsDistributedLDAModel extends LDAModel with WithTrainingResults
  • Notes
    • I'm not listing a DistributedLDAModel trait since I don't think it would share anything across different algorithms' models.

For algorithms, I like keeping everything in class LDA. That can make calls out to Optimizers (1 for each algorithm), and the Optimizers can return an instance of LDAModel. I'd eliminate LearningState and instead put everything in Optimizer, including initialization. I don't see a need for multiple classes since they belong under the same concept.
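
To make the shapes concrete, here's a rough skeleton of that hierarchy (a sketch for discussion only; constructors and most bodies are stubbed out, and none of this is final code):

import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.rdd.RDD

abstract class LDAModel {
  def k: Int
  def vocabSize: Int
  def topicsMatrix: Matrix
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
}

// Models that additionally keep per-document results for the training corpus.
trait WithTrainingResults {
  def logLikelihood: Double
  def logPrior: Double
  def topicDistributions: RDD[(Long, Vector)]
}

// The de facto local model; algorithm-specific local models can extend it.
class LocalLDAModel(val topicsMatrix: Matrix) extends LDAModel {
  def k: Int = topicsMatrix.numCols
  def vocabSize: Int = topicsMatrix.numRows
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])] =
    throw new NotImplementedError("omitted in this sketch")
}

// Each algorithm's distributed model mixes in WithTrainingResults and can be
// converted to a LocalLDAModel.
abstract class EMDistributedLDAModel extends LDAModel with WithTrainingResults {
  def toLocal: LocalLDAModel = new LocalLDAModel(topicsMatrix)
}

abstract class GibbsDistributedLDAModel extends LDAModel with WithTrainingResults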

I'm sorry I took so long to respond, but I'll keep up to date from here on out. I hadn't realized quite so much was blocking on this PR.

Also, I know I'm suggesting significant changes to the PR, but it should actually require fewer changes to master and still allow multiple algorithms.

CC: @hhbyyh since you have investment in this via [https://github.com//pull/4419]. I believe OnlineLDA could fit under these abstractions.

@hhbyyh
Contributor

hhbyyh commented Apr 3, 2015

@jkbradley Thanks for the proposal and it looks reasonable.

Just two minor things that aren't clear:

  1. Should different algorithms have different entry points in LDA, like runGibbs, runOnline, runEM? I kind of like it, as the separation looks simple and clear.
  2. Online LDA has several specific arguments. What's the recommended place to put them and their getters/setters: in LDA or in the optimizer?

I'm good with the other parts. Thanks.

@EntilZha
Contributor Author

EntilZha commented Apr 3, 2015

Just double checking, your suggestion would be to revert to master, implement those general changes, then commit/push the modified branch?

The primary reason I did it this way was to refactor/abstract along the method boundaries that exist right now, but as you noted, it does mix model and algorithm. I like your approach of extending the abstract class with a trait. I haven't taken much time to work on it, but I could do that over the next couple of days. I also plan on being at Databricks on Wednesday for the training if you want to chat then.

@jkbradley
Member

Here's a proposal. Let me know what you think!

@hhbyyh

  1. Should different algorithms have different entrance in LDA, like runGibbs, runOnline, runEM? I kinda like it as the separation looks simple and clear.

Multiple run methods do make that separation clearer, but they also force beginner users (who don't know what these algorithms are) to choose an algorithm before they can try LDA. I'd prefer to keep a single run() method and specify the algorithm as a String parameter.

One con of a single run() method is that users will get back an LDAModel which they will need to cast to a more specific type (if they want to use the specialization's extra functionality). I think we could eliminate this issue later on by opening up each algorithm as its own Estimator (so that LDA would become a meta-Estimator, if you will).

  1. Online LDA have several specific arguments. What's the recommended place to put them and their getter/setter, in LDA or optimizer ?

That is an issue, for sure. I'd propose:

trait Optimizer // no public API
class EMOptimizer extends Optimizer {
  // public API: getters/setters for EM-specific parameters
  // private[mllib] API: methods for learning
}
class LDA {
  def setOptimizer(optimizer: String) // takes "EM" / "Gibbs" / "online"
  def setOptimizer(optimizer: Optimizer) // takes Optimizer instance which user can configure beforehand
  def getOptimizer: Optimizer
}

For users, Optimizer classes simply store algorithm-specific parameters. Users can use the default Optimizer, or they can specify the optimizer via String (with default algorithm parameters) or via Optimizer (with configured algorithm parameters).
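
In code, the two configuration paths would look roughly like this (a sketch using the hypothetical names from the proposal above; "em" is an assumed optimizer name):

val corpus: RDD[(Long, Vector)] = ???   // (docId, termCounts) pairs

// Named optimizer with default algorithm parameters:
val model1: LDAModel = new LDA().setK(10).setOptimizer("em").run(corpus)

// Pre-configured optimizer instance, with EM-specific parameters set via its setters:
val emOpt = new EMOptimizer()
val model2: LDAModel = new LDA().setK(10).setOptimizer(emOpt).run(corpus)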

@EntilZha It might be easiest to revert to master (to make diffs easier), but you can decide. That would be great if you have time to work on it in the next couple of days, thanks. I'll be out of town (but online) Wednesday unfortunately, but I hope it goes well!

@hhbyyh
Contributor

hhbyyh commented Apr 6, 2015

Thanks for the reply. The ideas look good to me. I'll go ahead with the correctness verification.

@jkbradley
Member

@EntilZha @hhbyyh Ping. Please let me know if there are items I can help with!

Also, I thought of one more issue, which is that we'll have to make sure that the trait Optimizer API is Java-friendly. I think it can be, but we'll have to verify.

@hhbyyh
Contributor

hhbyyh commented Apr 17, 2015

Hi @jkbradley,
I've finished the correctness test against Blei's implementation; more details are in #4419.

I tried to refactor LDA as proposed and found it will be a big change (unit tests and examples included).
First, let's confirm the changes:

  1. All Optimizers will be public, since users need to be able to specify their parameters.
  2. The return value of LDA.run will become LDAModel (currently it's DistributedLDAModel).
  3. Users can first create an optimizer and then pass it to LDA.setOptimizer.
  4. All the parameters specific to one algorithm should go into its optimizer.

A few questions:
Based on the previous discussion, users can specify the algorithm by

  1. passing a String parameter to LDA.run (hard to specify algorithm-specific parameters this way), or
  2. going through setOptimizer of LDA (a little tricky to set the default Optimizer).

I think it's better to leave only one place for determining which algorithm to use, since otherwise it can be confusing and conflicting.

Another question is about the existing parameters in LDA:
Except for K, all other parameters (alpha, beta, maxIterations, seed, checkpointInterval) are unused or have different default values for Online LDA. I'm not sure if we should move all of those parameters to the EM optimizer.

Actually, I find that LDA and OnlineLDA share very few things, and it's kind of difficult to merge them together. Maybe separating OnlineLDA into another file is a better choice. (Later I'll provide an interface/example for streaming.)

@jkbradley
Member

@hhbyyh I agree with points 1-4.

One clarification. You wrote:

  1. passing a String parameter to LDA.run (hard to specify algorithm-specific parameters this way), or
  2. going through setOptimizer of LDA (a little tricky to set the default Optimizer).

Not quite. This is what I had in mind for LDA optimizer parameter setting:

def setOptimizer(value: Optimizer): LDA = ???  // set via Optimizer, constructed beforehand
def setOptimizer(value: String): LDA = ???  // set with optimizer name ("em"), using default optimizer parameters

I did not intend for us to pass extra parameters to the run() method.

I'll respond to the Online LDA-specific items in the other PR: [https://github.com//pull/4419]

asfgit pushed a commit that referenced this pull request Apr 28, 2015
… extensibility

jira: https://issues.apache.org/jira/browse/SPARK-7090

LDA was implemented with extensibility in mind. And with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms.
As Joseph Bradley jkbradley proposed in #4807 and with some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
Basically class LDA would be a common entrance for LDA computing. And each LDA object will refer to a LDAOptimizer for the concrete algorithm implementation. Users can customize LDAOptimizer with specific parameters and assign it to LDA.

Concrete changes:

1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.

2. Move EMOptimizer to the file LDAOptimizer, have it inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes in the future).
        - adjust the constructor of EMOptimizer, since all the parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
        - move the code from LDA.initialState to initialState of EMLDAOptimizer

3. Add property ldaOptimizer to LDA and its getter/setter, and EMLDAOptimizer is the default Optimizer.

4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.

Further work:
add OnlineLDAOptimizer and other possible Optimizers once ready.

Author: Yuhao Yang <[email protected]>

Closes #5661 from hhbyyh/ldaRefactor and squashes the following commits:

0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
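
For context, the merged design reads roughly like this at the call site (a sketch against the post-#5661 MLlib API; parameter values are arbitrary):

import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainWithEM(corpus: RDD[(Long, Vector)]): LDAModel = {
  val lda = new LDA()
    .setK(10)
    .setMaxIterations(50)
    .setOptimizer(new EMLDAOptimizer)   // algorithm-specific settings live on the optimizer
  lda.run(corpus)   // returns LDAModel; with EM this is concretely a DistributedLDAModel
}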
@jkbradley
Member

@EntilZha I think this PR doesn't need to be updated now that [https://github.com//pull/5661] has been merged (for JIRA [https://issues.apache.org/jira/browse/SPARK-7090]). Thank you though for this initial PR and discussion! Could you please close this PR? It will still be great if we can get Gibbs sampling in for the next release cycle.

@asfgit asfgit closed this in 555213e Apr 28, 2015
@EntilZha
Contributor Author

Commenting here and then on the ticket. If there is interest in using the Gibbs implementation I wrote for the next release, using the interface/refactor from that PR, I am open to that.

@jkbradley
Member

Definitely interest; let's coordinate on the JIRA and a new PR, especially with @witgo.
