[SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor LDA for multiple LDA algorithms (EM+Gibbs) #4807

Closed
wants to merge 4 commits

Conversation

EntilZha
Contributor

JIRA: https://issues.apache.org/jira/browse/SPARK-5556

As discussed in that issue, it would be great to have multiple LDA algorithm options, principally EM (implemented already in #4047) and Gibbs.

Goals of PR:

  1. Refactor LDA to allow multiple algorithm options (done)
  2. Refactor Gibbs code here to this interface (mostly done): https://github.com/EntilZha/spark/tree/LDA-Refactor/mllib/src/main/scala/org/apache/spark/mllib/topicmodeling
  3. Run the same performance tests run for the EM PR for comparison (todo, initial smaller tests have been run)

At the moment, I am looking for feedback on the refactoring while working on putting the Gibbs code in.

Summary of Changes:

These traits were created to encapsulate everything about a specific implementation while still interfacing with the entry point LDA.run and with DistributedLDAModel.

private[clustering] trait LearningState {
  def next(): LearningState
  def topicsMatrix: Matrix
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
  def logLikelihood: Double
  def logPrior: Double
  def topicDistributions: RDD[(Long, Vector)]
  def globalTopicTotals: LDA.TopicCounts
  def k: Int
  def vocabSize: Int
  def docConcentration: Double
  def topicConcentration: Double
  def deleteAllCheckpoints(): Unit
}

private[clustering] trait LearningStateInitializer {
  def initialState(
      docs: RDD[(Long, Vector)],
      k: Int,
      docConcentration: Double,
      topicConcentration: Double,
      randomSeed: Long,
      checkpointInterval: Int): LearningState
}

The entirety of an LDA implementation can be captured by an object and a class which extend these traits: the LearningStateInitializer provides the method that returns the LearningState, which maintains state across iterations.

Lastly, the algorithm can be set via an enum which is pattern matched to instantiate the corresponding implementation (sketched below). My thought is that the default algorithm should be whichever performs better.
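
To make the wiring concrete, here is a rough sketch of how an implementation would plug in and how LDA.run would dispatch on the enum. Names such as GibbsLearningState, GibbsLearningStateInitializer, and LDAAlgorithm are illustrative, not code from this PR:

private[clustering] object GibbsLearningStateInitializer extends LearningStateInitializer {
  override def initialState(
      docs: RDD[(Long, Vector)],
      k: Int,
      docConcentration: Double,
      topicConcentration: Double,
      randomSeed: Long,
      checkpointInterval: Int): LearningState = {
    // Build initial topic assignments and counts, then wrap them in the state class.
    new GibbsLearningState(/* initialized structures */)
  }
}

// Inside LDA.run, the algorithm enum selects the initializer and the iteration loop is generic:
val initializer = algorithm match {
  case LDAAlgorithm.EM    => EMLearningStateInitializer
  case LDAAlgorithm.Gibbs => GibbsLearningStateInitializer
}
var state = initializer.initialState(
  docs, k, docConcentration, topicConcentration, randomSeed, checkpointInterval)
var iter = 0
while (iter < maxIterations) {
  state = state.next()
  iter += 1
}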

Gibbs Implementation

Old design doc is here:
Primary Gibbs algorithm from here (mostly notation/math, GraphX based, not table based): http://www.cs.ucsb.edu/~mingjia/cs240/doc/273811.pdf
Implements FastLDA from here: http://www.ics.uci.edu/~newman/pubs/fastlda.pdf

Specific Points for Feedback

  1. Naming is hard, and I'm not sure if the traits are named appropriately.
  2. Similarly, I am reasonably familiar with the Scala type system, but perhaps there are some ninja tricks I don't know about that would be helpful.
  3. General interface/cleanliness.
  4. Should the LearningState traits, etc., go within LDA? I think so; thoughts?
  5. Anything else; I'm also learning here.

@mengxr
Contributor

mengxr commented Feb 27, 2015

add to whitelist

@mengxr
Contributor

mengxr commented Feb 27, 2015

ok to test

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28051 has finished for PR 4807 at commit 34d5853.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LearningStateInitializer
    • class EMLearningState(optimizer: EMOptimizer) extends LearningState

@SparkQA

SparkQA commented Feb 27, 2015

Test build #28052 has finished for PR 4807 at commit e895f38.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@witgo
Contributor

witgo commented Mar 10, 2015

@EntilZha @mengxr Can this branch be merged into master?
I want to merge this PR into LightLDA and lda_Gibbs.

@EntilZha
Contributor Author

As is, it can be merged (as far as work on refactoring goes). I had actually considered having separate PRs for the Refactor and Gibbs anyway.

@witgo
Contributor

witgo commented Mar 11, 2015

@EntilZha thx.
@mengxr what do you think?

private[clustering] trait LearningState {
def next(): LearningState
def topicsMatrix: Matrix
def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
Contributor

This is not necessary, right? Should be removed from the LearningState class.

Contributor Author

Why is it not necessary? The LDASuite, which contains the Distributed/Local models, calls it. How they are created is up to the specific implementation of LDA. Could you be more specific about why it's not necessary?

Contributor

Different implementations can be hidden by topicsMatrix, right?

Contributor Author

The reason for a separate method is twofold. First, although you could calculate it from topicsMatrix in theory, topicsMatrix could be very large (too large to fit in driver memory, as the docs warn). describeTopics is intended to give the implementation an interface for extracting only the top maxTermsPerTopic terms per topic. That is less likely to run the driver out of memory, and it keeps the computation of the top n terms distributed.
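
To illustrate (my own sketch, not code from this PR or MLlib): the top terms per topic can be selected with a bounded per-topic candidate list on the executors, so only about k * maxTermsPerTopic entries ever reach the driver. The termWeights layout below is a hypothetical simplification of whatever per-term state an implementation keeps.

import org.apache.spark.rdd.RDD

// termWeights: for each term index, its weight in each of the k topics.
def topTermsPerTopic(
    termWeights: RDD[(Int, Array[Double])],
    k: Int,
    maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])] = {
  val zero = Array.fill(k)(List.empty[(Int, Double)])
  val top = termWeights.aggregate(zero)(
    (acc, tw) => {
      val (term, weights) = tw
      var t = 0
      while (t < k) {
        // Keep only the best maxTermsPerTopic candidates per topic within each partition.
        acc(t) = ((term, weights(t)) :: acc(t)).sortBy(-_._2).take(maxTermsPerTopic)
        t += 1
      }
      acc
    },
    (a, b) => Array.tabulate(k)(t => (a(t) ++ b(t)).sortBy(-_._2).take(maxTermsPerTopic))
  )
  top.map { termList =>
    (termList.map(_._1).toArray, termList.map(_._2).toArray)
  }
}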

Contributor

How about def topicsMatrix: Matrix => def termTopicDistributions: RDD[(Long, Vector)]?

def topicDistributions: RDD[(Long, Vector)] => def docTopicDistributions: RDD[(Long,Vector)]?

Contributor Author

Probably @jkbradley can weigh in here. I think both changes seem reasonable; the Matrix could then be computed from the RDD. If there is agreement, I can make the change on the PR.
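
For what it's worth, a minimal sketch of "compute the Matrix from the RDD" under the proposed renaming (method and parameter names are hypothetical; it assumes the full vocabSize x k matrix fits in driver memory):

import org.apache.spark.mllib.linalg.{DenseMatrix, Matrix, Vector}
import org.apache.spark.rdd.RDD

def toTopicsMatrix(
    termTopicDistributions: RDD[(Long, Vector)],
    vocabSize: Int,
    k: Int): Matrix = {
  // Collect the per-term topic distributions and pack them into a local,
  // column-major vocabSize x k matrix with entry (term, topic).
  val values = new Array[Double](vocabSize * k)
  termTopicDistributions.collect().foreach { case (term, dist) =>
    val row = dist.toArray
    var topic = 0
    while (topic < k) {
      values(topic * vocabSize + term.toInt) = row(topic)
      topic += 1
    }
  }
  new DenseMatrix(vocabSize, k, values)
}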

@witgo
Contributor

witgo commented Mar 12, 2015

@jkbradley, @mengxr Do you have time to take a look at this?

@jkbradley
Member

I apologize for the lack of response! I'm going to try to make a pass ASAP.

@EntilZha
Contributor Author

EntilZha commented Apr 1, 2015

Sounds good. I think it's reasonable that this PR only includes the refactoring, not Gibbs. Then we can evaluate LightLDA vs. FastLDA and choose which one makes sense. If the changes look good, I will go ahead and make the changes proposed by @witgo.

@jkbradley
Member

@EntilZha I made a quick pass to get a sense of the structure. My main comment is that these changes seem to mix the concepts of algorithm and model together. I think that's why @witgo was confused about putting describeTopics within the learning state; that confused me a lot too. You are right that we need some changes to the code to support other algorithms, but I'd like to keep the separation between algorithms and models. Here's what I'd propose:

Models:

  • abstract class LDAModel
  • class LocalLDAModel extends LDAModel
    • This should be the de facto local LDAModel. If algorithms store extra info with a local model, they can extend this class.
  • trait WithTrainingResults
    • This trait indicates that a model stores results for the whole training corpus. This is important to store, of course, since recomputing it will not necessarily get the same results. But it will also be useful to allow users to throw out this extra info if they don't want to store it.
  • class EMDistributedLDAModel extends LDAModel with WithTrainingResults
    • This (and other extensions of LDAModel) can be converted to a LocalLDAModel.
  • class GibbsDistributedLDAModel extends LDAModel with WithTrainingResults
  • Notes
    • I'm not listing a DistributedLDAModel trait since I don't think it would share anything across different algorithms' models.

For algorithms, I like keeping everything in class LDA. That can make calls out to Optimizers (1 for each algorithm), and the Optimizers can return an instance of LDAModel. I'd eliminate LearningState and instead put everything in Optimizer, including initialization. I don't see a need for multiple classes since they belong under the same concept.
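
To make the shapes concrete, here's a rough skeleton of that hierarchy (a sketch for discussion only; constructors and most bodies are stubbed out, and none of this is final code):

import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.rdd.RDD

abstract class LDAModel {
  def k: Int
  def vocabSize: Int
  def topicsMatrix: Matrix
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
}

// Models that additionally keep per-document results for the training corpus.
trait WithTrainingResults {
  def logLikelihood: Double
  def logPrior: Double
  def topicDistributions: RDD[(Long, Vector)]
}

// The de facto local model; algorithm-specific local models can extend it.
class LocalLDAModel(val topicsMatrix: Matrix) extends LDAModel {
  def k: Int = topicsMatrix.numCols
  def vocabSize: Int = topicsMatrix.numRows
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])] =
    throw new NotImplementedError("omitted in this sketch")
}

// Each algorithm's distributed model mixes in WithTrainingResults and can be
// converted to a LocalLDAModel.
abstract class EMDistributedLDAModel extends LDAModel with WithTrainingResults {
  def toLocal: LocalLDAModel = new LocalLDAModel(topicsMatrix)
}

abstract class GibbsDistributedLDAModel extends LDAModel with WithTrainingResults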

I'm sorry I took so long to respond, but I'll keep up to date from here on out. I hadn't realized quite so much was blocking on this PR.

Also, I know I'm suggesting significant changes to the PR, but it should actually require fewer changes to master and still allow multiple algorithms.

CC: @hhbyyh since you have investment in this via [https://github.com//pull/4419]. I believe OnlineLDA could fit under these abstractions.

@hhbyyh
Contributor

hhbyyh commented Apr 3, 2015

@jkbradley Thanks for the proposal and it looks reasonable.

Just two minor things that aren't clear:

  1. Should different algorithms have different entry points in LDA, like runGibbs, runOnline, runEM? I kind of like it, as the separation looks simple and clear.
  2. Online LDA has several specific arguments. What's the recommended place to put them and their getters/setters: in LDA or in the optimizer?

I'm good with the other parts. Thanks.

@EntilZha
Contributor Author

EntilZha commented Apr 3, 2015

Just double checking, your suggestion would be to revert to master, implement those general changes, then commit/push the modified branch?

The primary reason I did it this way was to refactor/abstract along the method boundaries that exist right now, but as you noted, it does mix model and algorithm. I like your approach of extending the abstract class with a trait. I haven't taken much time to work on it, but I could do that over the next couple of days. I also plan on being at Databricks on Wednesday for the training if you want to chat then.

@jkbradley
Member

Here's a proposal. Let me know what you think!

@hhbyyh

  1. Should different algorithms have different entrance in LDA, like runGibbs, runOnline, runEM? I kinda like it as the separation looks simple and clear.

Multiple run methods do make that separation clearer, but they also force beginner users (who don't know what these algorithms are) to choose an algorithm before they can try LDA. I'd prefer to keep a single run() method and specify the algorithm as a String parameter.

One con of a single run() method is that users will get back an LDAModel which they will need to cast to a more specific type (if they want to use the specialization's extra functionality). I think we could eliminate this issue later on by opening up each algorithm as its own Estimator (so that LDA would become a meta-Estimator, if you will).

  1. Online LDA have several specific arguments. What's the recommended place to put them and their getter/setter, in LDA or optimizer ?

That is an issue, for sure. I'd propose:

trait Optimizer // no public API
class EMOptimizer extends Optimizer {
  // public API: getters/setters for EM-specific parameters
  // private[mllib] API: methods for learning
}
class LDA {
  def setOptimizer(optimizer: String) // takes "EM" / "Gibbs" / "online"
  def setOptimizer(optimizer: Optimizer) // takes Optimizer instance which user can configure beforehand
  def getOptimizer: Optimizer
}

For users, Optimizer classes simply store algorithm-specific parameters. Users can use the default Optimizer, or they can specify the optimizer via String (with default algorithm parameters) or via Optimizer (with configured algorithm parameters).
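
In code, the two configuration paths would look roughly like this (a sketch using the hypothetical names from the proposal above; "em" is an assumed optimizer name):

val corpus: RDD[(Long, Vector)] = ???   // (docId, termCounts) pairs

// Named optimizer with default algorithm parameters:
val model1: LDAModel = new LDA().setK(10).setOptimizer("em").run(corpus)

// Pre-configured optimizer instance, with EM-specific parameters set via its setters:
val emOpt = new EMOptimizer()
val model2: LDAModel = new LDA().setK(10).setOptimizer(emOpt).run(corpus)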

@EntilZha It might be easiest to revert to master (to make diffs easier), but you can decide. That would be great if you have time to work on it in the next couple of days, thanks. I'll be out of town (but online) Wednesday unfortunately, but I hope it goes well!

@hhbyyh
Contributor

hhbyyh commented Apr 6, 2015

Thanks for the reply. The ideas look good to me. I'll go ahead with the correctness verification.

@jkbradley
Member

@EntilZha @hhbyyh Ping. Please let me know if there are items I can help with!

Also, I thought of one more issue, which is that we'll have to make sure that the trait Optimizer API is Java-friendly. I think it can be, but we'll have to verify.

@hhbyyh
Contributor

hhbyyh commented Apr 17, 2015

Hi @jkbradley,
I've finished the correctness test against Blei's implementation; more details are in #4419.

I tried to refactor LDA as proposed and found it will be a big change (unit tests and examples included).
First, let's confirm the changes:

  1. All Optimizers will be public, since users need to be able to specify their parameters.
  2. The return value of LDA.run will become LDAModel (currently it's DistributedLDAModel).
  3. Users can first create an optimizer and then pass it to LDA.setOptimizer.
  4. All the parameters specific to one algorithm should go into its optimizer.

A few questions:
Based on the previous discussion, users can specify the algorithm by

  1. passing a String parameter to LDA.run (hard to specify algorithm-specific parameters this way), or
  2. going through setOptimizer of LDA (a little tricky to set the default Optimizer).

I think it's better to leave only one place for determining which algorithm to use, since otherwise it can be confusing and conflicting.

Another question is about the existing parameters in LDA:
Except for K, all other parameters (alpha, beta, maxIterations, seed, checkpointInterval) are unused or have different default values for Online LDA. I'm not sure if we should move all of those parameters to the EM optimizer.

Actually, I find that LDA and OnlineLDA share very few things, and it's kind of difficult to merge them together. Maybe separating OnlineLDA into another file is a better choice. (Later I'll provide an interface/example for streaming.)

@jkbradley
Member

@hhbyyh I agree with points 1-4.

One clarification. You wrote:

  1. passing a String parameter to LDA.run (hard to specify algorithm-specific parameters this way), or
  2. going through setOptimizer of LDA (a little tricky to set the default Optimizer).

Not quite. This is what I had in mind for LDA optimizer parameter setting:

def setOptimizer(value: Optimizer): LDA = ???  // set via Optimizer, constructed beforehand
def setOptimizer(value: String): LDA = ???  // set with optimizer name ("em"), using default optimizer parameters

I did not intend for us to pass extra parameters to the run() method.

I'll respond to the Online LDA-specific items in the other PR: [https://github.com//pull/4419]

asfgit pushed a commit that referenced this pull request Apr 28, 2015
… extensibility

jira: https://issues.apache.org/jira/browse/SPARK-7090

LDA was implemented with extensibility in mind. And with the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms.
As Joseph Bradley jkbradley proposed in #4807 and with some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
Basically class LDA would be a common entrance for LDA computing. And each LDA object will refer to a LDAOptimizer for the concrete algorithm implementation. Users can customize LDAOptimizer with specific parameters and assign it to LDA.

Concrete changes:

1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.

2. Move EMOptimizer to the file LDAOptimizer, have it inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes in the future).
        - adjust the constructor of EMOptimizer, since all the parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
        - move the code from LDA.initialState to initialState of EMLDAOptimizer

3. Add property ldaOptimizer to LDA and its getter/setter, and EMLDAOptimizer is the default Optimizer.

4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.

Further work:
add OnlineLDAOptimizer and other possible Optimizers once ready.

Author: Yuhao Yang <[email protected]>

Closes #5661 from hhbyyh/ldaRefactor and squashes the following commits:

0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
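
For context, the merged design reads roughly like this at the call site (a sketch against the post-#5661 MLlib API; parameter values are arbitrary):

import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

def trainWithEM(corpus: RDD[(Long, Vector)]): LDAModel = {
  val lda = new LDA()
    .setK(10)
    .setMaxIterations(50)
    .setOptimizer(new EMLDAOptimizer)   // algorithm-specific settings live on the optimizer
  lda.run(corpus)   // returns LDAModel; with EM this is concretely a DistributedLDAModel
}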
@jkbradley
Member

@EntilZha I think this PR doesn't need to be updated now that [https://github.com//pull/5661] has been merged (for JIRA [https://issues.apache.org/jira/browse/SPARK-7090]). Thank you though for this initial PR and discussion! Could you please close this PR? It will still be great if we can get Gibbs sampling in for the next release cycle.

@asfgit asfgit closed this in 555213e Apr 28, 2015
@EntilZha
Contributor Author

Commenting here and then on the ticket. If there is interest in using the Gibbs implementation I wrote for the next release, using the interface/refactor from that PR, I am open to that.

@jkbradley
Member

Definitely interest; let's coordinate on the JIRA and a new PR, especially with @witgo.
