
[SPARK-5563][mllib] LDA with online variational inference #4419

Closed
wants to merge 40 commits into apache:master from hhbyyh:ldaonline

Conversation

hhbyyh (Contributor) commented on Feb 6, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-5563
This PR contains an implementation of Online LDA based on the research of Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. The algorithm's major advantages are its compatibility with streaming data and its economical time/memory consumption, since the corpus is processed in small batches. For more details, please refer to the JIRA.

Online LDA can act as a fast option for LDA, and will be especially helpful for users who need quick results or who work with a large corpus.

Correctness test:
I have tested the current PR against https://github.com/Blei-Lab/onlineldavb, and the results are identical. I've uploaded the results and code to https://github.com/hhbyyh/LDACrossValidation.
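
(For reference, a minimal usage sketch of the API added by this PR, assuming a prepared corpus: RDD[(Long, Vector)] of (docId, termCounts). The parameter values are illustrative, not tuned recommendations.)

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// corpus: RDD[(Long, Vector)] of (docId, termCounts), prepared elsewhere
val optimizer = new OnlineLDAOptimizer()
  .setTau_0(1024)
  .setKappa(0.51)
  .setMiniBatchFraction(0.05)
val ldaModel = new LDA()
  .setK(10)
  .setOptimizer(optimizer)
  .run(corpus)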

SparkQA commented on Feb 6, 2015

Test build #26895 has finished for PR 4419 at commit d640d9c.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class OnlineLDAOptimizer(

Merge remote-tracking branch 'upstream/master' into ldaonline
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
SparkQA commented on Feb 6, 2015

Test build #26899 has finished for PR 4419 at commit 26dca1b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Feb 6, 2015

Test build #26901 has finished for PR 4419 at commit f41c5ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Feb 10, 2015

Test build #27176 has finished for PR 4419 at commit 0d0f3ee.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Feb 10, 2015

Test build #27177 has finished for PR 4419 at commit 3a06526.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jkbradley (Member)

@hhbyyh Thanks for the initial PR! Here are some high-level comments:

  • RDD.sliding(): This may not take much advantage of parallelism. It slides across the RDD by partitions first, meaning that only one (or a few) workers will be active on each iteration. For the batch (RDD) setting, I wonder if it would be better to sample; that would amount to stochastic gradient descent, and it would hopefully be faster given the expense of computing the gradient. That would require some testing on an actual cluster to know for sure. (See the sketch after this list.)
  • local vs. distributed models: The EM implementation supports very large vocabularies, where the topic distributions have to be distributed (the "term" vertices in the Graph). It would be nice if the online LDA could support that too. (I have heard of many use cases involving k and vocabSize large enough that the model would take many GB to store.) However, I realize that storing the model (topics) locally is helpful for efficiency if the model is small enough. Could you please sketch out how we might maintain a distributed model and the costs of doing that?
  • Returning DistributedLDAModel vs. LDAModel: It's true that online LDA should not return the current DistributedLDAModel since DistributedLDAModel has methods for returning info about the full training dataset. That makes me wonder if we should have a different algorithm API for online LDA (OnlineLDA alongside LDA). Does that sound reasonable?
  • code readability (though I know this is a WIP PR right now)
    • It will be helpful to have more comments and organization in the core optimization part of the code for reviewers to understand it.
    • Relatedly, it will be helpful to have the optimization steps (computing the gradient, computing the regularization, making the update, etc.) be separated out. The optimization framework in MLlib is not suitable for you to use yet, probably, but hopefully it will be in the future (after this PR). Separation of parts will help with those future changes.
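
(To make the first point concrete, a rough sketch contrasting the two batching strategies; corpus, batchSize, and miniBatchFraction are assumed names.)

import org.apache.spark.mllib.rdd.RDDFunctions._

// Window-based batching: walks the RDD partition by partition, so only one
// (or a few) workers tend to be busy on any given iteration.
val slidingBatches = corpus.sliding(batchSize)

// Random subsampling: each mini-batch draws from all partitions, spreading
// the work across the cluster (and effectively performing mini-batch SGD).
val miniBatch = corpus.sample(withReplacement = false, fraction = miniBatchFraction)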

SparkQA commented on Mar 2, 2015

Test build #28168 has finished for PR 4419 at commit 581c623.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hhbyyh (Contributor, Author) commented on Mar 2, 2015

@jkbradley I was on vacation for the last two weeks. I really appreciate the detailed comments, and I know how time-consuming a review can be.

  • About the batch split: I used docId % batchNumber to split documents into batchNumber batches in the new commit; see the sketch after this list. Will that work? I'm not sure I understand how stochastic gradient descent helps in this case.
  • Local vs. distributed models: Indeed, the capacity of the current implementation is limited by the local matrix (lambda: vocabSize * k < 2^31 - 1). Since online LDA doesn't need to hold the entire corpus, the number of documents is not a concern. In each seqOp of the aggregate, the matrix involved in the computation is bounded by k * ids, where ids is the number of terms in each document. So the problem is how to lift the limitation on lambda. My initial idea is to support a local matrix for now and add support for a distributed matrix in the future. I'll explore the upper limit of the current local matrix. (A rough scale estimate: 100,000 vocabulary terms * 1,000 topics, with no limit on the number of documents.)
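
(A sketch of the modulo split described in the first point; batchNumber is an assumed name.)

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Deterministically assign each document to one of batchNumber mini-batches:
// batch i holds every document whose docId % batchNumber == i.
val batches: Seq[RDD[(Long, Vector)]] = (0 until batchNumber).map { i =>
  corpus.filter { case (docId, _) => docId % batchNumber == i }
}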

I made some changes according to the last two points. I'm not sure how to fit the current version into separate optimization steps; I thought the code was specific to LDA and hard to reuse in other contexts. Is there an example I can refer to? Thanks a lot.

SparkQA commented on Mar 2, 2015

Test build #28174 has finished for PR 4419 at commit e271eb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jkbradley (Member)

Thanks for the updates! Responding:

About the batch split: I used docId % batchNumber to split documents into batchNumber batches in the new commit. Will that work? I'm not sure I understand how stochastic gradient descent helps in this case.

That should help distribute the work; it will be good to see numbers about whether subsampling speeds things up enough. (I mentioned SGD because you could take a random sample on each iteration, rather than a deterministic sample. You wouldn't be able to use the other SGD code in MLlib, but a random sample would effectively be doing mini-batch SGD. That might be a bit better since stochasticity is usually helpful in these non-convex problems.)
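
(In code, something along these lines; a sketch, with numIterations and miniBatchFraction as assumed names, and submitMiniBatch standing in for this PR's per-batch update.)

// Each iteration draws a fresh random mini-batch rather than a fixed,
// deterministic slice -- effectively mini-batch SGD over the corpus.
for (iter <- 0 until numIterations) {
  val miniBatch = corpus.sample(withReplacement = false, fraction = miniBatchFraction)
  submitMiniBatch(miniBatch)
}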

My initial idea is to support local matrix for now and add support for distributed matrix in the future.

That sounds good. I don't think you need to implement a distributed version in this PR, but it will be good to think about to make sure we can later generalize to a distributed version without breaking APIs.

I'm not sure how to fit the current version into separate optimization steps; I thought the code was specific to LDA and hard to reuse in other contexts. Is there an example I can refer to?

There's a nice explanation in Section 2.3 of the original paper: Online Learning for Latent Dirichlet Allocation. I haven't thought carefully about whether this affects computation, but I think it'd be doable. Don't bother, though, if it makes the code harder to understand; I mainly hope it will make the code easier to understand.
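
(For reference, the core update from Section 2.3 of that paper, in the paper's notation; tau_0 and kappa correspond to setTau_0 and setKappa in this PR:

rho_t = (tau_0 + t)^(-kappa)
lambda <- (1 - rho_t) * lambda + rho_t * lambda_hat

where lambda_hat is the topic-word parameter estimate computed from the current mini-batch alone, so each mini-batch moves lambda by a decreasing step size rho_t.)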

I'll try to make another close pass soon!

hhbyyh (Contributor, Author) commented on Mar 3, 2015

How about randomSplit for the batch split?

You may also refer to the Python version at http://www.cs.princeton.edu/~mdhoffma/ to better understand the code. I tried to stick to the original paper and implementation to ensure correctness.

jkbradley (Member)

I'd recommend RDD.sample() with replacement for sampling.
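
(That is, per iteration, something like the following; the fraction is illustrative.)

val miniBatch = corpus.sample(withReplacement = true, fraction = 0.05)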

jkbradley (Member)

As far as understanding the code, it's really more for the benefit of future developers than for me. Sticking with the layout in Hoffman's code is fine with me, but I suspect we'll refactor to use general gradient-based optimization methods at some point in the future.

SparkQA commented on Apr 30, 2015

Test build #31374 has finished for PR 4419 at commit 68c2318.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

submitMiniBatch(batch)
}


Member

scala style: remove extra newline

jkbradley (Member)

@hhbyyh Thanks for the updates! Apologies for the delayed review; I just got off a flight. I just made a few tiny comments and will try to make a final pass later today or early tomorrow.


// Train a model
OnlineLDAOptimizer op = new OnlineLDAOptimizer().setTau_0(1024).setKappa(0.51)
.setGammaShape(1e40).setMiniBatchFraction(0.5);
Member

java style: 2 space indentation (everywhere)

hhbyyh (Contributor, Author) commented on May 1, 2015

Thanks, Joseph. Take your time; I'll update according to your comments first.

SparkQA commented on May 1, 2015

Test build #31496 has finished for PR 4419 at commit 54cf8da.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

hhbyyh (Contributor, Author) commented on May 1, 2015

Got a weird MiMa exception from Spark SQL; merging in master and retrying.

SparkQA commented on May 1, 2015

Test build #31500 has finished for PR 4419 at commit cf0007d.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

hhbyyh (Contributor, Author) commented on May 1, 2015

There was a MiMa exception from Spark SQL, and it has been cleared by beeafcf.
This needs a retest.

SparkQA commented on May 1, 2015

Test build #31571 has finished for PR 4419 at commit 6149ca6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer

jkbradley (Member)

@hhbyyh I'm making a final pass. I'd like to send one final clean-up PR based on viewing the generated Java/Scala docs.

Also, could you please update the PR title? Thanks!

For private vars needed for testing, I made them private and added accessors. Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter.

Change miniBatchFraction default to 0.05 to match maxIterations.

Added a little doc.

Changed end of main online LDA update code to avoid the kron() call.  Please confirm if you agree that should be more efficient (not explicitly instantiating a big matrix).
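
(A hedged sketch of the kron() avoidance described above, using Breeze; variable names are assumed rather than the exact PR code. Instead of materializing the full k x vocabSize outer product for each document, only the columns for the document's own term IDs are updated.)

import breeze.linalg.{DenseMatrix, DenseVector}

// Equivalent in effect to stat += kron(expElogThetaD, cts.t), but without
// building a dense k x vocabSize matrix: touch only the occurring columns.
def accumulate(stat: DenseMatrix[Double],
    expElogThetaD: DenseVector[Double],
    ids: Array[Int],
    cts: Array[Double]): Unit = {
  var i = 0
  while (i < ids.length) {
    stat(::, ids(i)) :+= expElogThetaD * cts(i)
    i += 1
  }
}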

Changed Gamma() to use random seed.

Scala style updates
hhbyyh changed the title from "[SPARK-5563][mllib] online lda initial checkin" to "[SPARK-5563][mllib] Add OnlineLDAOptimizer to LDA (with UT)" on May 3, 2015
hhbyyh changed the title from "[SPARK-5563][mllib] Add OnlineLDAOptimizer to LDA (with UT)" to "[SPARK-5563][mllib] LDA with online variational inference" on May 3, 2015
Various cleanups, use random seed, optimization
SparkQA commented on May 3, 2015

Test build #31674 has finished for PR 4419 at commit 1045eec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer

hhbyyh (Contributor, Author) commented on May 3, 2015

@jkbradley PR merged. Thanks for the great help.

jkbradley (Member)

LGTM I'll go ahead and merge this into master, and we can make small fixes + add docs/examples as needed after that. Thanks very much for working with me to get online LDA in!

asfgit closed this in 3539cb7 on May 4, 2015
hhbyyh (Contributor, Author) commented on May 4, 2015

It's really been great to work with you. Thanks for walking me through the merge process. I couldn't imagine better help and review from a committer.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
JIRA: https://issues.apache.org/jira/browse/SPARK-5563
The PR contains the implementation for [Online LDA] (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) based on the research of  Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. Major advantages for the algorithm are the stream compatibility and economic time/memory consumption due to the corpus split. For more details, please refer to the jira.

Online LDA can act as a fast option for LDA, and will be especially helpful for the users who needs a quick result or with large corpus.

 Correctness test.
I have tested current PR with https://github.com/Blei-Lab/onlineldavb and the results are identical. I've uploaded the result and code to https://github.com/hhbyyh/LDACrossValidation.

Author: Yuhao Yang <[email protected]>
Author: Joseph K. Bradley <[email protected]>

Closes apache#4419 from hhbyyh/ldaonline and squashes the following commits:

1045eec [Yuhao Yang] Merge pull request apache#2 from jkbradley/hhbyyh-ldaonline2
cf376ff [Joseph K. Bradley] For private vars needed for testing, I made them private and added accessors.  Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter.
6149ca6 [Yuhao Yang] fix for setOptimizer
cf0007d [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
54cf8da [Yuhao Yang] some style change
68c2318 [Yuhao Yang] add a java ut
4041723 [Yuhao Yang] add ut
138bfed [Yuhao Yang] Merge pull request apache#1 from jkbradley/hhbyyh-ldaonline-update
9e910d9 [Joseph K. Bradley] small fix
61d60df [Joseph K. Bradley] Minor cleanups: * Update *Concentration parameter documentation * EM Optimizer: createVertices() does not need to be a function * OnlineLDAOptimizer: typos in doc * Clean up the core code for online LDA (Scala style)
a996a82 [Yuhao Yang] respond to comments
b1178cf [Yuhao Yang] fit into the optimizer framework
dbe3cff [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
15be071 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
b29193b [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
d19ef55 [Yuhao Yang] change OnlineLDA to class
97b9e1a [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e7bf3b0 [Yuhao Yang] move to seperate file
f367cc9 [Yuhao Yang] change to optimization
8cb16a6 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
62405cc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
02d0373 [Yuhao Yang] fix style in comment
f6d47ca [Yuhao Yang] Merge branch 'ldaonline' of https://github.com/hhbyyh/spark into ldaonline
d86cdec [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
a570c9a [Yuhao Yang] use sample to pick up batch
4a3f27e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e271eb1 [Yuhao Yang] remove non ascii
581c623 [Yuhao Yang] seperate API and adjust batch split
37af91a [Yuhao Yang] iMerge remote-tracking branch 'upstream/master' into ldaonline
20328d1 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline i
aa365d1 [Yuhao Yang] merge upstream master
3a06526 [Yuhao Yang] merge with new example
0dd3947 [Yuhao Yang] kMerge remote-tracking branch 'upstream/master' into ldaonline
0d0f3ee [Yuhao Yang] replace random split with sliding
fa408a8 [Yuhao Yang] ssMerge remote-tracking branch 'upstream/master' into ldaonline
45884ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s
f41c5ca [Yuhao Yang] style fix
26dca1b [Yuhao Yang] style fix and make class private
043e786 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s Conflicts: 	mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
d640d9c [Yuhao Yang] online lda initial checkin
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015