
[SPARK-5563][mllib] LDA with online variational inference #4419

Closed
wants to merge 40 commits into apache:master from hhbyyh:ldaonline

Conversation

hhbyyh (Contributor) commented on Feb 6, 2015

JIRA: https://issues.apache.org/jira/browse/SPARK-5563
This PR contains an implementation of Online LDA based on the research of Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. The algorithm's major advantages are its compatibility with streaming data and its economical time/memory consumption, since the corpus is processed in small batches. For more details, please refer to the JIRA.

Online LDA can act as a fast option for LDA, and will be especially helpful for users who need quick results or who work with a large corpus.

Correctness test:
I have tested the current PR against https://github.com/Blei-Lab/onlineldavb, and the results are identical. I've uploaded the results and code to https://github.com/hhbyyh/LDACrossValidation.
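
(For reference, a minimal usage sketch of the API added by this PR, assuming a prepared corpus: RDD[(Long, Vector)] of (docId, termCounts). The parameter values are illustrative, not tuned recommendations.)

import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}

// corpus: RDD[(Long, Vector)] of (docId, termCounts), prepared elsewhere
val optimizer = new OnlineLDAOptimizer()
  .setTau_0(1024)
  .setKappa(0.51)
  .setMiniBatchFraction(0.05)
val ldaModel = new LDA()
  .setK(10)
  .setOptimizer(optimizer)
  .run(corpus)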

SparkQA commented on Feb 6, 2015

Test build #26895 has finished for PR 4419 at commit d640d9c.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class OnlineLDAOptimizer(

Merge remote-tracking branch 'upstream/master' into ldaonline
Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
SparkQA commented on Feb 6, 2015

Test build #26899 has finished for PR 4419 at commit 26dca1b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Feb 6, 2015

Test build #26901 has finished for PR 4419 at commit f41c5ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Feb 10, 2015

Test build #27176 has finished for PR 4419 at commit 0d0f3ee.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented on Feb 10, 2015

Test build #27177 has finished for PR 4419 at commit 3a06526.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jkbradley (Member)

@hhbyyh Thanks for the initial PR! Here are some high-level comments:

  • RDD.sliding(): This may not take much advantage of parallelism. It slides across the RDD by partitions first, meaning that only one (or a few) workers will be active on each iteration. For the batch (RDD) setting, I wonder if it would be better to sample; that would amount to stochastic gradient descent, and it would hopefully be faster given the expense of computing the gradient. That would require some testing on an actual cluster to know for sure. (See the sketch after this list.)
  • local vs. distributed models: The EM implementation supports very large vocabularies, where the topic distributions have to be distributed (the "term" vertices in the Graph). It would be nice if the online LDA could support that too. (I have heard of many use cases involving k and vocabSize large enough that the model would take many GB to store.) However, I realize that storing the model (topics) locally is helpful for efficiency if the model is small enough. Could you please sketch out how we might maintain a distributed model and the costs of doing that?
  • Returning DistributedLDAModel vs. LDAModel: It's true that online LDA should not return the current DistributedLDAModel since DistributedLDAModel has methods for returning info about the full training dataset. That makes me wonder if we should have a different algorithm API for online LDA (OnlineLDA alongside LDA). Does that sound reasonable?
  • code readability (though I know this is a WIP PR right now)
    • It will be helpful to have more comments and organization in the core optimization part of the code for reviewers to understand it.
    • Relatedly, it will be helpful to have the optimization steps (computing the gradient, computing the regularization, making the update, etc.) be separated out. The optimization framework in MLlib is not suitable for you to use yet, probably, but hopefully it will be in the future (after this PR). Separation of parts will help with those future changes.
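
(To make the first point concrete, a rough sketch contrasting the two batching strategies; corpus, batchSize, and miniBatchFraction are assumed names.)

import org.apache.spark.mllib.rdd.RDDFunctions._

// Window-based batching: walks the RDD partition by partition, so only one
// (or a few) workers tend to be busy on any given iteration.
val slidingBatches = corpus.sliding(batchSize)

// Random subsampling: each mini-batch draws from all partitions, spreading
// the work across the cluster (and effectively performing mini-batch SGD).
val miniBatch = corpus.sample(withReplacement = false, fraction = miniBatchFraction)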

SparkQA commented on Mar 2, 2015

Test build #28168 has finished for PR 4419 at commit 581c623.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hhbyyh (Contributor, Author) commented on Mar 2, 2015

@jkbradley I was on vacation for the last two weeks. I really appreciate the detailed comments, and I know how time-consuming a review can be.

  • About the batch split: I used docId % batchNumber to split documents into batchNumber batches in the new commit; see the sketch after this list. Will that work? I'm not sure I understand how stochastic gradient descent helps in this case.
  • Local vs. distributed models: Indeed, the capacity of the current implementation is limited by the local matrix (lambda: vocabSize * k < 2^31 - 1). Since online LDA doesn't need to hold the entire corpus, the number of documents is not a concern. In each seqOp of the aggregate, the matrix involved in the computation is bounded by k * ids, where ids is the number of terms in each document. So the problem is how to lift the limitation on lambda. My initial idea is to support a local matrix for now and add support for a distributed matrix in the future. I'll explore the upper limit of the current local matrix. (A rough scale estimate: 100,000 vocabulary terms * 1,000 topics, with no limit on the number of documents.)
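
(A sketch of the modulo split described in the first point; batchNumber is an assumed name.)

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Deterministically assign each document to one of batchNumber mini-batches:
// batch i holds every document whose docId % batchNumber == i.
val batches: Seq[RDD[(Long, Vector)]] = (0 until batchNumber).map { i =>
  corpus.filter { case (docId, _) => docId % batchNumber == i }
}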

I made some changes according to the last two points. I'm not sure how to fit the current version into separate optimization steps; I thought the code was specific to LDA and hard to reuse in other contexts. Is there an example I can refer to? Thanks a lot.

SparkQA commented on Mar 2, 2015

Test build #28174 has finished for PR 4419 at commit e271eb1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jkbradley (Member)

Thanks for the updates! Responding:

About the batch split: I used docId % batchNumber to split documents into batchNumber batches in the new commit. Will that work? I'm not sure I understand how stochastic gradient descent helps in this case.

That should help distribute the work; it will be good to see numbers about whether subsampling speeds things up enough. (I mentioned SGD because you could take a random sample on each iteration, rather than a deterministic sample. You wouldn't be able to use the other SGD code in MLlib, but a random sample would effectively be doing mini-batch SGD. That might be a bit better since stochasticity is usually helpful in these non-convex problems.)
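
(In code, something along these lines; a sketch, with numIterations and miniBatchFraction as assumed names, and submitMiniBatch standing in for this PR's per-batch update.)

// Each iteration draws a fresh random mini-batch rather than a fixed,
// deterministic slice -- effectively mini-batch SGD over the corpus.
for (iter <- 0 until numIterations) {
  val miniBatch = corpus.sample(withReplacement = false, fraction = miniBatchFraction)
  submitMiniBatch(miniBatch)
}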

My initial idea is to support local matrix for now and add support for distributed matrix in the future.

That sounds good. I don't think you need to implement a distributed version in this PR, but it will be good to think about to make sure we can later generalize to a distributed version without breaking APIs.

I'm not sure how to fit the current version into separate optimization steps; I thought the code was specific to LDA and hard to reuse in other contexts. Is there an example I can refer to?

There's a nice explanation in Section 2.3 of the original paper: Online Learning for Latent Dirichlet Allocation. I haven't thought carefully about whether this affects computation, but I think it'd be doable. Don't bother, though, if it makes the code harder to understand; I mainly hope it will make the code easier to understand.
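
(For reference, the core update from Section 2.3 of that paper, in the paper's notation; tau_0 and kappa correspond to setTau_0 and setKappa in this PR:

rho_t = (tau_0 + t)^(-kappa)
lambda <- (1 - rho_t) * lambda + rho_t * lambda_hat

where lambda_hat is the topic-word parameter estimate computed from the current mini-batch alone, so each mini-batch moves lambda by a decreasing step size rho_t.)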

I'll try to make another close pass soon!

hhbyyh (Contributor, Author) commented on Mar 3, 2015

How about randomSplit for the batch split?

You may also refer to the Python version at http://www.cs.princeton.edu/~mdhoffma/ to better understand the code. I tried to stick to the original paper and implementation to ensure correctness.

jkbradley (Member)

I'd recommend RDD.sample() with replacement for sampling.
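
(That is, per iteration, something like the following; the fraction is illustrative.)

val miniBatch = corpus.sample(withReplacement = true, fraction = 0.05)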

jkbradley (Member)

As far as understanding the code, it's really more for the benefit of future developers than for me. Sticking with the layout in Hoffman's code is fine with me, but I suspect we'll refactor to use general gradient-based optimization methods at some point in the future.

SparkQA commented on Apr 30, 2015

Test build #31374 has finished for PR 4419 at commit 68c2318.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

submitMiniBatch(batch)
}


Member

scala style: remove extra newline

jkbradley (Member)

@hhbyyh Thanks for the updates! Apologies for the delayed review; I just got off a flight. I just made a few tiny comments and will try to make a final pass later today or early tomorrow.


// Train a model
OnlineLDAOptimizer op = new OnlineLDAOptimizer().setTau_0(1024).setKappa(0.51)
.setGammaShape(1e40).setMiniBatchFraction(0.5);
Member

java style: 2 space indentation (everywhere)

hhbyyh (Contributor, Author) commented on May 1, 2015

Thanks, Joseph. Take your time; I'll update according to your comments first.

SparkQA commented on May 1, 2015

Test build #31496 has finished for PR 4419 at commit 54cf8da.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

hhbyyh (Contributor, Author) commented on May 1, 2015

Got a weird MiMa exception from Spark SQL; merging in master and retrying.

SparkQA commented on May 1, 2015

Test build #31500 has finished for PR 4419 at commit cf0007d.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

hhbyyh (Contributor, Author) commented on May 1, 2015

There was a MiMa exception from Spark SQL, and it has been cleared by beeafcf.
This needs a retest.

SparkQA commented on May 1, 2015

Test build #31571 has finished for PR 4419 at commit 6149ca6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer

jkbradley (Member)

@hhbyyh I'm making a final pass. I'd like to send one final clean-up PR based on viewing the generated Java/Scala docs.

Also, could you please update the PR title? Thanks!

For private vars needed for testing, I made them private and added accessors. Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter.

Change miniBatchFraction default to 0.05 to match maxIterations.

Added a little doc.

Changed end of main online LDA update code to avoid the kron() call.  Please confirm if you agree that should be more efficient (not explicitly instantiating a big matrix).
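
(A hedged sketch of the kron() avoidance described above, using Breeze; variable names are assumed rather than the exact PR code. Instead of materializing the full k x vocabSize outer product for each document, only the columns for the document's own term IDs are updated.)

import breeze.linalg.{DenseMatrix, DenseVector}

// Equivalent in effect to stat += kron(expElogThetaD, cts.t), but without
// building a dense k x vocabSize matrix: touch only the occurring columns.
def accumulate(stat: DenseMatrix[Double],
    expElogThetaD: DenseVector[Double],
    ids: Array[Int],
    cts: Array[Double]): Unit = {
  var i = 0
  while (i < ids.length) {
    stat(::, ids(i)) :+= expElogThetaD * cts(i)
    i += 1
  }
}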

Changed Gamma() to use random seed.

Scala style updates
hhbyyh changed the title from "[SPARK-5563][mllib] online lda initial checkin" to "[SPARK-5563][mllib] Add OnlineLDAOptimizer to LDA (with UT)" on May 3, 2015
hhbyyh changed the title from "[SPARK-5563][mllib] Add OnlineLDAOptimizer to LDA (with UT)" to "[SPARK-5563][mllib] LDA with online variational inference" on May 3, 2015
Various cleanups, use random seed, optimization
SparkQA commented on May 3, 2015

Test build #31674 has finished for PR 4419 at commit 1045eec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
    • class OnlineLDAOptimizer extends LDAOptimizer

hhbyyh (Contributor, Author) commented on May 3, 2015

@jkbradley PR merged. Thanks for the great help.

jkbradley (Member)

LGTM I'll go ahead and merge this into master, and we can make small fixes + add docs/examples as needed after that. Thanks very much for working with me to get online LDA in!

asfgit closed this in 3539cb7 on May 4, 2015
hhbyyh (Contributor, Author) commented on May 4, 2015

It's really been great to work with you. Thanks for walking me through the merge process. I couldn't imagine better help and review from a committer.

jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 28, 2015
JIRA: https://issues.apache.org/jira/browse/SPARK-5563
The PR contains the implementation for [Online LDA] (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) based on the research of  Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. Major advantages for the algorithm are the stream compatibility and economic time/memory consumption due to the corpus split. For more details, please refer to the jira.

Online LDA can act as a fast option for LDA, and will be especially helpful for the users who needs a quick result or with large corpus.

 Correctness test.
I have tested current PR with https://github.com/Blei-Lab/onlineldavb and the results are identical. I've uploaded the result and code to https://github.com/hhbyyh/LDACrossValidation.

Author: Yuhao Yang <[email protected]>
Author: Joseph K. Bradley <[email protected]>

Closes apache#4419 from hhbyyh/ldaonline and squashes the following commits:

1045eec [Yuhao Yang] Merge pull request apache#2 from jkbradley/hhbyyh-ldaonline2
cf376ff [Joseph K. Bradley] For private vars needed for testing, I made them private and added accessors.  Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter.
6149ca6 [Yuhao Yang] fix for setOptimizer
cf0007d [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
54cf8da [Yuhao Yang] some style change
68c2318 [Yuhao Yang] add a java ut
4041723 [Yuhao Yang] add ut
138bfed [Yuhao Yang] Merge pull request apache#1 from jkbradley/hhbyyh-ldaonline-update
9e910d9 [Joseph K. Bradley] small fix
61d60df [Joseph K. Bradley] Minor cleanups: * Update *Concentration parameter documentation * EM Optimizer: createVertices() does not need to be a function * OnlineLDAOptimizer: typos in doc * Clean up the core code for online LDA (Scala style)
a996a82 [Yuhao Yang] respond to comments
b1178cf [Yuhao Yang] fit into the optimizer framework
dbe3cff [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
15be071 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
b29193b [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
d19ef55 [Yuhao Yang] change OnlineLDA to class
97b9e1a [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e7bf3b0 [Yuhao Yang] move to seperate file
f367cc9 [Yuhao Yang] change to optimization
8cb16a6 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
62405cc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
02d0373 [Yuhao Yang] fix style in comment
f6d47ca [Yuhao Yang] Merge branch 'ldaonline' of https://github.com/hhbyyh/spark into ldaonline
d86cdec [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
a570c9a [Yuhao Yang] use sample to pick up batch
4a3f27e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e271eb1 [Yuhao Yang] remove non ascii
581c623 [Yuhao Yang] seperate API and adjust batch split
37af91a [Yuhao Yang] iMerge remote-tracking branch 'upstream/master' into ldaonline
20328d1 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline i
aa365d1 [Yuhao Yang] merge upstream master
3a06526 [Yuhao Yang] merge with new example
0dd3947 [Yuhao Yang] kMerge remote-tracking branch 'upstream/master' into ldaonline
0d0f3ee [Yuhao Yang] replace random split with sliding
fa408a8 [Yuhao Yang] ssMerge remote-tracking branch 'upstream/master' into ldaonline
45884ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s
f41c5ca [Yuhao Yang] style fix
26dca1b [Yuhao Yang] style fix and make class private
043e786 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s Conflicts: 	mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
d640d9c [Yuhao Yang] online lda initial checkin
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request Jun 12, 2015
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015