The current updateCounter is implemented via aggregateMessages. It creates a dense array whose length equals the number of vertices in each partition, where each element is a SparseVector. When the number of vertices in one partition is huge (consider 1B vertices), this array cannot be held in memory.
This can be solved by:
1. A better graph partitioning approach, so that the number of vertices per partition stays bounded even when the graph (input data) is very large.
2. Using aggregateByKey instead. It has several advantages over aggregateMessages:
(1). For each edge, aggregateMessages allocates a new SparseVector from the edge attribute (a plain array), plus another new SparseVector holding the result of the SparseVector + SparseVector sum.
(2). We do not use reduceByKey because the seqOp of aggregateByKey, (U, V) => U, does not require V to be the same type as U, so there is no need to build a SparseVector from the edge attribute array. Moreover, to avoid memory allocation, both seqOp and combOp are allowed to modify and return their first argument instead of creating a new U.
(3). The downside of aggregateByKey is that it requires a sort phase when sort-based shuffle is used.
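The seqOp/combOp contract described in (2) can be sketched without Spark. The snippet below is a minimal, Spark-free illustration of why (U, V) => U allows folding a raw edge-attribute array straight into a mutable accumulator, with no intermediate SparseVector per edge; all names here (SparseVec, aggregateByKeyLocal) are illustrative, not the project's actual API.

```scala
import scala.collection.mutable

// A minimal mutable sparse vector: index -> count.
type SparseVec = mutable.Map[Int, Double]

// seqOp: fold one edge attribute (a plain array of indices) into the
// accumulator in place -- no SparseVec is built from the edge attribute,
// because V (Array[Int]) need not have the same type as U (SparseVec).
def seqOp(acc: SparseVec, edgeAttr: Array[Int]): SparseVec = {
  edgeAttr.foreach(i => acc(i) = acc.getOrElse(i, 0.0) + 1.0)
  acc // returning the (mutated) first argument is explicitly allowed
}

// combOp: merge two per-partition accumulators, again in place.
def combOp(a: SparseVec, b: SparseVec): SparseVec = {
  b.foreach { case (i, v) => a(i) = a.getOrElse(i, 0.0) + v }
  a
}

// Plain-Scala stand-in for rdd.aggregateByKey(zero)(seqOp, combOp),
// just to exercise the two functions locally.
def aggregateByKeyLocal(pairs: Seq[(Long, Array[Int])]): Map[Long, SparseVec] =
  pairs.groupBy(_._1).map { case (k, vs) =>
    k -> vs.map(_._2).foldLeft(mutable.Map.empty[Int, Double]: SparseVec)(seqOp)
  }
```

With real Spark the same seqOp/combOp pair would be passed to `aggregateByKey(mutable.Map.empty[Int, Double])(seqOp, combOp)`; the local stand-in only exists to show the allocation pattern.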
The 2nd issue is solved now. We use neither aggregateMessages nor aggregateByKey; we implement the aggregation ourselves. The implementation resembles aggregateMessages, so no sorting is needed, and it works in place, so it requires little extra memory and does not create many new SparseVectors.
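A hand-rolled, in-place aggregation of the kind described above might look like the following sketch: scan a partition's edges once and fold each raw edge attribute directly into a per-destination mutable accumulator, never materializing a temporary SparseVector per edge. The names (Edge, updateCounters, dstLocalId) are hypothetical stand-ins, not the project's actual implementation.

```scala
import scala.collection.mutable

// One edge within a partition: a local destination-vertex id and a raw
// attribute array (the indices to count).
case class Edge(dstLocalId: Int, attr: Array[Int])

def updateCounters(
    numLocalVertices: Int,
    edges: Iterator[Edge]): Array[mutable.Map[Int, Double]] = {
  // One mutable accumulator per local vertex; each map stays sparse
  // because only the keys actually touched are ever stored.
  val counters = Array.fill(numLocalVertices)(mutable.Map.empty[Int, Double])
  edges.foreach { e =>
    val acc = counters(e.dstLocalId)
    // Fold the raw attribute array in place -- no intermediate
    // SparseVector objects and no sort phase, since edges are consumed
    // in whatever order the iterator yields them.
    e.attr.foreach(i => acc(i) = acc.getOrElse(i, 0.0) + 1.0)
  }
  counters
}
```

Compared with aggregateByKey, this keeps the aggregateMessages-style single pass over edges (no shuffle sort) while still mutating accumulators in place.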