The current updateCounter is implemented via aggregateMessages. It creates a dense array whose length equals the number of vertices in each partition, where each element is a SparseVector. When the number of vertices in one partition is huge (consider 1B vertices), this array cannot be held in memory.
This can be solved by:
1. A better graph partitioning approach, so that the number of vertices per partition stays bounded even when the graph (input data) is very large.
2. Using aggregateByKey instead. It has several advantages over aggregateMessages:
(1). For each edge, aggregateMessages allocates a new SparseVector from the edge attribute (a plain array), plus another new SparseVector holding the result of the SparseVector + SparseVector sum.
(2). We do not use reduceByKey because the seqOp of aggregateByKey, (U, V) => U, does not require V to be the same type as U, so there is no need to build a SparseVector from the edge attribute array. Moreover, to avoid memory allocation, both seqOp and combOp are allowed to modify and return their first argument instead of creating a new U.
(3). The downside of aggregateByKey is that it requires a sort phase when sort-based shuffle is used.
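The seqOp/combOp contract described in (2) can be sketched without Spark. The snippet below is a minimal, Spark-free illustration of why (U, V) => U allows folding a raw edge-attribute array straight into a mutable accumulator, with no intermediate SparseVector per edge; all names here (SparseVec, aggregateByKeyLocal) are illustrative, not the project's actual API.

```scala
import scala.collection.mutable

// A minimal mutable sparse vector: index -> count.
type SparseVec = mutable.Map[Int, Double]

// seqOp: fold one edge attribute (a plain array of indices) into the
// accumulator in place -- no SparseVec is built from the edge attribute,
// because V (Array[Int]) need not have the same type as U (SparseVec).
def seqOp(acc: SparseVec, edgeAttr: Array[Int]): SparseVec = {
  edgeAttr.foreach(i => acc(i) = acc.getOrElse(i, 0.0) + 1.0)
  acc // returning the (mutated) first argument is explicitly allowed
}

// combOp: merge two per-partition accumulators, again in place.
def combOp(a: SparseVec, b: SparseVec): SparseVec = {
  b.foreach { case (i, v) => a(i) = a.getOrElse(i, 0.0) + v }
  a
}

// Plain-Scala stand-in for rdd.aggregateByKey(zero)(seqOp, combOp),
// just to exercise the two functions locally.
def aggregateByKeyLocal(pairs: Seq[(Long, Array[Int])]): Map[Long, SparseVec] =
  pairs.groupBy(_._1).map { case (k, vs) =>
    k -> vs.map(_._2).foldLeft(mutable.Map.empty[Int, Double]: SparseVec)(seqOp)
  }
```

With real Spark the same seqOp/combOp pair would be passed to `aggregateByKey(mutable.Map.empty[Int, Double])(seqOp, combOp)`; the local stand-in only exists to show the allocation pattern.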
The 2nd issue is solved now. We use neither aggregateMessages nor aggregateByKey; we implement the aggregation ourselves. The implementation resembles aggregateMessages, so no sorting is needed, and it works in place, so it requires little extra memory and does not create many new SparseVectors.
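A hand-rolled, in-place aggregation of the kind described above might look like the following sketch: scan a partition's edges once and fold each raw edge attribute directly into a per-destination mutable accumulator, never materializing a temporary SparseVector per edge. The names (Edge, updateCounters, dstLocalId) are hypothetical stand-ins, not the project's actual implementation.

```scala
import scala.collection.mutable

// One edge within a partition: a local destination-vertex id and a raw
// attribute array (the indices to count).
case class Edge(dstLocalId: Int, attr: Array[Int])

def updateCounters(
    numLocalVertices: Int,
    edges: Iterator[Edge]): Array[mutable.Map[Int, Double]] = {
  // One mutable accumulator per local vertex; each map stays sparse
  // because only the keys actually touched are ever stored.
  val counters = Array.fill(numLocalVertices)(mutable.Map.empty[Int, Double])
  edges.foreach { e =>
    val acc = counters(e.dstLocalId)
    // Fold the raw attribute array in place -- no intermediate
    // SparseVector objects and no sort phase, since edges are consumed
    // in whatever order the iterator yields them.
    e.attr.foreach(i => acc(i) = acc.getOrElse(i, 0.0) + 1.0)
  }
  counters
}
```

Compared with aggregateByKey, this keeps the aggregateMessages-style single pass over edges (no shuffle sort) while still mutating accumulators in place.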