Optimizations for mapReduceTriplets and EdgePartition #3054

ankurdave · 2014-11-01T23:33:17Z

EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.
Internal iterators in mapReduceTriplets are inlined into a while loop.

These optimizations were tested to provide a 21.4% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 403 s).

Also fixes SPARK-4173.
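
A minimal sketch of the local-id idea described above, not the actual EdgePartition code. Every name here (LocalIdSketch, Partition, build, sumToDst) is hypothetical. Each global vertex id is mapped once to a dense local index when the partition is built, so the per-edge loop only indexes plain arrays, and the loop itself is a while loop rather than a chain of iterators.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of an edge partition; not the GraphX class.
object LocalIdSketch {
  final class Partition(
      val localSrcIds: Array[Int],    // per edge: local index of the source vertex
      val localDstIds: Array[Int],    // per edge: local index of the destination vertex
      val local2global: Array[Long],  // local index -> global vertex id
      val vertexAttrs: Array[Double]) // local index -> vertex attribute

  // Build a partition from (globalSrc, globalDst) pairs.
  // The hash map is consulted only here, once per edge endpoint.
  def build(edges: Array[(Long, Long)], attr: Long => Double): Partition = {
    val global2local = mutable.LinkedHashMap.empty[Long, Int]
    def localId(g: Long): Int = global2local.getOrElseUpdate(g, global2local.size)
    val src = new Array[Int](edges.length)
    val dst = new Array[Int](edges.length)
    var i = 0
    while (i < edges.length) {
      src(i) = localId(edges(i)._1)
      dst(i) = localId(edges(i)._2)
      i += 1
    }
    val l2g = new Array[Long](global2local.size)
    global2local.foreach { case (g, l) => l2g(l) = g }
    new Partition(src, dst, l2g, l2g.map(attr))
  }

  // Sum source attributes into each destination using only array indexing per edge.
  def sumToDst(p: Partition): Map[Long, Double] = {
    val acc = new Array[Double](p.vertexAttrs.length)
    val touched = new Array[Boolean](p.vertexAttrs.length)
    var i = 0
    while (i < p.localSrcIds.length) {
      val d = p.localDstIds(i)
      acc(d) += p.vertexAttrs(p.localSrcIds(i))
      touched(d) = true
      i += 1
    }
    // Translate back to global ids only once, at the end.
    p.local2global.indices.filter(k => touched(k)).map(k => p.local2global(k) -> acc(k)).toMap
  }
}
```

In this sketch the translation between global and local ids happens only at build time and when results are emitted, which is the property the description attributes the saved hash lookups to.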


SparkQA · 2014-11-01T23:40:24Z

Test build #22712 has started for PR 3054 at commit 4a566dc.

This patch merges cleanly.

SparkQA · 2014-11-02T01:00:22Z

Test build #22712 has finished for PR 3054 at commit 4a566dc.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-02T01:00:25Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22712/

rxin · 2014-11-02T06:59:52Z

graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala

-    this.withVertices(vertices.innerJoinKeepLeft(iter))
+    val newVertexAttrs = new Array[VD](vertexAttrs.length)
+    System.arraycopy(vertexAttrs, 0, newVertexAttrs, 0, vertexAttrs.length)
+    iter.foreach { kv =>


Maybe rewrite this with a while loop.
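
A sketch of what such a rewrite might look like. The diff above is truncated, so the body of the loop is an assumption: the helper name updateAttrs and the idea that iter yields (local index, new attribute) pairs are both hypothetical and used only for illustration.

```scala
import scala.reflect.ClassTag

object WhileLoopSketch {
  // Copy the attribute array, then apply updates with an explicit while loop
  // over the iterator instead of iter.foreach { kv => ... }.
  // Assumption: each element of iter is (local index, new attribute).
  def updateAttrs[VD: ClassTag](vertexAttrs: Array[VD], iter: Iterator[(Int, VD)]): Array[VD] = {
    val newVertexAttrs = new Array[VD](vertexAttrs.length)
    System.arraycopy(vertexAttrs, 0, newVertexAttrs, 0, vertexAttrs.length)
    while (iter.hasNext) {
      val kv = iter.next()
      newVertexAttrs(kv._1) = kv._2
    }
    newVertexAttrs
  }
}
```

The input array is left untouched; only the copy receives the updates.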


ankurdave · 2014-11-04T10:04:46Z

@rxin Thanks for the comments. I addressed them and made some other improvements. PTAL

SparkQA · 2014-11-04T10:10:35Z

Test build #22874 has started for PR 3054 at commit 194a2df.

This patch merges cleanly.

AmplabJenkins · 2014-11-04T10:17:18Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22873/

ankurdave · 2014-11-04T10:45:11Z

Jenkins, retest this please.

SparkQA · 2014-11-04T10:49:56Z

Test build #22876 has started for PR 3054 at commit 194a2df.

This patch merges cleanly.

SparkQA · 2014-11-04T11:35:57Z

Test build #22874 has finished for PR 3054 at commit 194a2df.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-04T11:36:00Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22874/

SparkQA · 2014-11-04T12:15:16Z

Test build #22876 has finished for PR 3054 at commit 194a2df.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-04T12:15:20Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22876/

aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:

1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages. This is more efficient, providing a 20.2% speedup on PageRank over apache#3054 (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 403 s to 322 s).
2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.

Subsumes apache#2815.
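
To illustrate the difference between the two message-passing styles, here is a toy model in plain Scala with no Spark dependency. ToyTriplet, ToyEdgeContext, iteratorStyle, and imperativeStyle are hypothetical stand-ins, not the GraphX classes; only the method names sendToSrc and sendToDst are meant to echo the EdgeContext interface. For brevity the toy accumulates into a hash map, which the real optimization avoids.

```scala
import scala.collection.mutable

// Toy model only; these are not the GraphX types.
object AggregateSketch {
  final case class ToyTriplet(srcId: Long, dstId: Long, srcAttr: Double)

  // Iterator style: the user function returns an iterator of (target, message) per edge.
  def iteratorStyle(
      edges: Array[ToyTriplet],
      mapFunc: ToyTriplet => Iterator[(Long, Double)],
      reduce: (Double, Double) => Double): Map[Long, Double] = {
    val acc = mutable.Map.empty[Long, Double]
    edges.foreach { e =>
      mapFunc(e).foreach { case (id, msg) =>
        acc(id) = acc.get(id).map(reduce(_, msg)).getOrElse(msg)
      }
    }
    acc.toMap
  }

  // Imperative style: the user function calls send methods on one reusable context.
  final class ToyEdgeContext(reduce: (Double, Double) => Double) {
    var triplet: ToyTriplet = null
    val acc = mutable.Map.empty[Long, Double]
    private def send(id: Long, msg: Double): Unit =
      acc(id) = acc.get(id).map(reduce(_, msg)).getOrElse(msg)
    def srcAttr: Double = triplet.srcAttr
    def sendToSrc(msg: Double): Unit = send(triplet.srcId, msg)
    def sendToDst(msg: Double): Unit = send(triplet.dstId, msg)
  }

  def imperativeStyle(
      edges: Array[ToyTriplet],
      sendMsg: ToyEdgeContext => Unit,
      reduce: (Double, Double) => Double): Map[Long, Double] = {
    val ctx = new ToyEdgeContext(reduce)
    var i = 0
    while (i < edges.length) {
      ctx.triplet = edges(i) // reuse the same context object for every edge
      sendMsg(ctx)
      i += 1
    }
    ctx.acc.toMap
  }
}
```

Both styles produce the same result, but the imperative one does not allocate an iterator and tuple per message. With GraphX itself, the corresponding call would look roughly like graph.aggregateMessages[Double](ctx => ctx.sendToDst(ctx.srcAttr), _ + _, TripletFields.Src), where the last argument declares which triplet fields the function reads.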

rxin · 2014-11-11T00:25:12Z

graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala

 * @param activeSet an optional active vertex set for filtering computation on the edges
 */
 private[graphx]
 class EdgePartition[
    @specialized(Char, Int, Boolean, Byte, Long, Float, Double) ED: ClassTag, VD: ClassTag](
-    val srcIds: Array[VertexId] = null,
-    val dstIds: Array[VertexId] = null,
+    val localSrcIds: Array[Int] = null,


Can we create an explicit empty constructor instead of defaulting every parameter to null? In that constructor, say that it is only needed for serialization.

Also, try to make all of these private rather than vals (just remove the val).
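
A small sketch of the shape this suggestion points at, using a hypothetical ToyEdgePartition rather than the real class. The field list is shortened and the size method is made up for illustration.

```scala
object CtorSketch {
  // Hypothetical stand-in for EdgePartition. The constructor parameters have
  // no `val`, so they are not exposed as public members of the class.
  class ToyEdgePartition(
      localSrcIds: Array[Int],
      localDstIds: Array[Int],
      data: Array[Double]) {

    /** No-arg constructor; in this sketch it exists only for serialization. */
    def this() = this(null, null, null)

    // Illustrative accessor that tolerates the empty, deserialization-only state.
    def size: Int = if (localSrcIds == null) 0 else localSrcIds.length
  }
}
```

Compared with giving every parameter a null default, the auxiliary constructor keeps the null state in one documented place.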

rxin · 2014-11-11T00:38:53Z

graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala

-      edgeIter: Iterator[Edge[ED]], includeSrc: Boolean = true, includeDst: Boolean = true)
-    : Iterator[EdgeTriplet[VD, ED]] = {
-    new ReusingEdgeTripletIterator(edgeIter, this, includeSrc, includeDst)
+  def mapReduceTriplets[A: ClassTag](


Might be better to rename this mapReduceTripletsEdgeScan, to contrast with the other index scan one.

ankurdave · 2014-11-12T05:28:22Z

Closing this in favor of #3100, which incorporates these changes.

@rxin

aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:

1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages.
2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.

Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition:

1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.
2. Internal iterators in aggregateMessages are inlined into a while loop.

In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s).

Subsumes apache#2815. Also fixes SPARK-4173.

Author: Ankur Dave <[email protected]>

Closes apache#3100 from ankurdave/aggregateMessages and squashes the following commits:

f5b65d0 [Ankur Dave] Address @rxin comments on apache#3054 and apache#3100
1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets
194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test
e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder
c85076d [Ankur Dave] Readability improvements
b567be2 [Ankur Dave] iter.foreach -> while loop
4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition

@rxin

aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:

1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages.
2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.

Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition:

1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.
2. Internal iterators in aggregateMessages are inlined into a while loop.

In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s).

Subsumes #2815. Also fixes SPARK-4173.

Author: Ankur Dave <[email protected]>

Closes #3100 from ankurdave/aggregateMessages and squashes the following commits:

f5b65d0 [Ankur Dave] Address @rxin comments on #3054 and #3100
1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets
194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test
e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder
c85076d [Ankur Dave] Readability improvements
b567be2 [Ankur Dave] iter.foreach -> while loop
4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition

(cherry picked from commit faeb41d)
Signed-off-by: Reynold Xin <[email protected]>

Optimizations for mapReduceTriplets and EdgePartition

4a566dc

1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.
2. Internal iterators in mapReduceTriplets are inlined into a while loop.

rxin reviewed Nov 2, 2014

ankurdave added 4 commits November 4, 2014 01:56

iter.foreach -> while loop

b567be2

Readability improvements

c85076d

Take activeSet in ExistingEdgePartitionBuilder

e0f8ecc

Also rename VertexPreservingEdgePartitionBuilder to ExistingEdgePartitionBuilder to better reflect its usage.

Test triplet iterator in EdgePartition serialization test

194a2df

ankurdave mentioned this pull request Nov 5, 2014

[SPARK-3936] Add aggregateMessages, which supersedes mapReduceTriplets #3100

Closed

rxin reviewed Nov 11, 2014

ankurdave closed this Nov 12, 2014

ankurdave added a commit to ankurdave/spark that referenced this pull request Nov 12, 2014

Address @rxin comments on apache#3054 and apache#3100

f5b65d0
