SPARK-1597: Add a version of reduceByKey that takes the Partitioner as a second argument #550
Conversation
Most of our shuffle methods can take a Partitioner or a number of partitions as a second argument, but for some reason reduceByKey takes the Partitioner as a first argument: http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions. Deprecated that version and added one where the Partitioner is the second argument.
We'll need to specify the parameter types for the function passed to reduceByKey. @mateiz IMHO we should leave the method as it is, as this change would make the code ugly.
Can one of the admins verify this patch?
Ah, wow, I never knew that. So if one overload takes a Partitioner first and another takes a function first, the types are inferred, but if both take a function first, they're not? In that case we might want to change our other methods too, like cogroup and groupByKey, to take a Partitioner first. Wouldn't this problem also affect them?
@mateiz I think this only applies to anonymous functions, so it doesn't affect either cogroup or groupByKey.
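The inference problem under discussion can be reproduced without Spark. The sketch below is a hypothetical `Pairs` class standing in for `PairRDDFunctions`, with a `String` standing in for the `Partitioner` parameter. It illustrates the trade-off: once two function-first overloads of the same arity exist, the Scala 2.10 compiler Spark used at the time could no longer infer the lambda's parameter types (newer compilers are more lenient), whereas a partitioner-first variant keeps `_ + _` concise because the first argument disambiguates the overload before the lambda is typed.

```scala
object ReduceByKeyOverloads {
  // Hypothetical stand-in for PairRDDFunctions; "partitionerName: String"
  // stands in for a Partitioner parameter.
  class Pairs[K, V](data: Seq[(K, V)]) {
    def reduceByKey(func: (V, V) => V): Map[K, V] =
      data.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(func) }

    // Two function-first overloads of the same arity, as the proposed
    // reduceByKey(func, partitioner) would create next to
    // reduceByKey(func, numPartitions):
    def reduceByKey(func: (V, V) => V, numPartitions: Int): Map[K, V] =
      reduceByKey(func)
    def reduceByKey(func: (V, V) => V, partitionerName: String): Map[K, V] =
      reduceByKey(func)

    // Partitioner-first variant: the non-function first argument selects the
    // overload before the lambda is typed, so `_ + _` still infers.
    def reduceByKeyP(partitionerName: String, func: (V, V) => V): Map[K, V] =
      reduceByKey(func)
  }

  def main(args: Array[String]): Unit = {
    val p = new Pairs(Seq(("a", 1), ("a", 2), ("b", 3)))

    // Only one arity-1 overload, so the parameter types of `_ + _` infer.
    println(p.reduceByKey(_ + _))

    // With both function-first arity-2 overloads in scope, Scala 2.10
    // rejected the shorthand with "missing parameter type":
    //   p.reduceByKey(_ + _, 4)
    // Explicit parameter types were required instead:
    println(p.reduceByKey((x: Int, y: Int) => x + y, 4))

    // Partitioner-first stays concise:
    println(p.reduceByKeyP("hash", _ + _))
  }
}
```

This mirrors the diff in the review below, where `reduceByKey(_ + _, 20)` had to become `reduceByKey((x: Int, y: Int) => x + y, 20)`.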
```diff
@@ -267,7 +267,7 @@ class ReceiverTracker(ssc: StreamingContext) extends Logging {
       // Run the dummy Spark job to ensure that all slaves have registered.
       // This avoids all the receivers to be scheduled on the same node.
       if (!ssc.sparkContext.isLocal) {
-        ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
+        ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 20).collect()
```
This line is over 100 chars wide
@rxin will fix this as soon as a decision is made on whether we want to do this or not.
I never even realized we had a version of reduceByKey where the first argument is not the closure ...
I have one solution to this, although it is technically an API change, so I'm just throwing it out there for discussion. We could remove all the numPartitions: Int arguments and add an implicit conversion from Int to HashPartitioner.
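A minimal sketch of that proposal, using a simplified `Partitioner` trait and `HashPartitioner` class as stand-ins for `org.apache.spark.Partitioner` and `org.apache.spark.HashPartitioner` (the method name `shuffleWidth` is purely illustrative, not Spark API):

```scala
object ImplicitPartitioner {
  import scala.language.implicitConversions

  // Simplified stand-ins for org.apache.spark.{Partitioner, HashPartitioner}.
  trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }
  class HashPartitioner(val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = {
      val h = key.hashCode % numPartitions
      if (h < 0) h + numPartitions else h
    }
  }

  // The proposed conversion: an Int supplied where a Partitioner is expected
  // silently becomes a HashPartitioner with that many partitions.
  implicit def intToPartitioner(numPartitions: Int): Partitioner =
    new HashPartitioner(numPartitions)

  // With the conversion in scope, one Partitioner-taking method could replace
  // both the `numPartitions: Int` and `partitioner: Partitioner` overloads.
  // Hypothetical method, shown only to exercise the conversion:
  def shuffleWidth(partitioner: Partitioner): Int = partitioner.numPartitions

  def main(args: Array[String]): Unit = {
    println(shuffleWidth(20))                      // Int converted implicitly
    println(shuffleWidth(new HashPartitioner(8)))  // explicit Partitioner
  }
}
```

The objection raised next in the thread is exactly the usual cost of such conversions: a caller reading `shuffleWidth(20)` has no visible cue that an implicit Int-to-HashPartitioner conversion exists.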
@rxin +1
I'd rather not add the implicit conversion from Int to Partitioner; it would be very hard to discover on its own. Instead, maybe we can just leave this API as is. It's strange, but there's a good reason for it.
QA tests have started for PR 550. This patch merges cleanly.
QA results for PR 550: |
It sounds like the conclusion here is to close this issue then. |
This commit exists to close the following pull requests on Github: Closes apache#1328 (close requested by 'pwendell') Closes apache#2314 (close requested by 'pwendell') Closes apache#997 (close requested by 'pwendell') Closes apache#550 (close requested by 'pwendell') Closes apache#1506 (close requested by 'pwendell') Closes apache#2423 (close requested by 'mengxr') Closes apache#554 (close requested by 'joshrosen')