SPARK-1597: Add a version of reduceByKey that takes the Partitioner as a second argument #550
Conversation
Most of our shuffle methods can take a Partitioner or a number of partitions as a second argument, but for some reason reduceByKey takes the Partitioner as a first argument: http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions. Deprecated that version and added one where the Partitioner is the second argument.
We'll need to specify the parameter types for the function passed to reduceByKey. @mateiz IMHO we should leave the method as it is, as this change would make the code ugly.
Can one of the admins verify this patch?
Ah, wow, I never knew that. So if one overload takes a Partitioner first and another takes a function first, the types are inferred, but if both take a function first, they're not? In that case we might want to change our other methods too, like cogroup and groupByKey, to take a Partitioner first. Wouldn't this problem also affect them?
@mateiz I think this only applies to anonymous functions, so it doesn't affect either cogroup or groupByKey.
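The inference problem under discussion can be reproduced without Spark. The sketch below is a hypothetical `Pairs` class standing in for `PairRDDFunctions`, with a `String` standing in for the `Partitioner` parameter. It illustrates the trade-off: once two function-first overloads of the same arity exist, the Scala 2.10 compiler Spark used at the time could no longer infer the lambda's parameter types (newer compilers are more lenient), whereas a partitioner-first variant keeps `_ + _` concise because the first argument disambiguates the overload before the lambda is typed.

```scala
object ReduceByKeyOverloads {
  // Hypothetical stand-in for PairRDDFunctions; "partitionerName: String"
  // stands in for a Partitioner parameter.
  class Pairs[K, V](data: Seq[(K, V)]) {
    def reduceByKey(func: (V, V) => V): Map[K, V] =
      data.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).reduce(func) }

    // Two function-first overloads of the same arity, as the proposed
    // reduceByKey(func, partitioner) would create next to
    // reduceByKey(func, numPartitions):
    def reduceByKey(func: (V, V) => V, numPartitions: Int): Map[K, V] =
      reduceByKey(func)
    def reduceByKey(func: (V, V) => V, partitionerName: String): Map[K, V] =
      reduceByKey(func)

    // Partitioner-first variant: the non-function first argument selects the
    // overload before the lambda is typed, so `_ + _` still infers.
    def reduceByKeyP(partitionerName: String, func: (V, V) => V): Map[K, V] =
      reduceByKey(func)
  }

  def main(args: Array[String]): Unit = {
    val p = new Pairs(Seq(("a", 1), ("a", 2), ("b", 3)))

    // Only one arity-1 overload, so the parameter types of `_ + _` infer.
    println(p.reduceByKey(_ + _))

    // With both function-first arity-2 overloads in scope, Scala 2.10
    // rejected the shorthand with "missing parameter type":
    //   p.reduceByKey(_ + _, 4)
    // Explicit parameter types were required instead:
    println(p.reduceByKey((x: Int, y: Int) => x + y, 4))

    // Partitioner-first stays concise:
    println(p.reduceByKeyP("hash", _ + _))
  }
}
```

This mirrors the diff in the review below, where `reduceByKey(_ + _, 20)` had to become `reduceByKey((x: Int, y: Int) => x + y, 20)`.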
```diff
@@ -267,7 +267,7 @@ class ReceiverTracker(ssc: StreamingContext) extends Logging {
       // Run the dummy Spark job to ensure that all slaves have registered.
       // This avoids all the receivers to be scheduled on the same node.
       if (!ssc.sparkContext.isLocal) {
-        ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
+        ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 20).collect()
```
This line is over 100 chars wide
@rxin will fix this as soon as a decision is made on whether we want to do this or not.
I never even realized we had a version of reduceByKey where the first argument is not the closure ...
I have one solution to this, although it is technically an API change, so I'm just throwing it out there for discussion. We could remove all the numPartitions: Int arguments and add an implicit conversion from Int to HashPartitioner.
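A minimal sketch of that proposal, using a simplified `Partitioner` trait and `HashPartitioner` class as stand-ins for `org.apache.spark.Partitioner` and `org.apache.spark.HashPartitioner` (the method name `shuffleWidth` is purely illustrative, not Spark API):

```scala
object ImplicitPartitioner {
  import scala.language.implicitConversions

  // Simplified stand-ins for org.apache.spark.{Partitioner, HashPartitioner}.
  trait Partitioner { def numPartitions: Int; def getPartition(key: Any): Int }
  class HashPartitioner(val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = {
      val h = key.hashCode % numPartitions
      if (h < 0) h + numPartitions else h
    }
  }

  // The proposed conversion: an Int supplied where a Partitioner is expected
  // silently becomes a HashPartitioner with that many partitions.
  implicit def intToPartitioner(numPartitions: Int): Partitioner =
    new HashPartitioner(numPartitions)

  // With the conversion in scope, one Partitioner-taking method could replace
  // both the `numPartitions: Int` and `partitioner: Partitioner` overloads.
  // Hypothetical method, shown only to exercise the conversion:
  def shuffleWidth(partitioner: Partitioner): Int = partitioner.numPartitions

  def main(args: Array[String]): Unit = {
    println(shuffleWidth(20))                      // Int converted implicitly
    println(shuffleWidth(new HashPartitioner(8)))  // explicit Partitioner
  }
}
```

The objection raised next in the thread is exactly the usual cost of such conversions: a caller reading `shuffleWidth(20)` has no visible cue that an implicit Int-to-HashPartitioner conversion exists.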
@rxin +1
I'd rather not add the implicit conversion from Int to Partitioner; it would be very hard to discover on its own. Instead, maybe we can just leave this API as is. It's strange, but there's a good reason for it.
QA tests have started for PR 550. This patch merges cleanly.
QA results for PR 550: |
It sounds like the conclusion here is to close this issue then. |
This commit exists to close the following pull requests on Github: Closes apache#1328 (close requested by 'pwendell') Closes apache#2314 (close requested by 'pwendell') Closes apache#997 (close requested by 'pwendell') Closes apache#550 (close requested by 'pwendell') Closes apache#1506 (close requested by 'pwendell') Closes apache#2423 (close requested by 'mengxr') Closes apache#554 (close requested by 'joshrosen')