SPARK-1597: Add a version of reduceByKey that takes the Partitioner as a second argument #550

Closed
techaddict wants to merge 1 commit

Conversation

@techaddict (Contributor)

Most of our shuffle methods can take a Partitioner or a number of partitions as a second argument, but for some reason reduceByKey takes the Partitioner as a first argument: http://spark.apache.org/docs/0.9.1/api/core/#org.apache.spark.rdd.PairRDDFunctions.
Deprecated that version and added one where the Partitioner is the second argument.
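
For reference, a sketch of the relevant signatures, paraphrased and abridged from the linked 0.9.1 scaladoc (the trait name here is mine, not Spark's):

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Paraphrased from the 0.9.1 PairRDDFunctions docs linked above;
// reduceByKey is the odd one out in taking the Partitioner first.
trait PairRDDFunctionsShape[K, V] {
  def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] // Partitioner first
  def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]       // partition count second
  def groupByKey(partitioner: Partitioner): RDD[(K, Seq[V])]                // Partitioner last, like most
}
```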

@techaddict (Contributor Author)

We'd need to specify the parameter types for the function passed to reduceByKey:
reduceByKey((x: Long, y: Long) => x + y, 10) instead of reduceByKey(_ + _, 10).
For a detailed discussion of the compiler issue causing this, see
https://groups.google.com/forum/#!topic/scala-user/Qhd3vJ2rAWM

@mateiz IMHO we should leave the method as it is, as this change would make calling code ugly.
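
A minimal, self-contained sketch of the problem (illustrative names, not Spark code; this is the behavior of the Scala 2.10 compiler Spark used at the time, and later compiler versions relaxed it in some cases):

```scala
trait Partitioner // stand-in for org.apache.spark.Partitioner

class Pairs {
  // Existing overload: partition count second.
  def reduceByKey(func: (Long, Long) => Long, numPartitions: Int): String = "by count"
  // Overload this PR adds: Partitioner second.
  def reduceByKey(func: (Long, Long) => Long, partitioner: Partitioner): String = "by partitioner"
}

object OverloadInferenceDemo extends App {
  val p = new Pairs
  // p.reduceByKey(_ + _, 10)
  // ^ fails under Scala 2.10 with "missing parameter type": once two
  //   overloads both take a function first, overload resolution gives
  //   `_ + _` no single expected type to infer its parameters from.
  println(p.reduceByKey((x: Long, y: Long) => x + y, 10)) // OK: types annotated
}
```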

@AmplabJenkins

Can one of the admins verify this patch?

@mateiz (Contributor) commented Apr 26, 2014

Ah, wow, I never knew that. So if one takes a Partitioner first and one takes a function, the types are inferred, but if both take a function first, they're not?

In that case we might want to change our other methods too, like cogroup and groupByKey, to take a Partitioner first. Wouldn't this problem also affect them?

@mateiz (Contributor) commented Apr 26, 2014

CC @rxin, @pwendell

@techaddict (Contributor Author)

@mateiz I think this only applies to anonymous functions, so it doesn't affect either cogroup or groupByKey: neither takes a closure argument whose parameter types would need inferring.

@@ -267,7 +267,7 @@ class ReceiverTracker(ssc: StreamingContext) extends Logging {
     // Run the dummy Spark job to ensure that all slaves have registered.
     // This avoids all the receivers to be scheduled on the same node.
     if (!ssc.sparkContext.isLocal) {
-      ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey(_ + _, 20).collect()
+      ssc.sparkContext.makeRDD(1 to 50, 50).map(x => (x, 1)).reduceByKey((x: Int, y: Int) => x + y, 20).collect()
Contributor (review comment on the diff)

This line is over 100 chars wide

Contributor Author

@rxin will fix this as soon as a decision is made on whether we want to do this or not.

@rxin (Contributor) commented Apr 26, 2014

I never even realized we had a version of reduceByKey where the first argument is not the closure ...

@rxin (Contributor) commented Apr 26, 2014

I have one solution to this, although it is technically an API change, so I'm just throwing it out there for discussion. We could remove all the numPartitions: Int arguments and add an implicit conversion from Int to HashPartitioner.
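
A sketch of what that could look like (object and method names are hypothetical; this approach was never adopted):

```scala
import scala.language.implicitConversions
import org.apache.spark.{HashPartitioner, Partitioner}

// Hypothetical: with the numPartitions: Int overloads removed, an implicit
// conversion lets a plain Int stand in wherever a Partitioner is expected.
object PartitionerConversions {
  implicit def intToPartitioner(numPartitions: Int): Partitioner =
    new HashPartitioner(numPartitions)
}
```

With only the Partitioner-taking overload left, reduceByKey(_ + _, 10) would type-check again: the lambda gets its expected type from the lone signature, and the 10 is converted to a HashPartitioner after that signature is selected.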

@techaddict (Contributor Author)

@rxin +1

@mateiz (Contributor) commented Apr 26, 2014

I'd rather not add the implicit conversion from Int to Partitioner; it would be very hard to discover on its own. Instead, maybe we can just leave this API as is. It's strange, but there's a good reason for it.

@SparkQA commented Aug 6, 2014

QA tests have started for PR 550. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17978/consoleFull

@SparkQA commented Aug 6, 2014

QA results for PR 550:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17978/consoleFull

@pwendell (Contributor)

It sounds like the conclusion here is to close this issue then.

techaddict closed this Sep 21, 2014
guavuslabs-builder pushed a commit to ThalesGroup/spark that referenced this pull request Sep 21, 2014
This commit exists to close the following pull requests on Github:

Closes apache#1328 (close requested by 'pwendell')
Closes apache#2314 (close requested by 'pwendell')
Closes apache#997 (close requested by 'pwendell')
Closes apache#550 (close requested by 'pwendell')
Closes apache#1506 (close requested by 'pwendell')
Closes apache#2423 (close requested by 'mengxr')
Closes apache#554 (close requested by 'joshrosen')