Merge Hadoop Into Spark #286
Conversation
This patch merges the Hadoop 0.20.2 source code into the Spark project. I've thought about this a bunch and this will provide us with several benefits:

More source code

Let's be honest, to be taken seriously as a project Spark needs to have _way more_ lines of code. Spark is currently 70,000 lines of Scala code - this patch adds 452,000 lines of XML alone (!) This will make our github stats look great!

Seamless builds

Sometimes users stumble trying to build Spark against Hadoop. Not anymore!! With Hadoop inside of Spark this won't be a problem at all. I mean, there's basically only one Hadoop version, right? So this should work for pretty much everyone. Wait, hold on, is hadoop-0.20.2 the same as hadoop-2.2.0? I'm assuming it's the same because they both have the same number of "2"s.

Your favorite old configs

This patch will give users access to some of their favorite old configs from Hadoop. Did you just figure out what `dfs.namenode.path.based.cache.block.map.allocation.percent` was?! Now you can use it in Spark!! Pining for your old friend `mapreduce.map.skip.proc.count.autoincr`... fuggedabodit - we got ya!

I plan to contribute tests and docs in a subsequent patch. Please merge this ASAP and include in Spark 0.9.1.

NB: This diff is too large for github to render. Users will have to download and play with this on their own.
+1 !!! I have been asking for this multiple times on the mailing list and finally see the light!!!
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13636/
But will this work on YARN?
The best way to have more lines than some project is to merge it!
I love hadoop-0.20.2 -- It is the best Hadoop I have ever used. Thanks @pwendell for pulling this in.
LGTM, merged!
LGTMT
Great job!
I'm OK with holding off on this for a separate JIRA, but for completeness I'd like to propose merging in CDH code as well. I believe that CDH4.2.2 would be the best choice of version, as it maintains 2-compatibility with 0.20.2 and 2.2.0.
Guys, I just realized that this includes a really cool submodule called "Map Reduce" which seems to be a generalized form of Spark! It also has around 3x the amount of code, and that's in Java (which is somewhat more concise than Scala), so I estimate it's roughly 5x as awesome as Spark. Are there plans for deprecating the Spark API in favor of "Map Reduce"? If we deprecate Spark in 0.9.1 and replace it with Map Reduce in 1.0.0, that should give users enough time to migrate to the new API (which is much simpler -- only two functions!).
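For readers who haven't met the "two functions" in question, here is a toy word count in the classic map/reduce shape. This is only an illustrative sketch, not Hadoop's actual `Mapper`/`Reducer` API; the class `TwoFunctions` and its little driver are invented for this example:

```java
import java.util.*;
import java.util.stream.*;

class TwoFunctions {
    // Function 1, "map": turn one input record into (key, value) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+")).map(w -> Map.entry(w, 1));
    }

    // Function 2, "reduce": collapse all values seen for one key into one result.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // Driver: run map over every line, group the pairs by key (the "shuffle"),
    // then run reduce on each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = lines.stream()
            .flatMap(TwoFunctions::map)
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                    Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        Map<String, Integer> result = new HashMap<>();
        grouped.forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }
}
```

Everything the framework does beyond those two functions -- splitting input, shuffling pairs, running reducers in parallel -- lives in the driver, which is exactly the part the joke glosses over.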
+1, lgtm, but only today!
Wait, so did it pass or fail Jenkins tests? Jenkins isn't saying what happened, only that the build finished. Anyway I'd be okay with this as long as you make it meet our code style guidelines.
Amazing! I've been looking forward to it for a long time!
I do not know, ah, but I'm looking forward to it.
This is a good start! Why not merge all transitive dependencies into Spark altogether? In this way we can save lots of effort on SBT/Maven, and avoid time-consuming Ivy dependency resolution. You know, ever since those .orbit things were added, OK, the 2nd paragraph is not part of the joke :-)
Yep, I find that each time I do
Is it true about this => "Wait, hold on, is hadoop-0.20.2 the same as hadoop-2.2.0? I'm assuming it's the same because they both have the same number of "2"s."
"But will this work on YARN?", I have the same question.
+1. Really terrible nightmare. If the version of some jar is 6.4, it usually cannot be downloaded in mainland China without a proxy :(
JIRA issue: [SPARK-1402](https://issues.apache.org/jira/browse/SPARK-1402)

This PR provides 3 more compression schemes for Spark SQL in-memory columnar storage:

* `BooleanBitSet`
* `IntDelta`
* `LongDelta`

Now there are 6 compression schemes in total, including the no-op `PassThrough` scheme. Also fixed a bug in PR #286: not all compression schemes are added as available schemes when accessing an in-memory column, and when a column is compressed with an unrecognised scheme, `ColumnAccessor` throws an exception.

Author: Cheng Lian <[email protected]>

Closes #330 from liancheng/moreCompressionSchemes and squashes the following commits:

1d037b8 [Cheng Lian] Fixed SPARK-1436: in-memory column byte buffer must be able to be accessed multiple times
d7c0e8f [Cheng Lian] Added test suite for IntegralDelta (IntDelta & LongDelta)
3c1ad7a [Cheng Lian] Added test suite for BooleanBitSet, refactored other test suites
44fe4b2 [Cheng Lian] Refactored CompressionScheme, added 3 more compression schemes.
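Two of the schemes above are delta-based, and the underlying idea is simple: when neighboring column values are close together, storing differences instead of absolute values yields small integers that a byte-oriented encoder can pack compactly. The class below is a made-up sketch of that idea only; `IntDeltaSketch` is not Spark's actual `IntDelta` implementation:

```java
class IntDeltaSketch {
    // Keep the first value verbatim; store every later value as the
    // difference from its predecessor.
    static int[] encode(int[] values) {
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (i == 0) ? values[i] : values[i] - values[i - 1];
        }
        return out;
    }

    // A running sum restores the original values exactly.
    static int[] decode(int[] encoded) {
        int[] out = new int[encoded.length];
        for (int i = 0; i < encoded.length; i++) {
            out[i] = (i == 0) ? encoded[i] : out[i - 1] + encoded[i];
        }
        return out;
    }
}
```

A nearly-monotonic ID column such as 10, 11, 13, 12 encodes to 10, 1, 2, -1: after the first entry, all values are small deltas, which is what makes the scheme pay off on sorted or clustered columns.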
* Replace submission v1 with submission v2.
* Address documentation changes.
* Fix documentation.
…public-clouds-job Refactor docker machine job to run a specific test set and add cleanup mechanism
…l as Spark uses Hive-1.2 (apache#286)
### What changes were proposed in this pull request?

This PR upgrades the `GenJavadoc` plugin from `0.17` to `0.18`.

### Why are the changes needed?

`0.18` includes a bug fix for Scala 2.13:

> This release fixes a bug (#286) with Scala 2.13.6 in relation with deprecated annotations in Scala sources leading to a NoSuchElementException in some cases.

https://github.com/lightbend/genjavadoc/releases/tag/v0.18

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the doc for Scala 2.13:

```
build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```

Closes #33383 from sarutak/upgrade-genjavadoc-0.18.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit ad744fb)
Signed-off-by: Dongjoon Hyun <[email protected]>