Merge Hadoop Into Spark #286
Conversation
This patch merges the Hadoop 0.20.2 source code into the Spark project. I've thought about this a bunch and this will provide us with several benefits:

More source code

Let's be honest, to be taken seriously as a project Spark needs to have _way more_ lines of code. Spark is currently 70,000 lines of Scala code - this patch adds 452,000 lines of XML alone (!) This will make our github stats look great!

Seamless builds

Sometimes users stumble trying to build Spark against Hadoop. Not anymore!! With Hadoop inside of Spark this won't be a problem at all. I mean, there's basically only one Hadoop version, right? So this should work for pretty much everyone. Wait, hold on, is hadoop-0.20.2 the same as hadoop-2.2.0? I'm assuming it's the same because they both have the same number of "2"s.

Your favorite old configs

This patch will give users access to some of their favorite old configs from Hadoop. Did you just figure out what `dfs.namenode.path.based.cache.block.map.allocation.percent` was?! Now you can use it in Spark!! Pining for your old friend `mapreduce.map.skip.proc.count.autoincr`... fuggedabodit - we got ya!

I plan to contribute tests and docs in a subsequent patch. Please merge this ASAP and include in Spark 0.9.1.

NB: This diff is too large for github to render. Users will have to download and play with this on their own.
+1 !!! I have been asking for this multiple times on the mailing list and finally see the light!!!
Merged build triggered.
Merged build started.
Merged build finished.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13636/
But will this work on YARN?
The best way to have more lines than some project is to merge it!
I love hadoop-0.20.2 -- It is the best Hadoop I have ever used. Thanks @pwendell for pulling this in.
LGTM, merged!
LGTMT
Great job!
I'm OK with holding off on this for a separate JIRA, but for completeness I'd like to propose merging in CDH code as well. I believe that CDH4.2.2 would be the best choice of version, as it maintains 2-compatibility with 0.20.2 and 2.2.0.
Guys, I just realized that this includes a really cool submodule called "Map Reduce" which seems to be a generalized form of Spark! It also has around 3x the amount of code, and that's in Java (which is somewhat more concise than Scala), so I estimate it's roughly 5x as awesome as Spark. Are there plans for deprecating the Spark API in favor of "Map Reduce"? If we deprecate Spark in 0.9.1 and replace it with Map Reduce in 1.0.0, that should give users enough time to migrate to the new API (which is much simpler -- only two functions!).
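For readers who haven't met the "two functions" in question, here is a toy word count in the classic map/reduce shape. This is only an illustrative sketch, not Hadoop's actual `Mapper`/`Reducer` API; the class `TwoFunctions` and its little driver are invented for this example:

```java
import java.util.*;
import java.util.stream.*;

class TwoFunctions {
    // Function 1, "map": turn one input record into (key, value) pairs.
    static Stream<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+")).map(w -> Map.entry(w, 1));
    }

    // Function 2, "reduce": collapse all values seen for one key into one result.
    static int reduce(String key, List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    // Driver: run map over every line, group the pairs by key (the "shuffle"),
    // then run reduce on each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = lines.stream()
            .flatMap(TwoFunctions::map)
            .collect(Collectors.groupingBy(Map.Entry::getKey,
                    Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
        Map<String, Integer> result = new HashMap<>();
        grouped.forEach((k, vs) -> result.put(k, reduce(k, vs)));
        return result;
    }
}
```

Everything the framework does beyond those two functions -- splitting input, shuffling pairs, running reducers in parallel -- lives in the driver, which is exactly the part the joke glosses over.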
+1, lgtm, but only today!
Wait, so did it pass or fail Jenkins tests? Jenkins isn't saying what happened, only that the build finished. Anyway I'd be okay with this as long as you make it meet our code style guidelines.
Amazing! I've been looking forward to it for a long time!
I do not know, ah, but I'm looking forward to it.
This is a good start! Why not merge all transitive dependencies into Spark altogether? In this way we can save lots of effort on SBT/Maven, and avoid time-consuming Ivy dependency resolution. You know, ever since those .orbit things were added, OK, the 2nd paragraph is not part of the joke :-)
Yep, I find that each time I do
Is it true about this => "Wait, hold on, is hadoop-0.20.2 the same as hadoop-2.2.0? I'm assuming it's the same because they both have the same number of "2"s."
"But will this work on YARN?", I have the same question.
+1. Really terrible nightmare. If the version of some jar is 6.4, it usually cannot be downloaded in mainland China without a proxy :(
JIRA issue: [SPARK-1402](https://issues.apache.org/jira/browse/SPARK-1402)

This PR provides 3 more compression schemes for Spark SQL in-memory columnar storage:

* `BooleanBitSet`
* `IntDelta`
* `LongDelta`

Now there are 6 compression schemes in total, including the no-op `PassThrough` scheme. Also fixed a bug in PR #286: not all compression schemes are added as available schemes when accessing an in-memory column, and when a column is compressed with an unrecognised scheme, `ColumnAccessor` throws an exception.

Author: Cheng Lian <[email protected]>

Closes #330 from liancheng/moreCompressionSchemes and squashes the following commits:

1d037b8 [Cheng Lian] Fixed SPARK-1436: in-memory column byte buffer must be able to be accessed multiple times
d7c0e8f [Cheng Lian] Added test suite for IntegralDelta (IntDelta & LongDelta)
3c1ad7a [Cheng Lian] Added test suite for BooleanBitSet, refactored other test suites
44fe4b2 [Cheng Lian] Refactored CompressionScheme, added 3 more compression schemes.
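Two of the schemes above are delta-based, and the underlying idea is simple: when neighboring column values are close together, storing differences instead of absolute values yields small integers that a byte-oriented encoder can pack compactly. The class below is a made-up sketch of that idea only; `IntDeltaSketch` is not Spark's actual `IntDelta` implementation:

```java
class IntDeltaSketch {
    // Keep the first value verbatim; store every later value as the
    // difference from its predecessor.
    static int[] encode(int[] values) {
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = (i == 0) ? values[i] : values[i] - values[i - 1];
        }
        return out;
    }

    // A running sum restores the original values exactly.
    static int[] decode(int[] encoded) {
        int[] out = new int[encoded.length];
        for (int i = 0; i < encoded.length; i++) {
            out[i] = (i == 0) ? encoded[i] : out[i - 1] + encoded[i];
        }
        return out;
    }
}
```

A nearly-monotonic ID column such as 10, 11, 13, 12 encodes to 10, 1, 2, -1: after the first entry, all values are small deltas, which is what makes the scheme pay off on sorted or clustered columns.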
* Replace submission v1 with submission v2.
* Address documentation changes.
* Fix documentation.
…public-clouds-job Refactor docker machine job to run a specific test set and add cleanup mechanism
…l as Spark uses Hive-1.2 (apache#286)
### What changes were proposed in this pull request?

This PR upgrades the `GenJavadoc` plugin from `0.17` to `0.18`.

### Why are the changes needed?

`0.18` includes a bug fix for Scala 2.13:

> This release fixes a bug (#286) with Scala 2.13.6 in relation with deprecated annotations in Scala sources leading to a NoSuchElementException in some cases.

https://github.com/lightbend/genjavadoc/releases/tag/v0.18

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Built the doc for Scala 2.13:

```
build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```

Closes #33383 from sarutak/upgrade-genjavadoc-0.18.

Authored-by: Kousuke Saruta <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit ad744fb)
Signed-off-by: Dongjoon Hyun <[email protected]>