
[SPARK-7481] [build] Add spark-cloud module to pull in object store access, + documentation #12004

Closed

Conversation

@steveloughran (Contributor) commented Mar 28, 2016

What changes were proposed in this pull request?

Add a new cloud module and Maven profile to pull in object store support from the hadoop-openstack, hadoop-aws and hadoop-azure JARs, along with their dependencies.

As a result, it restores s3n:// access to S3 and adds its s3a:// replacement, OpenStack swift:// and Azure wasb://.

The patch defines a new module, spark-cloud, with transitive dependencies on the Amazon and Azure JARs. The Spark assembly module depends on this module to pull the relevant Hadoop and dependent JARs into SPARK_HOME/jars. The module is only included in a build if the cloud profile is set; there's an implicit dependency on the hadoop-2.7 profile too, because that is where the various hadoop-$CLOUD modules are imported.

There's a documentation page, cloud_integration.md, which covers the details of using Spark with object stores: the API level, the different object stores, limitations and recommended tuning options. All the examples are in Scala. The docs aim to be an introduction that points at the Hadoop documentation rather than duplicating text which would age quickly across versions.
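
To give a flavour of the kind of Scala example that page contains, here is a minimal, illustrative snippet (not lifted from the docs themselves); the bucket and object path are placeholders, and it assumes the relevant connector JARs and credentials are on the classpath:

```scala
// Illustrative only: count the lines of a file stored in S3 via the s3a connector.
// "example-bucket" and the object path are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object ObjectStoreReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ObjectStoreRead").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.textFile("s3a://example-bucket/data/scene_list.gz")
      println(s"line count = ${lines.count()}")
    } finally {
      sc.stop()
    }
  }
}
```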

Having an explicit module lets downstream applications declare a dependency on spark-cloud and get object store access. This is used by the integration tests, which therefore implicitly verify that the declared dependencies are correct.
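
As an illustration of what that downstream declaration might look like in an sbt build (the artifact id and version here are assumptions based on the module name used in this PR, not anything final):

```scala
// Hypothetical build.sbt fragment for a downstream project.
// The artifact name ("spark-cloud") and version are illustrative placeholders.
libraryDependencies += "org.apache.spark" %% "spark-cloud" % "2.2.0-SNAPSHOT"
```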

There is no import or coverage of Google Cloud Storage's gs:// connector; that's not an ASF project, and we've actually had some problems with gs. At the time of writing it hasn't implemented the standard Hadoop FS contract tests, so it's not clear what state it is in. That could be future work.

There is a mention of, but no details on, Amazon EMR's s3:// client. It essentially says "Amazon's code, their problem", which is the case.

How was this patch tested?

There are no functional tests in this module; it is left to downstream projects to implement integration tests. I have such tests in https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples

The module has been tested against: Hadoop 2.7.3, branch-2.8 (2.8.0-SNAPSHOT), branch-2 (2.9.0-SNAPSHOT) and Hadoop trunk.

Tested endpoints: AWS S3 US-East, AWS S3 Ireland, Azure (US and EU), and Rackspace Swift US and EU.

@steveloughran (Contributor Author)

Note that as this patch is playing with the Maven build and the hadoop-2.6 and hadoop-2.7 profiles, the SparkQA builds aren't going to pick up on much here.

cloud/pom.xml Outdated
<profiles>

<!--
This profile is enabled automatically by the sbt built. It changes the scope for the guava
Contributor

Typo: By the sbt build.

Contributor Author

Funny: that was a comment I lifted along with the dependency, cargo-cult style; it'll need fixing in the original too.

@SparkQA commented Mar 28, 2016

Test build #54333 has finished for PR 12004 at commit 5e9cfbe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

Test failures are in Hive; unrelated.

@SparkQA commented Mar 29, 2016

Test build #54454 has finished for PR 12004 at commit 72b3548.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DependencyCheckSuite extends SparkFunSuite

@steveloughran (Contributor Author)

Build failing: the SBT build needs to make the spark/cloud module conditional on Hadoop 2.6+.

@steveloughran steveloughran changed the title [SPARK-7481][build][WIP] Add Hadoop 2.6+ profile to pull in object store FS accesors [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark-cloud module to pull in object store access, plus tests Mar 30, 2016
@SparkQA commented Mar 30, 2016

Test build #54525 has finished for PR 12004 at commit 6beafb5.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 3, 2016

Test build #54803 has finished for PR 12004 at commit 6487d93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 15, 2016

Test build #55916 has finished for PR 12004 at commit 105de0b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 18, 2016

Test build #56092 has finished for PR 12004 at commit ce48b8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56360 has finished for PR 12004 at commit 8845af0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56390 has finished for PR 12004 at commit 2fca815.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

The latest version of this does, among other things, call FileSystem.toString() after operations. In HADOOP-13028, along with seek optimisation, S3AFileSystem.toString() now dumps all the statistics to date. This means that the aggregate state of all test runs is displayed; if you run a specific test standalone, you can see the stats purely for that test.
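
A minimal sketch of how those statistics can be surfaced from a shell or test, assuming an active SparkContext `sc` and the s3a connector on the classpath (the bucket is the public landsat-pds one used in the log below):

```scala
// Fetch the (cached) S3A filesystem instance and print its toString(),
// which with HADOOP-13028 includes the accumulated statistics.
import java.net.URI
import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(new URI("s3a://landsat-pds/"), sc.hadoopConfiguration)
println(s"Filesystem statistics $fs")
```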

Here's a test run with the Maven args -Phadoop-2.7 -DwildcardSuites=org.apache.spark.cloud.s3.S3aIOSuite -Dcloud.test.configuration.file=../cloud.xml -Dhadoop.version=2.9.0-SNAPSHOT -Dtest.method.keys=CSVgz:

2016-04-26 14:32:17,104 INFO  scheduler.TaskSetManager (Logging.scala:logInfo(54)) - Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 5261 bytes)
2016-04-26 14:32:17,105 INFO  executor.Executor (Logging.scala:logInfo(54)) - Running task 0.0 in stage 0.0 (TID 0)
2016-04-26 14:32:17,111 INFO  rdd.HadoopRDD (Logging.scala:logInfo(54)) - Input split: s3a://landsat-pds/scene_list.gz:0+20430493
2016-04-26 14:32:17,285 INFO  compress.CodecPool (CodecPool.java:getDecompressor(181)) - Got brand-new decompressor [.gz]
2016-04-26 14:32:21,724 INFO  executor.Executor (Logging.scala:logInfo(54)) - Finished task 0.0 in stage 0.0 (TID 0). 2643 bytes result sent to driver
2016-04-26 14:32:21,727 INFO  scheduler.TaskSetManager (Logging.scala:logInfo(54)) - Finished task 0.0 in stage 0.0 (TID 0) in 4625 ms on localhost (1/1)
2016-04-26 14:32:21,727 INFO  scheduler.TaskSchedulerImpl (Logging.scala:logInfo(54)) - Removed TaskSet 0.0, whose tasks have all completed, from pool 
2016-04-26 14:32:21,728 INFO  scheduler.DAGScheduler (Logging.scala:logInfo(54)) - ResultStage 0 (count at S3aIOSuite.scala:127) finished in 4.626 s
2016-04-26 14:32:21,728 INFO  scheduler.DAGScheduler (Logging.scala:logInfo(54)) - Job 0 finished: count at S3aIOSuite.scala:127, took 4.636417 s
2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) -  size of s3a://landsat-pds/scene_list.gz = 464105 rows read in 4815885000 nS
2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) - Filesystem statistics S3AFileSystem{uri=s3a://landsat-pds, workingDir=s3a://landsat-pds/user/stevel, partSize=104857600, enableMultiObjectsDelete=true, multiPartThreshold=2147483647, statistics {40864879 bytes read, 7786 bytes written, 110 read ops, 0 large read ops, 26 write ops}, metrics {{Context=S3AFileSystem} {FileSystemId=bc5db77d-e17d-41bb-88ab-44b26cf3eda4-landsat-pds} {fsURI=s3a://landsat-pds/scene_list.gz} {files_created=0} {files_copied=0} {files_copied_bytes=0} {files_deleted=0} {directories_created=0} {directories_deleted=0} {ignored_errors=0} {streamForwardSeekOperations=0} {streamCloseOperations=2} {streamBytesSkippedOnSeek=0} {streamReadOperations=2821} {streamReadExceptions=0} {streamAborted=0} {streamBackwardSeekOperations=0} {streamClosed=2} {streamOpened=2} {streamSeekOperations=0} {streamBytesRead=40860986} {streamReadOperationsIncomplete=2821} {streamReadFullyOperations=0} }}
2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) - 

@steveloughran (Contributor Author)

Oh, and there's an initial documentation page on Spark + cloud infrastructure, which tries to make clear that object stores are not real filesystems.

@SparkQA commented Apr 26, 2016

Test build #56998 has finished for PR 12004 at commit 8926acb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

Note that as this module only builds on Hadoop >= 2.6, Jenkins won't be compiling it. The tests are designed to skip if no configuration file pointing at cloud infrastructure has been provided.

@SparkQA commented Apr 26, 2016

Test build #57003 has finished for PR 12004 at commit 4e4e941.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 4, 2016

Test build #57782 has finished for PR 12004 at commit 70065dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

For anyone trying to run these tests: they'll need a test XML file and to refer to it from the build:

mvn test -Phadoop-2.6 -Dcloud.test.configuration.file=../cloud.xml 

The referenced file uses XInclude to pull in the AWS credentials, which I keep a long way away from SCM-managed directories.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~  or more contributor license agreements.  See the NOTICE file
  ~  distributed with this work for additional information
  ~  regarding copyright ownership.  The ASF licenses this file
  ~  to you under the Apache License, Version 2.0 (the
  ~  "License"); you may not use this file except in compliance
  ~  with the License.  You may obtain a copy of the License at
  ~
  ~       http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~  Unless required by applicable law or agreed to in writing, software
  ~  distributed under the License is distributed on an "AS IS" BASIS,
  ~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ~  See the License for the specific language governing permissions and
  ~  limitations under the License.
  -->

<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude"
    href="file:///home/stevel/.aws/keys.xml"/>

  <property>
    <name>aws.tests.enabled</name>
    <value>true</value>
  </property>

  <property>
    <name>s3a.test.uri</name>
    <value>s3a://test-eu1</value>
  </property>
</configuration>

All the test suites are designed to run only if the relevant enabled flag is set; this is why there's a new method for declaring tests: ctest(key: String, summary: String, detail: String)(testFun: => Unit): Unit

These tests are not only conditional on the suite being enabled; each also has a key which can be explicitly named from the build via the test.method.keys attribute. This allows individual test methods to be selected in a way the current Maven Surefire runner doesn't support; given how long individual tests can take to run, this feature is invaluable during iterative development.
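
A sketch of what such a helper could look like follows; this is an assumption about its shape, not the actual implementation (which lives in the downstream test module). `suiteEnabled` and `requestedKeys` stand in for values derived from the configuration file and the test.method.keys property. The real usage example from the suite appears after it.

```scala
import org.scalatest.FunSuite

// Hypothetical: register a test only when the suite is enabled and, if specific
// keys were requested, only when this test's key is among them; otherwise ignore it.
trait ConditionalTests extends FunSuite {
  protected def suiteEnabled: Boolean
  protected def requestedKeys: Set[String]

  // `detail` is accepted to match the signature described above, but unused in this sketch.
  protected def ctest(key: String, summary: String, detail: String)
      (testFun: => Unit): Unit = {
    if (suiteEnabled && (requestedKeys.isEmpty || requestedKeys.contains(key))) {
      test(s"$key: $summary") { testFun }
    } else {
      ignore(s"$key: $summary") { testFun }
    }
  }
}
```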

  ctest("CSVgz", "Read compressed CSV",
    "Read compressed CSV files through the spark context") {
    val source = SceneList
    sc = new SparkContext("local", "test", newSparkConf(source))
    val sceneInfo = getFS(source).getFileStatus(source)
    logInfo(s"Compressed size = ${sceneInfo.getLen}")
    validateCSV(sc, source)
    logInfo(s"Filesystem statistics ${getFS(source)}")
  }

For best performance, you need a build of Hadoop which has as much of HADOOP-11694 applied as possible, especially HADOOP-12444 (lazy seek, already in branch-2.8) and HADOOP-13028:

mvn test -Phadoop-2.7 -DwildcardSuites=org.apache.spark.cloud.s3.S3aIOSuite  \
-Dcloud.test.configuration.file=../cloud.xml \
-Dhadoop.version=2.9.0-SNAPSHOT \
-Dtest.method.keys=CSVgz

@SparkQA commented May 19, 2016

Test build #58856 has finished for PR 12004 at commit 4e37c7a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

steveloughran and others added 14 commits March 20, 2017 14:15
…CP is set up consistently for the later versions of the AWS SDK
…ct references to AWS classes failing. Cut them and rely on transitive load through FS class instantiation to force the load. All that happens is that failures to link will be slightly less easy to debug.
…ack JARs, transitive dependencies on aws and azure SDKs
Change-Id: I3dea2544f089615493163f0fae482992873f9c35
Change-Id: Ibd6d40df2bc8a2edf19a058c458bea233ba414fd
… as well as 2.7, keeping azure the 2.7.x dependency. All dependencies are scoped @ hadoop.scope

Change-Id: I80bd95fd48e21cf2eb4d94907ac99081cd3bd375
@steveloughran (Contributor Author) commented Mar 20, 2017

The latest patch embraces the fact that 2.6 is the base Hadoop version, so the hadoop-aws JAR is always pulled in and its dependencies set up. One thing to bear in mind here is that the [Phase I fixes](https://issues.apache.org/jira/browse/HADOOP-11571) aren't in there, and that version of s3a absolutely must not be used in production, the big killers being:

  • HADOOP-11570: closing the stream reads to the EOF, which means every seek() can read up to 2x the file size.
  • HADOOP-11584: the block size returned in getFileStatus() is 0. That is bad because both Pig and Spark use that block size in partitioning, so they will split a file into single-byte partitions: a 20 MB file becomes 2×10^7 tasks, each of which will open the file at byte 0, seek to its offset, then close(). Combined with the close() behaviour above, that is 2×10^7 tasks each reading roughly 2 × 2×10^7 bytes. This is generally considered "pathologically suboptimal". I've had to modify my downstream tests to recognise when the block size of a file is 0 and skip those tests (a sketch of such a guard follows this list).
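
For reference, a guard of that kind might look like the following ScalaTest-flavoured sketch (a hypothetical helper, not code from this PR):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.scalatest.Assertions.assume

// Cancel the calling test when the store reports a block size of 0 for the
// source file (the HADOOP-11584 behaviour), rather than letting it run
// against pathological partitioning.
def skipIfZeroBlockSize(fs: FileSystem, path: Path): Unit = {
  val blockSize = fs.getFileStatus(path).getBlockSize
  assume(blockSize > 0, s"$path reports block size $blockSize; skipping")
}
```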

s3n will work; in 2.6 it moved to the hadoop-aws JAR, so this reinstates the functionality which was in Spark builds against Hadoop 2.2-2.5.

Between this patch and the previous one, I tried switching spark-hadoop-cloud to a POM-only artifact. That doesn't work for downstream projects: they kept getting the 2.6.5 artifacts irrespective of which profile was used at build time, which negates the whole goal of the module ("give consistent access to the Hadoop cloud modules"). This patch produces a minimal JAR for that reason.

@SparkQA commented Mar 20, 2017

Test build #74899 has finished for PR 12004 at commit 83d9368.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

Any comments on the latest patch? Anyone?

@steveloughran (Contributor Author)

@srowen anything else I need to do here?

@srowen (Member) left a comment

Same, I've timed out on this and am not going to review it further. I don't think this is worth this much effort and discussion.

<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.11</artifactId>
<version>2.2.0-SNAPSHOT</version>
Member

We may need to make this 2.3.0-SNAPSHOT now, because that's what's correct for master, then change it if it back-ports.

Contributor Author

I'd noticed that this morning....

The imports of transitive dependencies are managed to make them consistent
with those of the Spark build.

WARNING: the signatures of methods in the AWS and Azure SDKs do change between
Member

I would only include the first sentence here. The description here should be short since nobody will likely read it. Anything substantive could go in docs.

Pull in spark-hadoop-cloud and its associated JARs,
-->
<profile>
<id>cloud</id>
Member

Call this hadoop-cloud perhaps?

Contributor Author

So org/apache/spark + hadoop-cloud? It would cause too much confusion if the JAR created were thrown into a lib/ directory; you'd get

hadoop-aws-2.8.1.jar
spark-core-2.3.0
hadoop-cloud-2.3.0

& people would be trying to understand why the hadoop-* was out of sync, who to ping, etc.

There's actually a hadoop-cloudproject POM coming in Hadoop trunk to try and be a one-stop dependency for all cloud bindings (avoiding the ongoing "declare new dependencies per version" churn). The names are way too close.

I'd had it as spark-cloud; you'd felt spark-hadoop-cloud was better. I can't think of what else would do, but I do think spark- is the string which should go at the front.

<profiles>

<profile>
<id>hadoop-2.7</id>
Member

So this only needs to come in for Hadoop 2.7+, not 2.6?

Contributor Author

Yes:

  • 2.7 adds hadoop-azure for wasb:
  • 2.8 adds hadoop-azure-datalake for adl:

There's going to be an aggregate POM in trunk, hadoop-cloud-storage, which declares all the transitive stuff, ideally stripping out cruft we don't need. That way, if new things go in, anything pulling that JAR shouldn't have to add new declarations. There's still the problem of transitive breakage of JARs (e.g. Jackson).

## <a name="introduction"></a>Introduction


All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
Member

Google is "Google Cloud Platform"? or maybe just say "All major public cloud infrastructure providers offer ..."

Contributor Author

I'll do "All major cloud providers offer persistent data storage in object stores."

```

This reads from the object in blocks, which is efficient when seeking backwards as well as
forwards in a file —at the expense of making full file reads slower.
Member

same comment about the dash here

Contributor Author

Again, cut that whole section; it's left to the Hadoop docs.


### Directory Operations May be Slow and Non-atomic

Directory rename and delete may be performed as a series of operations. Specifically, recursive
Member

This is all true, but is it relevant to a Spark app? maybe, if it's also using HDFS APIs directly. If you revise this bit, consider emphasizing how this reality affects Spark usage in particular.

Contributor Author

Good point. I'll call that out up front:

  1. Reading and writing data can be slower than expected.
  2. Some directory structures may be very inefficient to scan during query split calculation.
  3. The output of a saved RDD may not be immediately visible to a follow-on query.
  4. The internal mechanism by which Spark commits work when saving an RDD is potentially
    both slow and unreliable.

That's essentially it for day-to-day use.


### Data is Not Written Until the OutputStream's `close()` Operation.

Data written to the object store is often buffered to a local file or stored in memory,
Member

This section I'm less clear is relevant to a Spark app. Anything's possible but it's rare that someone would write a stream directly (?) Dunno, it's not bad info, just want to balance writing and maintaining this info in Spark docs vs pointing to other resources with a summary of key points to know.

Contributor Author

I've cut it.

Contributor Author

(FWIW, where this does cause problems is that really slow writes can block things like heartbeat protocols if the same thread is doing the heartbeat. It hasn't surfaced in Spark, AFAIK.)

@@ -621,6 +621,11 @@
<version>${fasterxml.jackson.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-cbor</artifactId>
Member

what drives this -- is it just another Jackson component whose version has to be harmonized?

Contributor Author

Yes, keeping Jackson in sync is a key breakage point. Declaring it in the root POM doesn't add it everywhere; it just declares it so that the cloud POM can exclude the one which comes in via the aws-java-sdk-s3 dependency JARs and pick up the one used in Spark. The (later) Spark one is compatible with the one the AWS SDK depends on, so moving up works... it's just that all the Jackson bits need to be in sync, and there's no way in Maven or Ivy to declare that fact.
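
The Maven wiring isn't reproduced here, but as an sbt-flavoured illustration of the same idea (all coordinates and version numbers below are placeholders, not taken from this PR): exclude the Jackson artifacts that ride in with the AWS SDK and pin the ones Spark uses, so every jackson-* module resolves to the same version.

```scala
// Hypothetical build.sbt fragment illustrating the "keep Jackson in lockstep" idea.
val sparkJacksonVersion = "2.6.5"  // placeholder for the version Spark pins

libraryDependencies ++= Seq(
  ("com.amazonaws" % "aws-java-sdk-s3" % "1.10.6")  // placeholder SDK version
    .exclude("com.fasterxml.jackson.core", "jackson-databind")
    .exclude("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor"),
  "com.fasterxml.jackson.core" % "jackson-databind" % sparkJacksonVersion,
  "com.fasterxml.jackson.dataformat" % "jackson-dataformat-cbor" % sparkJacksonVersion
)
```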

the AWS module pulls in jackson; its transitive dependencies can create
intra-jackson-module version problems.
-->
<dependency>
Member

Shouldn't all this only be in cloud/pom.xml?

Contributor Author

I'm declaring them here purely as declarations, so there's one place where all the Hadoop dependencies go, versions declared, etc. The cloud POM explicitly adds them to the module's dependencies, so that's the only place where they get pulled into the build. It's similar to what's done with Hive: spark-hive declares the actual dependencies, but the root POM declares what Hive version to pull in and what to leave out.

@steveloughran (Contributor Author)

GitHub isn't letting me reopen this, so I'm going to submit the patch with reworked docs as a new PR. The machines do not like me today.

asfgit pushed a commit that referenced this pull request May 7, 2017
…tore access.

## What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(this is the successor to #12004; I can't re-open it)

## How was this patch tested?

Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)

Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
maven build `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

Author: Steve Loughran <[email protected]>
Author: Steve Loughran <[email protected]>

Closes #17834 from steveloughran/cloud/SPARK-7481-current.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
jisookim0513 pushed a commit to metamx/spark that referenced this pull request Aug 28, 2017
isimonenko pushed a commit to metamx/spark that referenced this pull request Oct 17, 2017