
[SPARK-7481] [build] Add spark-cloud module to pull in object store access, + documentation #12004

Closed

Conversation

@steveloughran (Contributor) commented Mar 28, 2016

What changes were proposed in this pull request?

Add a new cloud module and Maven profile to pull in object store support from the hadoop-openstack, hadoop-aws and hadoop-azure JARs, along with their dependencies.

As a result, it restores s3n:// access to S3 and adds its s3a:// replacement, OpenStack swift:// and Azure wasb://.

The patch defines a new module, spark-cloud, with transitive dependencies on the Amazon and Azure JARs. The Spark assembly module depends on this module to pull the relevant Hadoop and dependent JARs into SPARK_HOME/jars. The module is only included in a build if the cloud profile is set; there's an implicit dependency on the hadoop-2.7 profile too, because that is where the various hadoop-$CLOUD modules are imported.

There's a documentation page, cloud_integration.md, which covers the details of using Spark with object stores: the API level, the different object stores, limitations and recommended tuning options. All the examples are in Scala. The docs aim to be an introduction that points at the Hadoop documentation rather than duplicating text which would age quickly across versions.
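
To give a flavour of the kind of Scala example that page contains, here is a minimal, illustrative snippet (not lifted from the docs themselves); the bucket and object path are placeholders, and it assumes the relevant connector JARs and credentials are on the classpath:

```scala
// Illustrative only: count the lines of a file stored in S3 via the s3a connector.
// "example-bucket" and the object path are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object ObjectStoreReadExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ObjectStoreRead").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val lines = sc.textFile("s3a://example-bucket/data/scene_list.gz")
      println(s"line count = ${lines.count()}")
    } finally {
      sc.stop()
    }
  }
}
```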

Having an explicit module lets downstream applications declare a dependency on spark-cloud and get object store access. This is used by the integration tests, which therefore implicitly verify that the declared dependencies are correct.
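
As an illustration of what that downstream declaration might look like in an sbt build (the artifact id and version here are assumptions based on the module name used in this PR, not anything final):

```scala
// Hypothetical build.sbt fragment for a downstream project.
// The artifact name ("spark-cloud") and version are illustrative placeholders.
libraryDependencies += "org.apache.spark" %% "spark-cloud" % "2.2.0-SNAPSHOT"
```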

There is no import or coverage of Google Cloud Storage's gs:// connector; that's not an ASF project, and we've actually had some problems with gs. At the time of writing it hasn't implemented the standard Hadoop FS contract tests, so it's not clear what state it is in. That could be future work.

There is a mention of, but no details on, Amazon EMR's s3:// client. It essentially says "Amazon's code, their problem", which is the case.

How was this patch tested?

There are no functional tests in this module; it is left to downstream projects to implement integration tests. I have such tests in https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples

The module has been tested against: Hadoop 2.7.3, branch-2.8 (2.8.0-SNAPSHOT), branch-2 (2.9.0-SNAPSHOT) and Hadoop trunk.

Tested endpoints: AWS S3 US-East, AWS S3 Ireland, Azure (US and EU), and Rackspace Swift US and EU.

@steveloughran (Contributor Author)

Note that as this patch is playing with the Maven build and the hadoop-2.6 and hadoop-2.7 profiles, the SparkQA builds aren't going to pick up on much here.

cloud/pom.xml Outdated
<profiles>

<!--
This profile is enabled automatically by the sbt built. It changes the scope for the guava
Contributor

Typo: By the sbt build.

Contributor Author

Funny: that was a comment I lifted along with the dependency, cargo-cult style; it'll need fixing in the original too.

@SparkQA commented Mar 28, 2016

Test build #54333 has finished for PR 12004 at commit 5e9cfbe.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

Test failures are in Hive; unrelated.

@SparkQA commented Mar 29, 2016

Test build #54454 has finished for PR 12004 at commit 72b3548.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DependencyCheckSuite extends SparkFunSuite

@steveloughran (Contributor Author)

Build failing: the SBT build needs to make the spark/cloud module conditional on Hadoop 2.6+.

@steveloughran steveloughran changed the title [SPARK-7481][build][WIP] Add Hadoop 2.6+ profile to pull in object store FS accesors [SPARK-7481][build][WIP] Add Hadoop 2.6+ spark-cloud module to pull in object store access, plus tests Mar 30, 2016
@SparkQA commented Mar 30, 2016

Test build #54525 has finished for PR 12004 at commit 6beafb5.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 3, 2016

Test build #54803 has finished for PR 12004 at commit 6487d93.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 15, 2016

Test build #55916 has finished for PR 12004 at commit 105de0b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 18, 2016

Test build #56092 has finished for PR 12004 at commit ce48b8c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56360 has finished for PR 12004 at commit 8845af0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Apr 20, 2016

Test build #56390 has finished for PR 12004 at commit 2fca815.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

The latest version of this does, among other things, call FileSystem.toString() after operations. In HADOOP-13028, along with seek optimisation, S3AFileSystem.toString() now dumps all the statistics to date. This means that the aggregate state of all test runs is displayed; if you run a specific test standalone, you can see the stats purely for that test.
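
A minimal sketch of how those statistics can be surfaced from a shell or test, assuming an active SparkContext `sc` and the s3a connector on the classpath (the bucket is the public landsat-pds one used in the log below):

```scala
// Fetch the (cached) S3A filesystem instance and print its toString(),
// which with HADOOP-13028 includes the accumulated statistics.
import java.net.URI
import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(new URI("s3a://landsat-pds/"), sc.hadoopConfiguration)
println(s"Filesystem statistics $fs")
```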

Here's a test run with the Maven args -Phadoop-2.7 -DwildcardSuites=org.apache.spark.cloud.s3.S3aIOSuite -Dcloud.test.configuration.file=../cloud.xml -Dhadoop.version=2.9.0-SNAPSHOT -Dtest.method.keys=CSVgz:

2016-04-26 14:32:17,104 INFO  scheduler.TaskSetManager (Logging.scala:logInfo(54)) - Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 5261 bytes)
2016-04-26 14:32:17,105 INFO  executor.Executor (Logging.scala:logInfo(54)) - Running task 0.0 in stage 0.0 (TID 0)
2016-04-26 14:32:17,111 INFO  rdd.HadoopRDD (Logging.scala:logInfo(54)) - Input split: s3a://landsat-pds/scene_list.gz:0+20430493
2016-04-26 14:32:17,285 INFO  compress.CodecPool (CodecPool.java:getDecompressor(181)) - Got brand-new decompressor [.gz]
2016-04-26 14:32:21,724 INFO  executor.Executor (Logging.scala:logInfo(54)) - Finished task 0.0 in stage 0.0 (TID 0). 2643 bytes result sent to driver
2016-04-26 14:32:21,727 INFO  scheduler.TaskSetManager (Logging.scala:logInfo(54)) - Finished task 0.0 in stage 0.0 (TID 0) in 4625 ms on localhost (1/1)
2016-04-26 14:32:21,727 INFO  scheduler.TaskSchedulerImpl (Logging.scala:logInfo(54)) - Removed TaskSet 0.0, whose tasks have all completed, from pool 
2016-04-26 14:32:21,728 INFO  scheduler.DAGScheduler (Logging.scala:logInfo(54)) - ResultStage 0 (count at S3aIOSuite.scala:127) finished in 4.626 s
2016-04-26 14:32:21,728 INFO  scheduler.DAGScheduler (Logging.scala:logInfo(54)) - Job 0 finished: count at S3aIOSuite.scala:127, took 4.636417 s
2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) -  size of s3a://landsat-pds/scene_list.gz = 464105 rows read in 4815885000 nS
2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) - Filesystem statistics S3AFileSystem{uri=s3a://landsat-pds, workingDir=s3a://landsat-pds/user/stevel, partSize=104857600, enableMultiObjectsDelete=true, multiPartThreshold=2147483647, statistics {40864879 bytes read, 7786 bytes written, 110 read ops, 0 large read ops, 26 write ops}, metrics {{Context=S3AFileSystem} {FileSystemId=bc5db77d-e17d-41bb-88ab-44b26cf3eda4-landsat-pds} {fsURI=s3a://landsat-pds/scene_list.gz} {files_created=0} {files_copied=0} {files_copied_bytes=0} {files_deleted=0} {directories_created=0} {directories_deleted=0} {ignored_errors=0} {streamForwardSeekOperations=0} {streamCloseOperations=2} {streamBytesSkippedOnSeek=0} {streamReadOperations=2821} {streamReadExceptions=0} {streamAborted=0} {streamBackwardSeekOperations=0} {streamClosed=2} {streamOpened=2} {streamSeekOperations=0} {streamBytesRead=40860986} {streamReadOperationsIncomplete=2821} {streamReadFullyOperations=0} }}
2016-04-26 14:32:21,729 INFO  s3.S3aIOSuite (Logging.scala:logInfo(54)) - 

@steveloughran (Contributor Author)

Oh, and there's an initial documentation page on Spark + cloud infrastructure, which tries to make clear that object stores are not real filesystems.

@SparkQA commented Apr 26, 2016

Test build #56998 has finished for PR 12004 at commit 8926acb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

Note that as this module only builds on Hadoop >= 2.6, Jenkins won't be compiling it. The tests are designed to skip if no configuration file pointing at cloud infrastructure has been provided.

@SparkQA commented Apr 26, 2016

Test build #57003 has finished for PR 12004 at commit 4e4e941.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 4, 2016

Test build #57782 has finished for PR 12004 at commit 70065dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

For anyone trying to run these tests: they'll need a test XML file and to refer to it from the build:

mvn test -Phadoop-2.6 -Dcloud.test.configuration.file=../cloud.xml 

The referenced file uses XInclude to pull in the AWS credentials, which I keep a long way away from SCM-managed directories.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~  or more contributor license agreements.  See the NOTICE file
  ~  distributed with this work for additional information
  ~  regarding copyright ownership.  The ASF licenses this file
  ~  to you under the Apache License, Version 2.0 (the
  ~  "License"); you may not use this file except in compliance
  ~  with the License.  You may obtain a copy of the License at
  ~
  ~       http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~  Unless required by applicable law or agreed to in writing, software
  ~  distributed under the License is distributed on an "AS IS" BASIS,
  ~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  ~  See the License for the specific language governing permissions and
  ~  limitations under the License.
  -->

<configuration>
  <include xmlns="http://www.w3.org/2001/XInclude"
    href="file:///home/stevel/.aws/keys.xml"/>

  <property>
    <name>aws.tests.enabled</name>
    <value>true</value>
  </property>

  <property>
    <name>s3a.test.uri</name>
    <value>s3a://test-eu1</value>
  </property>
</configuration>

All the test suites are designed to run only if the relevant enabled flag is set; this is why there's a new method for declaring tests: ctest(key: String, summary: String, detail: String)(testFun: => Unit): Unit

These tests are not only conditional on the suite being enabled; each also has a key which can be explicitly named from the build via the test.method.keys attribute. This allows individual test methods to be selected in a way the current Maven Surefire runner doesn't support; given how long individual tests can take to run, this feature is invaluable during iterative development.
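
A sketch of what such a helper could look like follows; this is an assumption about its shape, not the actual implementation (which lives in the downstream test module). `suiteEnabled` and `requestedKeys` stand in for values derived from the configuration file and the test.method.keys property. The real usage example from the suite appears after it.

```scala
import org.scalatest.FunSuite

// Hypothetical: register a test only when the suite is enabled and, if specific
// keys were requested, only when this test's key is among them; otherwise ignore it.
trait ConditionalTests extends FunSuite {
  protected def suiteEnabled: Boolean
  protected def requestedKeys: Set[String]

  // `detail` is accepted to match the signature described above, but unused in this sketch.
  protected def ctest(key: String, summary: String, detail: String)
      (testFun: => Unit): Unit = {
    if (suiteEnabled && (requestedKeys.isEmpty || requestedKeys.contains(key))) {
      test(s"$key: $summary") { testFun }
    } else {
      ignore(s"$key: $summary") { testFun }
    }
  }
}
```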

  ctest("CSVgz", "Read compressed CSV",
    "Read compressed CSV files through the spark context") {
    val source = SceneList
    sc = new SparkContext("local", "test", newSparkConf(source))
    val sceneInfo = getFS(source).getFileStatus(source)
    logInfo(s"Compressed size = ${sceneInfo.getLen}")
    validateCSV(sc, source)
    logInfo(s"Filesystem statistics ${getFS(source)}")
  }

For best performance, you need a build of Hadoop which has as much of HADOOP-11694 applied as possible, especially HADOOP-12444 (lazy seek, already in branch-2.8) and HADOOP-13028:

mvn test -Phadoop-2.7 -DwildcardSuites=org.apache.spark.cloud.s3.S3aIOSuite  \
-Dcloud.test.configuration.file=../cloud.xml \
-Dhadoop.version=2.9.0-SNAPSHOT \
-Dtest.method.keys=CSVgz

@SparkQA commented May 19, 2016

Test build #58856 has finished for PR 12004 at commit 4e37c7a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

steveloughran and others added 14 commits March 20, 2017 14:15
…CP is set up consistently for the later versions of the AWS SDK
…ct references to AWS classes failing. Cut them and rely on transitive load through FS class instantiation to force the load. All that happens is that failures to link will be slightly less easy to debug.
…ack JARs, transitive dependencies on aws and azure SDKs
Change-Id: I3dea2544f089615493163f0fae482992873f9c35
Change-Id: Ibd6d40df2bc8a2edf19a058c458bea233ba414fd
… as well as 2.7, keeping azure the 2.7.x dependency. All dependencies are scoped @ hadoop.scope

Change-Id: I80bd95fd48e21cf2eb4d94907ac99081cd3bd375
@steveloughran (Contributor Author) commented Mar 20, 2017

The latest patch embraces the fact that 2.6 is the base Hadoop version, so the hadoop-aws JAR is always pulled in and its dependencies set up. One thing to bear in mind here is that the [Phase I fixes](https://issues.apache.org/jira/browse/HADOOP-11571) aren't in there, and that version of s3a absolutely must not be used in production, the big killers being:

  • HADOOP-11570: closing the stream reads to the EOF, which means every seek() can read up to 2x the file size.
  • HADOOP-11584: the block size returned in getFileStatus() is 0. That is bad because both Pig and Spark use that block size in partitioning, so they will split a file into single-byte partitions: a 20 MB file becomes 2×10^7 tasks, each of which will open the file at byte 0, seek to its offset, then close(). Combined with the close() behaviour above, that is 2×10^7 tasks each reading roughly 2 × 2×10^7 bytes. This is generally considered "pathologically suboptimal". I've had to modify my downstream tests to recognise when the block size of a file is 0 and skip those tests (a sketch of such a guard follows this list).
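
For reference, a guard of that kind might look like the following ScalaTest-flavoured sketch (a hypothetical helper, not code from this PR):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import org.scalatest.Assertions.assume

// Cancel the calling test when the store reports a block size of 0 for the
// source file (the HADOOP-11584 behaviour), rather than letting it run
// against pathological partitioning.
def skipIfZeroBlockSize(fs: FileSystem, path: Path): Unit = {
  val blockSize = fs.getFileStatus(path).getBlockSize
  assume(blockSize > 0, s"$path reports block size $blockSize; skipping")
}
```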

s3n will work; in 2.6 it moved to the hadoop-aws JAR, so this reinstates the functionality which was in Spark builds against Hadoop 2.2-2.5.

Between this patch and the previous one, I tried switching spark-hadoop-cloud to a POM-only artifact. That doesn't work for downstream projects: they kept getting the 2.6.5 artifacts irrespective of which profile was used at build time, which negates the whole goal of the module ("give consistent access to the Hadoop cloud modules"). This patch produces a minimal JAR for that reason.

@SparkQA commented Mar 20, 2017

Test build #74899 has finished for PR 12004 at commit 83d9368.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor Author)

Any comments on the latest patch? Anyone?

@steveloughran (Contributor Author)

@srowen anything else I need to do here?

@srowen (Member) left a comment

Same, I've timed out on this and am not going to review it further. I don't think this is worth this much effort and discussion.

<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.11</artifactId>
<version>2.2.0-SNAPSHOT</version>
Member

We may need to make this 2.3.0-SNAPSHOT now, because that's what's correct for master, then change it if it back-ports.

Contributor Author

I'd noticed that this morning....

The imports of transitive dependencies are managed to make them consistent
with those of the Spark build.

WARNING: the signatures of methods in the AWS and Azure SDKs do change between
Member

I would only include the first sentence here. The description here should be short since nobody will likely read it. Anything substantive could go in docs.

Pull in spark-hadoop-cloud and its associated JARs,
-->
<profile>
<id>cloud</id>
Member

Call this hadoop-cloud perhaps?

Contributor Author

So org/apache/spark + hadoop-cloud? It would cause too much confusion if the JAR created were thrown into a lib/ directory; you'd get

hadoop-aws-2.8.1.jar
spark-core-2.3.0
hadoop-cloud-2.3.0

& people would be trying to understand why the hadoop-* was out of sync, who to ping, etc.

There's actually a hadoop-cloudproject POM coming in Hadoop trunk to try and be a one-stop dependency for all cloud bindings (avoiding the ongoing "declare new dependencies per version" churn). The names are way too close.

I'd had it as spark-cloud; you'd felt spark-hadoop-cloud was better. I can't think of what else would do, but I do think spark- is the string which should go at the front.

<profiles>

<profile>
<id>hadoop-2.7</id>
Member

So this only needs to come in for Hadoop 2.7+, not 2.6?

Contributor Author

Yes:

  • 2.7 adds hadoop-azure for wasb:
  • 2.8 adds hadoop-azure-datalake for adl:

There's going to be an aggregate POM in trunk, hadoop-cloud-storage, which declares all the transitive stuff, ideally stripping out cruft we don't need. That way, if new things go in, anything pulling that JAR shouldn't have to add new declarations. There's still the problem of transitive breakage of JARs (e.g. Jackson).

## <a name="introduction"></a>Introduction


All the public cloud infrastructures, Amazon AWS, Microsoft Azure, Google GCS and others offer
Member

Google is "Google Cloud Platform"? or maybe just say "All major public cloud infrastructure providers offer ..."

Contributor Author

I'll do "All major cloud providers offer persistent data storage in object stores."

```

This reads from the object in blocks, which is efficient when seeking backwards as well as
forwards in a file —at the expense of making full file reads slower.
Member

same comment about the dash here

Contributor Author

Again, cut that whole section; it's left to the Hadoop docs.


### Directory Operations May be Slow and Non-atomic

Directory rename and delete may be performed as a series of operations. Specifically, recursive
Member

This is all true, but is it relevant to a Spark app? maybe, if it's also using HDFS APIs directly. If you revise this bit, consider emphasizing how this reality affects Spark usage in particular.

Contributor Author

Good point. I'll call that out up front:

  1. Reading and writing data can be slower than expected.
  2. Some directory structures may be very inefficient to scan during query split calculation.
  3. The output of a saved RDD may not be immediately visible to a follow-on query.
  4. The internal mechanism by which Spark commits work when saving an RDD is potentially
    both slow and unreliable.

That's essentially it for day-to-day use.


### Data is Not Written Until the OutputStream's `close()` Operation.

Data written to the object store is often buffered to a local file or stored in memory,
Member

This section I'm less clear is relevant to a Spark app. Anything's possible but it's rare that someone would write a stream directly (?) Dunno, it's not bad info, just want to balance writing and maintaining this info in Spark docs vs pointing to other resources with a summary of key points to know.

Contributor Author

I've cut it.

Contributor Author

(FWIW, where this does cause problems is that really slow writes can block things like heartbeat protocols if the same thread is doing the heartbeat. It hasn't surfaced in Spark, AFAIK.)

@@ -621,6 +621,11 @@
<version>${fasterxml.jackson.version}</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-cbor</artifactId>
Member

what drives this -- is it just another Jackson component whose version has to be harmonized?

Contributor Author

Yes, keeping Jackson in sync is a key breakage point. Declaring it in the root POM doesn't add it everywhere; it just declares it so that the cloud POM can exclude the one which comes in via the aws-java-sdk-s3 dependency JARs and pick up the one used in Spark. The (later) Spark one is compatible with the one the AWS SDK depends on, so moving up works... it's just that all the Jackson bits need to be in sync, and there's no way in Maven or Ivy to declare that fact.
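
The Maven wiring isn't reproduced here, but as an sbt-flavoured illustration of the same idea (all coordinates and version numbers below are placeholders, not taken from this PR): exclude the Jackson artifacts that ride in with the AWS SDK and pin the ones Spark uses, so every jackson-* module resolves to the same version.

```scala
// Hypothetical build.sbt fragment illustrating the "keep Jackson in lockstep" idea.
val sparkJacksonVersion = "2.6.5"  // placeholder for the version Spark pins

libraryDependencies ++= Seq(
  ("com.amazonaws" % "aws-java-sdk-s3" % "1.10.6")  // placeholder SDK version
    .exclude("com.fasterxml.jackson.core", "jackson-databind")
    .exclude("com.fasterxml.jackson.dataformat", "jackson-dataformat-cbor"),
  "com.fasterxml.jackson.core" % "jackson-databind" % sparkJacksonVersion,
  "com.fasterxml.jackson.dataformat" % "jackson-dataformat-cbor" % sparkJacksonVersion
)
```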

the AWS module pulls in jackson; its transitive dependencies can create
intra-jackson-module version problems.
-->
<dependency>
Member

Shouldn't all this only be in cloud/pom.xml?

Contributor Author

I'm declaring them here purely as declarations, so there's one place where all the Hadoop dependencies go, versions declared, etc. The cloud POM explicitly adds them to the module's dependencies, so that's the only place where they get pulled into the build. It's similar to what's done with Hive: spark-hive declares the actual dependencies, but the root POM declares what Hive version to pull in and what to leave out.

@steveloughran (Contributor Author)

GitHub isn't letting me reopen this, so I'm going to submit the patch with reworked docs as a new PR. The machines do not like me today.

asfgit pushed a commit that referenced this pull request May 7, 2017
…tore access.

## What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and maven profile to pull in object store support from `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, OpenStack `swift://` and azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the supplier's own documentation, with specific warnings on security and the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.

(this is the successor to #12004; I can't re-open it)

## How was this patch tested?

Downstream tests exist in [https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples](https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples)

Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

Manually clean build & verify that assembly contains the relevant aws-* hadoop-* artifacts on Hadoop 2.6; azure on a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
maven build `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

Author: Steve Loughran <[email protected]>
Author: Steve Loughran <[email protected]>

Closes #17834 from steveloughran/cloud/SPARK-7481-current.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
jisookim0513 pushed a commit to metamx/spark that referenced this pull request Aug 28, 2017
isimonenko pushed a commit to metamx/spark that referenced this pull request Oct 17, 2017