
[SPARK-7481] [build] Add spark-hadoop-cloud module to pull in object store access. #17834

Closed

Conversation

@steveloughran (Contributor) commented May 2, 2017

What changes were proposed in this pull request?

Add a new `spark-hadoop-cloud` module and Maven profile to pull in object store support from the `hadoop-openstack`, `hadoop-aws` and `hadoop-azure` (Hadoop 2.7+) JARs, along with their dependencies, fixing up the dependencies so that everything works together, in particular Jackson.

It restores `s3n://` access to S3, adds its `s3a://` replacement, and adds OpenStack `swift://` and Azure `wasb://`.

There's a documentation page, `cloud_integration.md`, which covers the basic details of using Spark with object stores, referring the reader to the suppliers' own documentation, with specific warnings on security and on the possible mismatch between a store's behavior and that of a filesystem. In particular, users are advised to be very cautious when trying to use an object store as the destination of data, and to consult the documentation of the storage supplier and the connector.
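
To make that concrete, a minimal sketch of what the module enables once its JARs are on the classpath. The bucket name and paths are hypothetical; the `fs.s3a.*` keys are the standard Hadoop S3A credential options:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical bucket and paths, for illustration only. Any s3a://, wasb:// or
// swift:// URI works once the hadoop-cloud module's JARs are on the classpath.
val spark = SparkSession.builder()
  .appName("object-store-demo")
  // Standard S3A credential options, passed through via the spark.hadoop. prefix.
  .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
  .getOrCreate()

// Basic read/write against the store; see cloud_integration.md for the caveats
// on using an object store as a direct destination of work.
val df = spark.read.parquet("s3a://example-bucket/input/")
df.write.parquet("s3a://example-bucket/output/")
```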

(this is the successor to #12004; I can't re-open it)

How was this patch tested?

Downstream tests exist in https://github.com/steveloughran/spark-cloud-examples/tree/master/cloud-examples

Those verify that the dependencies are sufficient to allow downstream applications to work with s3a, azure wasb and swift storage connectors, and perform basic IO & dataframe operations thereon. All seems well.

Manual clean builds to verify that the assembly contains the relevant aws-* and hadoop-* artifacts on Hadoop 2.6, and the azure artifacts with a hadoop-2.7 profile.

SBT build: `build/sbt -Phadoop-cloud -Phadoop-2.7 package`
Maven build: `mvn install -Phadoop-cloud -Phadoop-2.7`

This PR *does not* update `dev/deps/spark-deps-hadoop-2.7` or `dev/deps/spark-deps-hadoop-2.6`, because unless the hadoop-cloud profile is enabled, no extra JARs show up in the dependency list. The dependency check in Jenkins isn't setting the property, so the new JARs aren't visible.

steveloughran and others added 25 commits April 25, 2017 10:57
…azure, hadoop-aws, hadoop-openstack and implicitly their transitive dependencies are resolved. They don't verify all dependency setup, specifically that Jackson versions are consistent; that needs integration testing.
…eam tests which pulled in spark-cloud but not spark-hive were ending up with inconsistent versions. Add a test for the missing class being there too.
…CP is set up consistently for the later versions of the AWS SDK
…ct references to AWS classes failing. Cut them and rely on transitive load through FS class instantiation to force the load. All that happens is that failures to link will be slightly less easy to debug.
…ack JARs, transitive dependencies on aws and azure SDKs
Change-Id: I3dea2544f089615493163f0fae482992873f9c35
Change-Id: Ibd6d40df2bc8a2edf19a058c458bea233ba414fd
… as well as 2.7, keeping azure the 2.7.x dependency. All dependencies are scoped @ hadoop.scope

Change-Id: I80bd95fd48e21cf2eb4d94907ac99081cd3bd375
Change-Id: I91f764aeed7d832df1538453d869a7fd83964d65
Change-Id: I12ea6ed72ffa9edee964c90c862ff4c45bc4f47f
…urity, spark-specific options + links elsewhere

Change-Id: I7e9efe20d116802a403af875b241b91178078d78
Change-Id: I1923a4b6a959d86aa2c5b3d71faaaf2541d3ba85
Change-Id: I6a0b0b9f06a4adcdf55ef75161dc1039961bc7a1
Change-Id: Ic4804667af8e52b7be11fb00621ad8b69a1d2569
Change-Id: I2b75a2722f0082b916b9be20bd23a0bdc2d36615
@srowen (Member) commented May 2, 2017

OK, but does this address the comments from the other PR? It didn't look like it (yet) from a quick look.

@SparkQA commented May 2, 2017

Test build #76390 has finished for PR 17834 at commit 72a03ed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran (Contributor, Author)

The last round was on the doc comments, and I believe I've addressed them, both by fixing the little typos and by focusing the docs on the main points for Spark users: how stores differ from filesystems, and what that means in practice.

The big issue for Spark users is "the commit problem", where I've listed the object store behaviours and said "this means things may not work; consult the docs". I'm not being explicit about what works and what doesn't, as that's a moving target.

Right now, I don't trust commits to S3a using the V1 or V2 FileOutputCommitter algorithms to work 100% of the time, because they rely on list consistency, which Amazon S3 doesn't guarantee. I could make that a lot clearer, something like:

You cannot reliably use the FileOutputCommitter to commit work to Amazon S3 or Openstack Swift unless there is some form of consistency layer on top.

That probably is the core concept people need to know: it's not safe without something (EMR, S3Guard, Databricks Commit Service) to give you that consistent view.

Then add pointers to the talks by Eric Liang and myself on the topic.
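
For concreteness, a sketch of the knob under discussion. `mapreduce.fileoutputcommitter.algorithm.version` is the real Hadoop option; the output path is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Select the "version 2" FileOutputCommitter algorithm: fewer renames, so faster
// on object stores, but only as safe as the store's listings are consistent.
val spark = SparkSession.builder()
  .appName("committer-demo")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Without a consistency layer (EMR consistent view, S3Guard, ...), neither v1 nor
// v2 can be relied on when writing directly to Amazon S3.
spark.range(1000).write.parquet("s3a://example-bucket/table/")
```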

@srowen (Member) left a comment


OK, still a few comments, but yes, you're right, most were addressed.

assembly/pom.xml Outdated
Pull in spark-hadoop-cloud and its associated JARs,
-->
<profile>
<id>cloud</id>
Member:

Continuing from #12004 (comment) : I still think this should be hadoop-cloud just like the artifact name is spark-hadoop-cloud. The profile name itself doesn't cause an artifact to be generated called hadoop-cloud.

pom.xml Outdated
the AWS module pulls in jackson; its transitive dependencies can create
intra-jackson-module version problems.
-->
<dependency>
Member:

Continuing #12004 (comment) - I think it's useful to put things in the parent when they're used across multiple modules. For something only referenced in one child POM, it is probably less overhead to just put the dependency in the child rather than split and repeat part of it.

Contributor (Author):

OK, I'll do that. It's easy enough to revisit in future.

The one thing that will probably need revisiting is the jackson dependency, as it has the highest risk that someone will add it to another module. Though as it is driven by the same jackson.version variable as everywhere else, a duplicate declaration elsewhere isn't going to cause any harm.

cloud/pom.xml Outdated
@@ -0,0 +1,106 @@
<?xml version="1.0" encoding="UTF-8"?>
Member:

Oh, just noticed this -- should this perhaps be in a directory hadoop-cloud to match the profile and module name?

Contributor (Author):

OK


This uses the "version 2" algorithm for committing files, which does less
renaming than the "version 1" algorithm, though as it still uses `rename()`
to commit files, it may be unsafe to use.
Member:

It's hard to give advice here, as we discussed. It's possible to link to external documentation. Here, however, it sounds like you're saying version 2 is recommended, then immediately saying it could be unsafe to use. What does the reader make of that? What's the upside, and what's the nature and likelihood of failure?

If there's no clear answer, is this really a reliable recommendation?
Should this perhaps call out the issue and link to a larger discussion instead?

Contributor (Author):

I'll try to think of a better phrasing, saying "if your object store is consistent enough, use v2 for speed".

*  module is renamed hadoop-cloud in POMs, sbt, docs
*  hadoop-aws/azure/openstack declarations pushed down to hadoop-cloud pom, along with jackson-cbor
* docs around the commit algorithm option make clear that you should only worry about v1 vs v2 if the blobstore is consistent

Change-Id: Ia114bc8cd2ef731d54a83774d9dc2cf9e4c6e7d4
@steveloughran (Contributor, Author)

I've just pushed up an update which changes the module name; tested in Maven and SBT; hadoop cloud JAR dependencies pulled down.

A JAR is created, but it's just a stub one. As a result, when you build the assembly with -Phadoop-cloud, you get something from Spark mixed in with the Hadoop modules:

assembly/target/scala-2.11/jars/hadoop-aws-2.7.3.jar
assembly/target/scala-2.11/jars/hadoop-azure-2.7.3.jar
assembly/target/scala-2.11/jars/hadoop-client-2.7.3.jar
assembly/target/scala-2.11/jars/hadoop-cloud_2.11-2.3.0-SNAPSHOT.jar
assembly/target/scala-2.11/jars/hadoop-common-2.7.3.jar
assembly/target/scala-2.11/jars/hadoop-hdfs-2.7.3.jar
assembly/target/scala-2.11/jars/hadoop-mapreduce-client-app-2.7

Like I've said before, that's a bit dangerous, especially with a hadoop-cloud-projects POM module now upstream. We won't ever produce a hadoop-cloud-projects JAR, so there won't be a direct conflict; it's more the potential confusion if people see a JAR beginning hadoop-* with different version info.

@SparkQA commented May 3, 2017

Test build #76426 has finished for PR 17834 at commit b788494.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented May 4, 2017

That's because you also changed the artifact name to hadoop-cloud-.... That should remain spark-hadoop-cloud-.... It's the directory, module name, and profile ID that should be hadoop-cloud. It is at least consistent with how other modules and profiles are named.

@steveloughran (Contributor, Author)

OK, now I understand. Let me revert that bit of the patch.

…oud. Added an extra callout to the docs: "without some form of consistency layer, Amazon S3 cannot be safely used as the direct destination of work with the normal rename-based committer"

That holds for all Spark-in-EC2 deployments; EMR has a consistency option, as do/will others; different committers can work reliably.

Change-Id: Ibbf8b1b9de10b5bb83b647cc663ceb970f70ff2d
Testing: SBT and maven, with & without the hadoop-2.7 option.
@SparkQA commented May 4, 2017

Test build #76457 has finished for PR 17834 at commit e173e3f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a comment

Just some minor changes now; this looks fine.

* Rename operations may be very slow and, on failure, leave the store in an unknown state.
* Seeking within a file may require new HTTP calls, hurting performance.

How does affect Spark?
Member:

Just noticed a small typo: How does this affect Spark?

Contributor (Author):

fixed, "how does this affect Spark?"

They cannot be used as a direct replacement for a cluster filesystem such as HDFS
*except where this is explicitly stated*.

Key differences are
Member:

Nit: I'd end the line with a colon to make it clear it's not dangling

### Important: Cloud Object Stores are Not Real Filesystems

While the stores appear to be filesystems, underneath
they are still object stores, [and the difference is significant](http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/introduction.html)
Member:

Now that we're down to tiny nits, I'd link everywhere to https:// URLs

Contributor (Author):

Good catch. Done throughout the file except in the Apache license header. Rendered the HTML and clicked through the links.


<artifactId>spark-hadoop-cloud_2.11</artifactId>
<packaging>jar</packaging>
<name>Spark Project Cloud Integration</name>
Member:

Maybe Spark Project Cloud Object Store Integration for Hadoop? I feel like getting "Hadoop" in there is useful to somehow emphasize this is adding connectivity from HDFS APIs to object stores in particular

Contributor (Author):

As it works just as well standalone and on Mesos, I'll try "through Hadoop Libraries" and see how that reads.

…uple of superfluous blank lines

Change-Id: Iee9f0e0527de7bb875d1c2a805a0847702bb4e11
@SparkQA commented May 5, 2017

Test build #76492 has finished for PR 17834 at commit 32ebc8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) commented May 7, 2017

Merged to master

@asfgit asfgit closed this in 2cf83c4 May 7, 2017
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
…tore access.
jisookim0513 pushed a commit to metamx/spark that referenced this pull request Aug 28, 2017
…tore access.
isimonenko pushed a commit to metamx/spark that referenced this pull request Oct 17, 2017
…tore access.

(cherry picked from commit 2cf83c4)
@steveloughran steveloughran deleted the cloud/SPARK-7481-current branch March 18, 2019 17:58
dongjoon-hyun added a commit that referenced this pull request May 4, 2024
… `jvm-profiler` modules

### What changes were proposed in this pull request?

This PR aims to fix `dev/scalastyle` to check the `hadoop-cloud` and `jvm-profiler` modules.
Also, the detected scalastyle issues are fixed.

### Why are the changes needed?

To prevent future scalastyle issues.

A Scala style violation was introduced here, but we missed it because we didn't check all optional modules.
- #46022

The `jvm-profiler` module was newly added in Apache Spark 4.0.0, but we missed adding it to `dev/scalastyle`. Note that there were no Scala style issues in that module at the time.
- #44021

The `hadoop-cloud` module was added in Apache Spark 2.3.0.
- #17834

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with newly revised `dev/scalastyle`.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46376 from dongjoon-hyun/SPARK-48127.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
sinaiamonkar-sai pushed a commit to sinaiamonkar-sai/spark that referenced this pull request May 5, 2024
… `jvm-profiler` modules
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024
… `jvm-profiler` modules