
[SPARK-7481] [build] Add spark-cloud module to pull in object store access, + documentation #12004

Closed
Changes from all commits (18 commits)
1ecace7
[SPARK-7481] stripped down packaging only module
steveloughran Nov 18, 2016
38e74f9
[SPARK-7481] basic instantiation tests verify that dependency hadoop-…
steveloughran Nov 18, 2016
79b8ce0
[SPARK-7481] tests restricted to instantiation; logging modified appr…
steveloughran Nov 18, 2016
2bcbf3a
[SPARK-7481] declare httpcomponents:httpclient explicitly, as downstr…
steveloughran Nov 21, 2016
576b72c
[SPARK-7481] update docs by culling section on cloud integration test…
steveloughran Nov 21, 2016
f9d0923
[SPARK-7481] updated documentation as per review
steveloughran Nov 28, 2016
1fab96e
[SPARK-7481] SBT will build this now, optionally
steveloughran Nov 28, 2016
4065c28
[SPARK-7481] cloud POM includes jackson-dataformat-cbor, so that the …
steveloughran Nov 28, 2016
797ec49
[SPARK-7481] rebase with master; Pom had got out of sync
steveloughran Dec 1, 2016
5768c42
[SPARK-7481] rename spark-cloud module to spark-hadoo-cloud, in POMs …
steveloughran Dec 2, 2016
0fcdc36
[SPARK-7841] bump up cloud pom to 2.2.0-SNAPSHOT; other minor pom cle…
steveloughran Dec 14, 2016
b6d2002
[SPARK-7481] builds against Hadoop shaded 3.x clients failing as dire…
steveloughran Jan 10, 2017
abae7fb
[SPARK-7481] update 2.7 dependencies to include azure, aws and openst…
steveloughran Jan 20, 2017
6851aa4
[SPARK-7481] add joda time as the dependency. Tested against hadoop b…
steveloughran Jan 30, 2017
c738048
SPARK-7481 purge all tests from the cloud module
steveloughran Feb 24, 2017
ea5e1fa
SPARK-7481 add cloud module to sbt sequence
steveloughran Mar 20, 2017
aa4ea89
SPARK-7481 break line of mvn XML declaration
steveloughran Mar 20, 2017
83d9368
SPARK-7481 cloud pom is still JAR (not pom). works against Hadoop 2.6…
steveloughran Mar 20, 2017
14 changes: 14 additions & 0 deletions assembly/pom.xml
Expand Up @@ -226,5 +226,19 @@
<parquet.deps.scope>provided</parquet.deps.scope>
</properties>
</profile>

<!--
Pull in spark-hadoop-cloud and its associated JARs.
-->
<profile>
<id>cloud</id>
Member: Call this hadoop-cloud perhaps?

Contributor Author: So org/apache/spark + hadoop-cloud? It'll cause too much confusion if any JAR created were thrown into a lib/ directory; you'd get

hadoop-aws-2.8.1.jar
spark-core-2.3.0.jar
hadoop-cloud-2.3.0.jar

& people would be trying to understand why the hadoop-* JARs were out of sync, who to ping, etc.

There's actually a hadoop-cloud project POM coming in Hadoop trunk that tries to be a one-stop dependency for all cloud bindings (avoiding the ongoing "declare new dependencies per version" churn). The names are way too close.

I'd had it as spark-cloud; you'd felt spark-hadoop-cloud was better. I can't think of anything else that would do, but I do think spark- is the string that should go at the front.

<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hadoop-cloud_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
</profile>
</profiles>
</project>
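The cloud profile above is opt-in: the spark-hadoop-cloud artifact is only pulled into the assembly when the profile is activated. Under standard Spark build conventions, a build that wants the cloud JARs bundled would enable it alongside a Hadoop profile, along these lines (a sketch only; the exact profile set and flags depend on the target Hadoop version and local build setup):

```
# build Spark with the Hadoop 2.7 profile and the cloud module included
./build/mvn -Pcloud -Phadoop-2.7 -DskipTests clean package
```

Without -Pcloud, the assembly is unchanged, which is the point of gating the extra object-store dependencies behind a profile.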
117 changes: 117 additions & 0 deletions cloud/pom.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<parent>
<groupId>org.apache.spark</groupId>
<artifactId>spark-parent_2.11</artifactId>
<version>2.2.0-SNAPSHOT</version>
Member: We may need to make this 2.3.0-SNAPSHOT now, because that's what's correct for master, then change it if it back-ports.

Contributor Author: I'd noticed that this morning....

<relativePath>../pom.xml</relativePath>
</parent>

<artifactId>spark-hadoop-cloud_2.11</artifactId>
<packaging>jar</packaging>
<name>Spark Project Cloud Integration</name>
<description>
Contains support for cloud infrastructures, specifically the Hadoop JARs and
transitive dependencies needed to interact with the infrastructures.

Any project which explicitly depends upon the spark-hadoop-cloud artifact will get the
dependencies; the exact versions of which will depend upon the hadoop version Spark was compiled
against.

The imports of transitive dependencies are managed to make them consistent
with those of the Spark build.

WARNING: the signatures of methods in the AWS and Azure SDKs do change between
Member: Where does an end user need to act on this -- the profile is, in theory, setting all this up correctly, right?

Member: I would only include the first sentence here. The description should be short, since nobody will likely read it; anything substantive could go in the docs.

Contributor Author: Cutting back to the first line; the rest can be covered in the docs.

One option with the docs is to trim them back and say "consult the Hadoop documentation for object store setup," and I can be more explicit there on version pain.

versions: use exactly the same version with which the Hadoop JARs were
built.
</description>
<properties>
<sbt.project.name>hadoop-cloud</sbt.project.name>
</properties>

<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>

<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-openstack</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>
<!--
Add joda-time to ensure that anything downstream which doesn't pull in spark-hive
gets the correct joda-time artifact, so it doesn't hit auth failures on later Java 8 JVMs.
-->
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>
<!-- explicitly declare the jackson artifacts desired -->
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-annotations</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.dataformat</groupId>
<artifactId>jackson-dataformat-cbor</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>
<!-- Explicit declaration to force the Spark version into transitive dependencies -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpclient</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>
<!-- Explicit declaration to force the Spark version into transitive dependencies -->
<dependency>
<groupId>org.apache.httpcomponents</groupId>
<artifactId>httpcore</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>
</dependencies>

<profiles>

<profile>
<id>hadoop-2.7</id>
Member: So this only needs to come in for Hadoop 2.7+, not 2.6?

Contributor Author: yes

  • 2.7 adds hadoop-azure for wasb:
  • 2.8 adds hadoop-azure-datalake for adl:

There's going to be an aggregate POM in trunk, hadoop-cloud-storage, which declares all the transitive stuff, ideally stripping out cruft we don't need. That way, if new things go in, anything pulling that JAR shouldn't have to add new declarations. There's still the problem of transitive breakage of JARs (e.g. Jackson).

<dependencies>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-azure</artifactId>
<scope>${hadoop.deps.scope}</scope>
</dependency>
</dependencies>
</profile>

</profiles>

</project>
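For a downstream application that wants these object-store dependencies without rebuilding Spark's assembly, the POM's description implies depending on the artifact directly. That would look roughly like this (the version and Scala suffix below are illustrative; use the ones matching the Spark build you compile against):

```xml
<!-- pulls in hadoop-aws, hadoop-openstack, hadoop-azure (on 2.7+) and
     their pinned transitive dependencies, consistent with the Spark build -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-hadoop-cloud_2.11</artifactId>
  <version>2.2.0-SNAPSHOT</version>
</dependency>
```

Because the module's own dependencies use ${hadoop.deps.scope}, the exact JARs a consumer receives track the Hadoop version Spark was compiled against, which is the versioning caveat the description warns about.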