---
layout: global
title: Accessing OpenStack Swift storage from Spark
---

# Accessing OpenStack Swift storage from Spark

Spark's file interface allows it to process data in OpenStack Swift using the same URI formats that are supported for Hadoop. You can specify a path in Swift as input through a URI of the form `swift://<container.service_provider>/path`. You will also need to set your Swift security credentials through `SparkContext.hadoopConfiguration`.
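For example, the credentials can be set on the Hadoop configuration before any Swift path is read. This is only a sketch: `myprovider` and all credential values are hypothetical placeholders, and the property names are the ones described in the configuration section of this page.

```scala
// Assumes an existing SparkContext `sc`. "myprovider" is a hypothetical
// provider name; it must match the <PROVIDER> part of the swift:// URI.
val conf = sc.hadoopConfiguration
conf.set("fs.swift.service.myprovider.auth.url", "http://127.0.0.1:5000/v2.0/tokens") // hypothetical Keystone URL
conf.set("fs.swift.service.myprovider.tenant", "test")
conf.set("fs.swift.service.myprovider.username", "tester")
conf.set("fs.swift.service.myprovider.password", "testing")

// Swift paths can now be read with the swift:// scheme.
val data = sc.textFile("swift://logs.myprovider/data.log")
```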

# Configuring Hadoop to use OpenStack Swift

The OpenStack Swift driver was merged in Hadoop version 2.3.0 ([Swift driver](https://issues.apache.org/jira/browse/HADOOP-8545)). Users who wish to use earlier Hadoop versions will need to configure the Swift driver manually. The current Swift driver requires Swift to use the Keystone authentication method. There are recent efforts to also support temp auth ([HADOOP-10420](https://issues.apache.org/jira/browse/HADOOP-10420)).
To configure Hadoop to work with Swift, one needs to modify Hadoop's core-site.xml and set up the Swift file system:

    <configuration>
      <property>
        <name>fs.swift.impl</name>
        <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
      </property>
    </configuration>

# Configuring Swift

The Swift proxy server should include the `list_endpoints` middleware. More information is available [here](https://github.com/openstack/swift/blob/master/swift/common/middleware/list_endpoints.py).
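Enabling this middleware typically means adding it to the proxy pipeline in Swift's proxy-server.conf. The following is only a sketch; the exact pipeline contents depend on your deployment:

```
[pipeline:main]
pipeline = healthcheck cache authtoken keystoneauth list_endpoints proxy-server

[filter:list_endpoints]
use = egg:swift#list_endpoints
```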

# Configuring Spark

To use the Swift driver, Spark needs to be compiled with `hadoop-openstack-2.3.0.jar`, which is distributed with Hadoop 2.3.0.
For the Maven build, Spark's main pom.xml should include:

    <swift.version>2.3.0</swift.version>
In addition, the pom.xml of the `core` and `yarn` projects should include:
</dependency>
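The dependency declaration above is truncated; a typical declaration for the Swift driver would look like the following (a sketch assuming the standard Apache Hadoop Maven coordinates `org.apache.hadoop:hadoop-openstack`):

```xml
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-openstack</artifactId>
  <version>${swift.version}</version>
</dependency>
```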


Additional parameters have to be provided to the Swift driver, which uses them to perform authentication in Keystone before accessing Swift. The mandatory parameters are `fs.swift.service.<PROVIDER>.auth.url`, `fs.swift.service.<PROVIDER>.auth.endpoint.prefix`, `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`, `fs.swift.service.<PROVIDER>.password`, `fs.swift.service.<PROVIDER>.http.port`, and `fs.swift.service.<PROVIDER>.public`, where `PROVIDER` is any name you choose. `fs.swift.service.<PROVIDER>.auth.url` should point to the Keystone authentication URL.

Create a core-site.xml with the mandatory parameters and place it in the /spark/conf directory. For example:


    <property>
      <name>fs.swift.service.<PROVIDER>.public</name>
      <value>true</value>
    </property>
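Because the example above is truncated, here is a sketch of a complete set of these properties; the URL, prefix, and port values are hypothetical placeholders that must be adapted to your deployment:

```xml
<configuration>
  <property>
    <name>fs.swift.service.<PROVIDER>.auth.url</name>
    <value>http://127.0.0.1:5000/v2.0/tokens</value>
  </property>
  <property>
    <name>fs.swift.service.<PROVIDER>.auth.endpoint.prefix</name>
    <value>endpoints</value>
  </property>
  <property>
    <name>fs.swift.service.<PROVIDER>.http.port</name>
    <value>8080</value>
  </property>
  <property>
    <name>fs.swift.service.<PROVIDER>.public</name>
    <value>true</value>
  </property>
</configuration>
```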

We are left with `fs.swift.service.<PROVIDER>.tenant`, `fs.swift.service.<PROVIDER>.username`, and `fs.swift.service.<PROVIDER>.password`. The best way would be to provide those parameters to SparkContext at run time, which does not seem to be possible yet.
Another approach is to adapt the Swift driver to obtain those values from system environment variables. For now we provide them via core-site.xml.
Assume a tenant `test` with user `tester` was defined in Keystone; then the core-site.xml should include:

    <property>
      <name>fs.swift.service.<PROVIDER>.tenant</name>
      <value>test</value>
    </property>
    <property>
      <name>fs.swift.service.<PROVIDER>.username</name>
      <value>tester</value>
    </property>
    <property>
      <name>fs.swift.service.<PROVIDER>.password</name>
      <value>testing</value>
    </property>

# Usage
Assume there exists a Swift container `logs` with an object `data.log`. To access `data.log` from Spark, the `swift://` scheme should be used.
For example:

    val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")
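A slightly fuller sketch of working with the `swift://` scheme follows. The container and object names are the hypothetical ones from above; writing back assumes the driver supports output paths, which the Hadoop Swift filesystem does.

```scala
// Read the object, filter it, and write the result back to the same container.
val sfdata = sc.textFile("swift://logs.<PROVIDER>/data.log")
val errors = sfdata.filter(line => line.contains("ERROR"))
println(errors.count())
// Output paths use the same swift:// scheme as input paths.
errors.saveAsTextFile("swift://logs.<PROVIDER>/errors")
```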