GitHub - shubhluck/spline: Data Lineage Tracking and Visualization tool for Apache Spark ™

Spline (from Spark lineage) project helps people get insight into data processing performed by Apache Spark ™

The project consists of three main parts:

Spark Agent that sits on drivers, capturing the data lineage from Spark jobs being executed by analyzing the execution plans
REST Gateway that receives the lineage data from the agent and stores it in the database
Web UI application that visualizes the stored data lineages

There are several other tools. Check the examples to get a better idea of how to use Spline.

Other docs/readme files can be found at:

Spline currently supports Spark 2.2+, but in older versions (especially 2.2) the lineage information provided by Spark is limited.

Motivation

Spline aims to fill a big gap within the Apache Hadoop ecosystem. Spark jobs shouldn’t be treated only as magic black boxes; people should be able to understand what happens with their data. Our main focus is to solve the following particular problems:

Regulatory requirement for SA banks (BCBS 239)

By 2020, all South African banks will have to be able to prove how numbers are calculated in their reports to the regulatory authority.

Documentation of business logic

Business analysts should get a chance to verify whether Spark jobs were written according to the rules they provided. Moreover, it would be beneficial for them to have up-to-date documentation where they can refresh their knowledge of a project.

Identification of performance bottlenecks

Our focus is not only business-oriented; we also see Spline as a development tool that should be able to help developers with the performance optimization of their Spark jobs.

Get Spline

To get started, you need to get a minimal set of Spline's moving parts - a server, an admin tool and a client Web UI to see the captured lineage.

There are two ways to do it:

Download prebuilt Spline artifacts from the Maven repo

(REST Server and Web Client modules are also available as Docker containers)

-or-

Build Spline from the source code

Make sure you have JDK 8, Maven and NodeJS installed.

Get and unzip the Spline source code:

```
wget https://github.com/AbsaOSS/spline/archive/release/0.4.0.zip
unzip 0.4.0.zip
```

Change the directory:
```
cd spline-release-0.4.0
```
Run the Maven build:
```
mvn install -DskipTests
```

Install ArangoDB

The Spline server requires ArangoDB to run. Please install ArangoDB 3.5+ according to the official ArangoDB installation instructions.

If you prefer a Docker image, there is a Docker repo as well.

```
docker pull arangodb:3.5.1
```
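Once the image is pulled, a container still has to be started before the database can be initialized. The following is a minimal sketch for local experiments only. The helper name `start_arangodb`, the container name `spline-arangodb`, and the `ARANGO_IMAGE` and `DRY_RUN` variables are assumptions made for this illustration. `ARANGO_NO_AUTH=1` disables authentication in the official image, which matches the credential-free `arangodb://localhost/spline` URL used in this README but is not suitable for production.

```shell
# Hypothetical helper: starts ArangoDB in Docker on its default port 8529.
# With DRY_RUN=1 it only prints the command it would run.
start_arangodb() {
  image="${ARANGO_IMAGE:-arangodb:3.5.1}"
  # Assemble the command as positional parameters so it can be printed or executed.
  set -- docker run -d --name spline-arangodb -p 8529:8529 -e ARANGO_NO_AUTH=1 "$image"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

Call `start_arangodb` to launch the container, or `DRY_RUN=1 start_arangodb` to inspect the command first.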

Create Spline Database

```
java -jar admin/target/admin-0.4.0.jar db-init arangodb://localhost/spline
```

Start Spline Server

The easiest way to spin up the Spline server is to use Docker:

```
docker container run \
      -e spline.database.connectionUrl=arangodb://172.17.0.1/spline \
      -p 8080:8080 \
      absaoss/spline-rest-server
```

Alternatively, you can deploy it as a WAR-file into any Java compatible Web-Container, e.g. Tomcat. The WAR-file is available in the Maven repo as za.co.absa.spline:rest-gateway:0.4.0

Pass the ArangoDB connection string to the container's JVM as a system property: -Dspline.database.connectionUrl=arangodb://localhost/spline
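For a Tomcat deployment, one common place for such a JVM argument is a `setenv.sh` script in Tomcat's `bin` directory, which the Tomcat startup scripts source if it exists. The snippet below is a sketch under that assumption. `SPLINE_DB_URL` is a hypothetical variable introduced here for convenience; other containers have their own mechanism for JVM options.

```shell
# Sketch of $CATALINA_BASE/bin/setenv.sh (assumes a Tomcat deployment).
# SPLINE_DB_URL is a hypothetical override; the default is the URL used in this README.
SPLINE_DB_URL="${SPLINE_DB_URL:-arangodb://localhost/spline}"

# Append the Spline property to whatever JVM options are already configured.
CATALINA_OPTS="${CATALINA_OPTS:-} -Dspline.database.connectionUrl=${SPLINE_DB_URL}"
export CATALINA_OPTS
```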

The server exposes the following REST API:

Producer API (/producer/*)
Consumer API (/consumer/*)

... and other useful URLs:

Running server version information: /about/version
Producer API Swagger documentation: /docs/producer.html
Consumer API Swagger documentation: /docs/consumer.html
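To confirm that the server is up, the version URL listed above can be requested with `curl`. This is a minimal sketch: `SPLINE_SERVER_URL`, `spline_url` and `check_spline_server` are names made up for this example, and the base URL assumes the Docker setup shown earlier (port 8080, no context path). A WAR deployed under a context path would need that path included in the base URL.

```shell
# Base URL of the Spline REST server (assumption: Docker setup from this README).
SPLINE_SERVER_URL="${SPLINE_SERVER_URL:-http://localhost:8080}"

# Join the base URL and a path, tolerating a trailing or leading slash.
spline_url() {
  printf '%s/%s\n' "${SPLINE_SERVER_URL%/}" "${1#/}"
}

# Request the version endpoint; curl -f makes HTTP errors return a non-zero status.
check_spline_server() {
  curl -fsS "$(spline_url /about/version)"
}
```

Running `check_spline_server` prints whatever the server returns for `/about/version`, and `spline_url /docs/producer.html` gives the address of the Producer API documentation to open in a browser.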

Start Spline UI

The Spline web client can be started in three different ways:

Docker:

```
docker container run \
      -e spline.consumer.url=http://172.17.0.1:8080/consumer \
      -p 9090:8080 \
      absaoss/spline-web-client
```

Java compatible Web-Container:

You can find the WAR-file of the Web Client in the repo here: za.co.absa.spline:client-web:0.4.0

Pass the consumer URL to the container's JVM as a system property: -Dspline.consumer.url=http://localhost:8080/consumer

Node JS application:

Download Node.js, then install @angular/cli to run the ng serve or ng build command.

To specify the consumer url please edit the config.json file

You can find the documentation of this module in ClientUI.

Check the result in the browser

http://localhost:9090

Use Spline in your application

Add a dependency on Spark Agent.

```
<dependency>
    <groupId>za.co.absa.spline</groupId>
    <artifactId>spark-agent</artifactId>
    <version>0.4.0</version>
</dependency>
```

In your Spark job, you have to enable Spline.

```
import org.apache.spark.sql.SparkSession

// given a Spark session ...
val sparkSession: SparkSession = ???

// ... enable data lineage tracking with Spline
import za.co.absa.spline.harvester.SparkLineageInitializer._
sparkSession.enableLineageTracking()

// ... then run some Dataset computations as usual.
// Data lineage of the job will be captured and stored in the
// configured database for further visualization by Spline Web UI
```

Properties

You also need to set some configuration properties. Spline combines these properties from several sources:

Hadoop config (core-site.xml)
JVM system properties
spline.properties file in the classpath
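Of these sources, JVM system properties are convenient for per-job overrides. The sketch below builds the `-D` flags for the two properties described later in this section and hands them to `spark-submit` through its `--driver-java-options` flag. The helper name `spline_driver_opts`, the class `com.example.MyJob` and the jar `my-job.jar` are placeholders invented for this example. It assumes client deploy mode; in cluster mode the same flags would typically go into `spark.driver.extraJavaOptions`.

```shell
# Hypothetical helper: prints JVM flags for the Spline properties.
#   $1 = spline.mode value (defaults to BEST_EFFORT)
#   $2 = spline.producer.url value
spline_driver_opts() {
  mode="${1:-BEST_EFFORT}"
  producer_url="$2"
  printf '%s\n' "-Dspline.mode=${mode} -Dspline.producer.url=${producer_url}"
}

# Usage (placeholders: com.example.MyJob, my-job.jar):
#   spark-submit \
#     --driver-java-options "$(spline_driver_opts REQUIRED http://localhost:8080/spline)" \
#     --class com.example.MyJob my-job.jar
```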

`spline.mode`

DISABLED Lineage tracking is completely disabled and Spline is unhooked from Spark.
REQUIRED If Spline fails to initialize itself (e.g. wrong configuration, no db connection etc) the Spark application aborts with an error.
BEST_EFFORT (default) Spline will try to initialize itself, but if it fails it switches to DISABLED mode, allowing the Spark application to proceed normally without lineage tracking.
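The three modes differ only in what happens when initialization fails. The function below is a toy model of the behavior described above, not Spline's actual implementation; the function name and the outcome labels (`tracking`, `no-tracking`, `abort`) are made up for this illustration.

```shell
# Toy model of the documented spline.mode semantics (not Spline's real code).
#   $1 = mode: DISABLED | REQUIRED | BEST_EFFORT
#   $2 = "yes" if Spline initialized successfully, anything else means failure
spline_outcome() {
  mode="$1"
  init_ok="$2"
  case "$mode" in
    DISABLED)
      # Tracking is off regardless of whether initialization would succeed.
      echo "no-tracking" ;;
    REQUIRED)
      # A failed initialization aborts the Spark application.
      if [ "$init_ok" = "yes" ]; then echo "tracking"; else echo "abort"; fi ;;
    BEST_EFFORT)
      # A failed initialization falls back to running without tracking.
      if [ "$init_ok" = "yes" ]; then echo "tracking"; else echo "no-tracking"; fi ;;
    *)
      echo "unknown mode: $mode" >&2
      return 2 ;;
  esac
}
```

For example, `spline_outcome REQUIRED no` models a job that stops because the producer is unreachable, while `spline_outcome BEST_EFFORT no` models the same job continuing without lineage.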

`spline.producer.url`

URL of the Spline producer (the part of the REST gateway responsible for storing lineages in the database)

Example:

```
spline.mode=REQUIRED
spline.producer.url=http://localhost:8080/spline
```
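One way to get these values onto the classpath is a `spline.properties` file in the project's resources directory. The sketch below assumes a standard Maven layout, where `src/main/resources` is packaged onto the classpath; `RESOURCES_DIR` is a hypothetical variable for pointing it elsewhere. Note that it overwrites any existing `spline.properties` in that directory.

```shell
# Directory that ends up on the application classpath
# (assumption: Maven's default src/main/resources).
RESOURCES_DIR="${RESOURCES_DIR:-src/main/resources}"
mkdir -p "$RESOURCES_DIR"

# Write the example configuration from this README; overwrites an existing file.
cat > "$RESOURCES_DIR/spline.properties" <<'EOF'
spline.mode=REQUIRED
spline.producer.url=http://localhost:8080/spline
EOF
```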

Copyright 2019 ABSA Group Limited

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
