
Getting Started

  • Author: Ziyang
  • Reviewer: Lin

This page guides you through setting up Gobblin and running a quick and simple first job.

Download and Build

  • Check out Gobblin from GitHub:

git clone https://github.com/linkedin/gobblin.git

  • Build Gobblin: Gobblin is built using Gradle. cd into the cloned gobblin folder and run

./gradlew clean build

Gobblin provides a Gradle wrapper, which means you do not need to install Gradle before building Gobblin.

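If the full build is slow because it runs the test suite, Gradle's standard -x flag can exclude the test task while you iterate, for example:

./gradlew clean build -x test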

Run Your First Job

Here we illustrate how to run a simple job. This job will pull the five latest revisions of each of the four Wikipedia pages: NASA, LinkedIn, Parris_Cues and Barbara_Corcoran. If the job runs successfully, a total of 20 records, each corresponding to one revision, should be pulled. The records will be stored as Avro files.

Gobblin can run either in standalone mode or on MapReduce. In this example we will run Gobblin in standalone mode.

Introduction

Each Gobblin job minimally involves three interfaces: Source, Extractor, and DataWriter. As the names suggest, Source defines the source to pull data from, Extractor implements the logic to extract data records, and DataWriter defines how the extracted records are output. A job may optionally have one or more Converters, which transform the extracted records (the "T" in ETL), as well as one or more QualityCheckers, which check the quality of the extracted records and determine whether they conform to certain policies.
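To make these roles concrete, below is a minimal, self-contained Java analogy of the three contracts. It is not Gobblin's actual API (the real interfaces live in the Gobblin source tree); it only mirrors the division of labor described above.

import java.util.Arrays;
import java.util.Iterator;

// Toy analogy of Gobblin's core roles (Source, Extractor, DataWriter).
// This is NOT the real Gobblin API; it only mirrors the division of labor.
public class EtlSketch {

    // "Extractor": pulls records one at a time; null signals exhaustion.
    interface Extractor<D> {
        D readRecord();
    }

    // "Source": knows where the data lives and hands out extractors.
    interface Source<D> {
        Extractor<D> getExtractor();
    }

    // "DataWriter": persists the extracted (and possibly converted) records.
    interface DataWriter<D> {
        void write(D record);
    }

    // A source over a fixed in-memory list, standing in for e.g. Wikipedia.
    static class ListSource implements Source<String> {
        private final Iterator<String> it =
                Arrays.asList("rev-1", "rev-2", "rev-3").iterator();

        public Extractor<String> getExtractor() {
            return new Extractor<String>() {
                public String readRecord() {
                    return it.hasNext() ? it.next() : null;
                }
            };
        }
    }

    public static void main(String[] args) {
        Source<String> source = new ListSource();
        DataWriter<String> writer = new DataWriter<String>() {
            public void write(String record) {
                System.out.println("wrote " + record);
            }
        };

        Extractor<String> extractor = source.getExtractor();
        for (String r = extractor.readRecord(); r != null; r = extractor.readRecord()) {
            writer.write(r); // a Converter would transform r before this call
        }
    }
}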

Some of the classes relevant to this example include WikipediaSource, WikipediaExtractor, WikipediaConverter, and AvroHdfsDataWriter. We will not use QualityCheckers in this example.

To run the job we need a configuration file for Gobblin (this example uses gobblin-test.properties), as well as a configuration file for the job (this example uses wikipedia.pull). The Gobblin configuration file is passed as an argument when launching Gobblin. Job configuration files should be placed in the directory specified by jobconf.dir in the Gobblin configuration file. By default, Gobblin treats each .pull and .job file in this directory as a job configuration file (this is customizable via jobconf.extensions) and launches a job for each such file.
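For orientation, the Gobblin configuration file is a plain Java properties file. A minimal sketch of the two properties mentioned above might look like this (the path is a placeholder, and the jobconf.extensions value shown is an assumption about the default format):

# Sketch of the relevant lines in a Gobblin configuration file
# such as gobblin-test.properties
jobconf.dir=/path/to/job-conf-dir
# optional: file extensions treated as job configurations
jobconf.extensions=pull,job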

Steps

  • Create a folder to store the job configuration file. Put wikipedia.pull in this folder, and set environment variable GOBBLIN_JOB_CONFIG_DIR to point to this folder.

  • Create a folder as Gobblin's working directory. Gobblin will write the job output as well as some other information there. Set environment variable GOBBLIN_WORK_DIR to point to that folder.
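For example (both paths are placeholders for the folders you created):

export GOBBLIN_JOB_CONFIG_DIR=/path/to/job-conf-dir
export GOBBLIN_WORK_DIR=/path/to/gobblin-work-dir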

  • Unpack the Gobblin distribution tarball (gobblin-dist.tar.gz, produced by the build):

tar -zxvf gobblin-dist.tar.gz
cd gobblin-dist


  • Launch Gobblin:

bin/gobblin-test.sh start

This script will launch Gobblin and pass the Gobblin configuration file (gobblin-test.properties) as an argument.

The job log, which contains the progress and status of the job, will be written into logs/gobblin-current.log (to change where the log is written, modify the Log4j configuration file conf/log4j-test.xml). Stdout will be written into nohup.out.
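To follow the job's progress while it runs, you can tail the log file mentioned above:

tail -f logs/gobblin-current.log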

Among the job logs there should be the following information:

INFO JobScheduler - Loaded 1 job configuration
INFO  AbstractJobLauncher - Starting job job_PullFromWikipedia_1422040355678
INFO  TaskExecutor - Starting the task executor
INFO  LocalTaskStateTracker2 - Starting the local task state tracker
INFO  AbstractJobLauncher - Submitting task task_PullFromWikipedia_1422040355678_0 to run
INFO  TaskExecutor - Submitting task task_PullFromWikipedia_1422040355678_0
INFO  AbstractJobLauncher - Waiting for submitted tasks of job job_PullFromWikipedia_1422040355678 to complete...
INFO  AbstractJobLauncher - 1 out of 1 tasks of job job_PullFromWikipedia_1422040355678 are running
INFO  WikipediaExtractor - 5 record(s) retrieved for title NASA
INFO  WikipediaExtractor - 5 record(s) retrieved for title LinkedIn
INFO  WikipediaExtractor - 5 record(s) retrieved for title Parris_Cues
INFO  WikipediaExtractor - 5 record(s) retrieved for title Barbara_Corcoran
INFO  Task - Extracted 20 data records
INFO  Fork-0 - Committing data of branch 0 of task task_PullFromWikipedia_1422040355678_0
INFO  LocalTaskStateTracker2 - Task task_PullFromWikipedia_1422040355678_0 completed in 2334ms with state SUCCESSFUL
INFO  AbstractJobLauncher - All tasks of job job_PullFromWikipedia_1422040355678 have completed
INFO  TaskExecutor - Stopping the task executor 
INFO  LocalTaskStateTracker2 - Stopping the local task state tracker
INFO  AbstractJobLauncher - Publishing job data of job job_PullFromWikipedia_1422040355678 with commit policy COMMIT_ON_FULL_SUCCESS
INFO  AbstractJobLauncher - Persisting job/task states of job job_PullFromWikipedia_1422040355678
  • After the job is done, stop Gobblin by running

bin/gobblin-test.sh stop

The job output is written to the GOBBLIN_WORK_DIR/job-output folder as an Avro file.

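The exact Avro file name is generated per job run, so list the output folder to find it (a recursive listing, in case the writer nests the file in per-job subfolders):

ls -R $GOBBLIN_WORK_DIR/job-output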

To see the content of the job output, use Avro tools to convert the Avro file to JSON. Download the latest version of Avro tools (e.g., avro-tools-1.7.7.jar) from the Avro download page, and run

java -jar avro-tools-1.7.7.jar tojson --pretty [job_output].avro > output.json

output.json will contain all retrieved records in JSON format.

Note that since the job configuration file we used (wikipedia.pull) doesn't specify a job schedule, the job runs immediately and only once. To schedule a job to run at a certain time and/or repeatedly, set the job.schedule property with a cron-based syntax. For example, job.schedule=0 0/2 * * * ? runs the job every two minutes. See the Quartz CronTrigger documentation for more details.
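As a concrete sketch, scheduling this example job every two minutes would mean adding the line from above to wikipedia.pull:

# added to wikipedia.pull: run the job every two minutes instead of once
job.schedule=0 0/2 * * * ?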

Other Example Jobs

Besides the Wikipedia example, we have another example job, SimpleJson, which extracts records from JSON files and stores them in Avro files.

To create your own jobs, simply implement the relevant interfaces such as Source, Extractor, Converter, QualityChecker and DataWriter.

On a side note: while users are free to directly implement the Extractor interface (e.g., WikipediaExtractor), Gobblin also provides several extractor implementations based on commonly used protocols, e.g., RestApiExtractor, JdbcExtractor, SftpExtractor, etc. Users are encouraged to extend these classes to take advantage of existing implementations.
