
Publishing Data to S3

sahilTakiar edited this page Dec 31, 2015 · 57 revisions

Introduction

While Gobblin is not tied to any specific cloud provider, Amazon Web Services is a popular choice. This document outlines how Gobblin can publish data to S3. Specifically, it provides a step-by-step guide to setting up Gobblin on Amazon EC2, running Gobblin on EC2, and publishing data from EC2 to S3.

An important thing to note is that it is recommended to configure Gobblin to first write data to EBS, and then publish the data to S3. This is the recommended approach because there are a few caveats when working with S3. See the Hadoop and S3 section for more details.

This document also provides a step-by-step guide for launching and configuring an EC2 instance and an S3 bucket. However, it is by no means an authoritative guide to working with AWS; it only provides high-level steps. The best place to learn how to use AWS is the Amazon documentation.

Hadoop and S3

A majority of Gobblin's code base uses Hadoop's FileSystem object to read and write data. The FileSystem object is an abstract class, and typical implementations either write to the local file system, or write to HDFS. There has been significant work to create an implementation of the FileSystem object that reads and writes to S3. The best guide to read about the different S3 FileSystem implementations is here.

There are a few different S3 FileSystem implementations; the two of note are the s3a and the s3 file systems. The s3a file system is relatively new and is only available in Hadoop 2.6.0 and later (see the original JIRA for more information). The s3 file system has been around for a while.

The s3a File System

The s3a file system uploads files to a specified bucket. The data uploaded to S3 via this file system is interoperable with other S3 tools. However, there are a few caveats when working with this file system:

  • Since S3 does not support renaming of files in a bucket, the S3AFileSystem.rename(Path, Path) operation will actually copy data from the source Path to the destination Path, and then delete the source Path (see the source code for more information).
  • When creating a file using S3AFileSystem.create(...) data will be first written to a staging file on the local file system, and when the file is closed, the staging file will be uploaded to S3 (see the source code for more information).

Thus, when using the s3a file system with Gobblin, it is recommended to configure Gobblin to first write its staging data to the local file system, and then publish the data to S3. The reason is that each Gobblin Task writes data to a staging file, and once the file has been completely written, the Task moves it to an output directory (using a rename function). Finally, the DataPublisher moves the files from the output directory to the final directory (again using a rename function). This requires two rename operations, each of which is a full copy followed by a delete on S3, so it would be very inefficient if a Task wrote directly to S3.

Furthermore, writing directly to S3 requires creating a staging file on the local file system, and then creating a PutObjectRequest to upload the data to S3. This is logically equivalent to just configuring Gobblin to write to a local file and then publishing it to S3.
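The flow described above can be sketched with local directories standing in for each location. This is only an illustration under stated assumptions: the directory names, the file name, and the `s3_style_rename` helper are hypothetical, and a plain `cp` followed by `rm` stands in for the copy-then-delete behavior attributed to `S3AFileSystem.rename`. No Gobblin or AWS code is involved.

```shell
# Local directories standing in for Gobblin's staging, output, and
# final locations (all paths here are hypothetical).
WORK_DIR="$(mktemp -d)"
STAGING_DIR="$WORK_DIR/task-staging"
OUTPUT_DIR="$WORK_DIR/task-output"
FINAL_DIR="$WORK_DIR/final"   # stands in for the S3 bucket
mkdir -p "$STAGING_DIR" "$OUTPUT_DIR" "$FINAL_DIR"

# Emulates an S3-style rename as described above: copy, then delete.
s3_style_rename() {
    cp "$1" "$2" && rm "$1"
}

# 1. The Task writes its records to a staging file.
printf 'record-1\nrecord-2\n' > "$STAGING_DIR/part-0.txt"

# 2. First rename (staging -> output). On a local file system this is a
#    cheap metadata operation; on S3 it would copy every byte.
mv "$STAGING_DIR/part-0.txt" "$OUTPUT_DIR/part-0.txt"

# 3. Publish (output -> final). With local staging, this is the only
#    step that has to move the data itself.
s3_style_rename "$OUTPUT_DIR/part-0.txt" "$FINAL_DIR/part-0.txt"

ls "$FINAL_DIR"
```

If both the staging and output directories lived on S3 instead, step 2 would also have to go through the copy-then-delete path, which is the extra cost the recommendation avoids.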

For more information on how to configure Gobblin to publish data to S3, see the Gobblin S3 Configuration section.

The s3 File System

The s3 file system stores files as blocks, similar to how HDFS stores files. This makes renaming files more efficient, but data written using this file system is not interoperable with other S3 tools. This limitation may make this file system less desirable, so the majority of this document focuses on the s3a file system.

Gobblin S3 Configuration

This section will outline the different configuration parameters required to get Gobblin to publish to S3.
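As a starting point, the approach recommended in the Hadoop and S3 section (write locally, publish to S3) might translate into a job-configuration fragment like the one generated below. This is a sketch, not a verified configuration: the key names `writer.fs.uri`, `data.publisher.fs.uri`, and `data.publisher.final.dir` reflect my understanding of Gobblin's configuration keys and should be checked against the Gobblin version in use, while the `write_s3_publish_config` helper and the `/gobblin/job-output` directory are hypothetical.

```shell
# Write a minimal, hypothetical Gobblin job-config fragment.
# Arguments: output file, S3 bucket name.
write_s3_publish_config() {
    config_file="$1"; bucket="$2"
    cat > "$config_file" <<EOF
# Tasks write staging and output data to the local file system.
writer.fs.uri=file:///
# The DataPublisher copies finished files to the S3 bucket via s3a.
data.publisher.fs.uri=s3a://$bucket/
# Hypothetical destination directory inside the bucket.
data.publisher.final.dir=/gobblin/job-output
EOF
}

# Example usage (left commented out):
#   write_s3_publish_config s3-publish.properties gobblin-demo-bucket
```

The resulting lines would be merged into an existing job file rather than used on their own, since a real job also needs source, extractor, and writer settings.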

Getting Gobblin to Publish to S3

This section provides a step-by-step guide to first setting up an EC2 instance and an S3 bucket, and then installing and configuring Gobblin to run on EC2 and publish data to S3.

This guide uses the free tier provided by AWS to set up EC2 and S3.

Signing Up For AWS

In order to use EC2 and S3, one first needs to sign up for an AWS account. The easiest way to get started with AWS is to use their free tier.

Setting Up EC2

Launching an EC2 Instance

Once you have an AWS account, log in to the AWS console. Select the EC2 link, which will bring you to the EC2 dashboard.

Click on Launch Instance to create a new EC2 instance. Before the instance actually starts to run, there are a few more configuration steps necessary:

  1. Choosing an Amazon Machine Image (AMI)
    • For this walkthrough we will pick Red Hat Enterprise Linux (RHEL)
  2. Choosing an Instance Type
    • Since this walkthrough uses the Amazon Free Tier, we will pick the General Purpose t2.micro instance
    • This instance provides us with 1 vCPU and 1 GiB of RAM
    • For more information on other instance types, check out the docs
  3. Click Review and Launch
    • We will use the defaults for all other settings options.
    • When reviewing your instance, you will most likely get a warning saying access to your EC2 instance is open to the world
    • If you want to fix this you have to edit the Security Groups; how to do that is out of the scope of this document
  4. Setting up SSH Keys
    • After reviewing your instance, click Launch
    • You should be prompted to set up SSH keys
    • Use an existing key pair if you have one, otherwise create a new one and download it
    • You should be taken to a page called Launch Status that indicates your instance has been launched
    • Click on View Instances to monitor your launched EC2 instances
  5. SSH to Launched Instance
    • SSH using the following command: ssh -i my-private-key-file.pem ec2-user@instance-name
      • The instance-name can be taken from the Public DNS field from the instance information
      • SSH may complain that the private key file has insufficient permissions
        • Execute chmod 600 my-private-key-file.pem to fix this
      • Alternatively, one can modify the ~/.ssh/config file instead of specifying the -i option
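The key-permission fix and the `~/.ssh/config` alternative from step 5 can be sketched as two small helpers. The function names and the `gobblin-ec2` host alias are hypothetical; `Host`, `HostName`, `User`, and `IdentityFile` are standard OpenSSH client options. The helpers take the file paths as arguments so they can be tried against a scratch file before touching a real configuration.

```shell
# Restrict the private key so only its owner can read or write it.
secure_key_file() {
    chmod 600 "$1"
}

# Append a Host entry to an SSH client config file.
# Arguments: config file, alias, host name (Public DNS), key file.
add_ssh_host_entry() {
    config_file="$1"; host_alias="$2"; host_name="$3"; key_file="$4"
    cat >> "$config_file" <<EOF
Host $host_alias
    HostName $host_name
    User ec2-user
    IdentityFile $key_file
EOF
}

# Example usage (placeholder values, left commented out):
#   secure_key_file ~/my-private-key-file.pem
#   add_ssh_host_entry ~/.ssh/config gobblin-ec2 instance-name ~/my-private-key-file.pem
#   ssh gobblin-ec2
```

With such an entry in place, `ssh gobblin-ec2` should be equivalent to the longer `ssh -i ...` command above, and the same alias can be used with `scp`.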

After following the above steps, you should be able to freely SSH into the launched EC2 instance, and monitor / control the instance from the EC2 dashboard.

EC2 Package Installations

Before setting up Gobblin, you need to install Java. Depending on the AMI you are running, Java may or may not already be installed (you can check by executing java -version).

Installing Java

  1. Execute sudo yum install java-1.8.0-openjdk* to install Open JDK 8
  2. Confirm the installation was successful by executing java -version
  3. Set the JAVA_HOME environment variable in the ~/.bashrc file
    • The value for JAVA_HOME can be derived from the output of readlink -f `which java`: it is that path with the trailing /bin/java removed
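Step 3 can be sketched as follows. The `java_home_from_binary` helper is a hypothetical name; it simply strips the last two path components (`/bin/java`) from the resolved location of the `java` binary. The lookup is guarded so the snippet does nothing when Java is not on the `PATH`, and it assumes a `readlink` that supports `-f`, as on the RHEL image used in this walkthrough.

```shell
# Derive JAVA_HOME from the resolved path of the java binary by
# stripping the trailing /bin/java.
java_home_from_binary() {
    dirname "$(dirname "$1")"
}

# Only attempt the lookup when java is on the PATH.
if command -v java >/dev/null 2>&1; then
    JAVA_HOME="$(java_home_from_binary "$(readlink -f "$(command -v java)")")"
    export JAVA_HOME
    echo "JAVA_HOME=$JAVA_HOME"
fi
```

To make the setting permanent, the printed value can be added to `~/.bashrc` as an `export JAVA_HOME=...` line.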

Setting Up S3

Go to the S3 dashboard

  1. Click on Create Bucket
    • Enter a name for the bucket (e.g. gobblin-demo-bucket)
    • Enter a [Region](http://docs.aws.amazon.com/general/latest/gr/rande.html) for the bucket (e.g. US Standard)

Setting Up Gobblin on EC2

  1. Download and Build Gobblin Locally
    • On your local machine, clone the Gobblin repository: git clone git@github.com:linkedin/gobblin.git (this assumes you have Git installed locally)
    • Build Gobblin using the following commands: cd gobblin, ./gradlew clean build -PuseHadoop2 -PhadoopVersion=2.6.0 -x test
      • It is important to use Hadoop version 2.6.0 as it includes the s3a file system implementation
  2. Upload the Gobblin Tar to EC2
    • Execute the command: scp -i my-private-key-file.pem gobblin-dist.tar.gz ec2-user@instance-name:
  3. Un-tar the Gobblin Distribution
    • SSH to the EC2 Instance
    • Un-tar the Gobblin distribution: tar -xvf gobblin-dist.tar.gz
  4. Download AWS Libraries
    • A few JARs need to be downloaded using some cURL commands:
      • curl http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar > gobblin-dist/lib/aws-java-sdk-1.7.4.jar
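One caveat with redirecting `curl` output using `>` is that a failed request can leave an error page saved under the JAR's name. A slightly more defensive variant, sketched below, downloads to a temporary file and only moves it into place on success. The `download_jar` helper is a hypothetical name; `-f`, `-L`, and `-o` are standard `curl` options.

```shell
# Download a URL to a destination path, leaving no file behind on failure.
# Arguments: URL, destination path.
download_jar() {
    url="$1"; dest="$2"
    # -f: fail on HTTP errors, -L: follow redirects, -o: write to a file.
    if curl -fL -o "$dest.part" "$url"; then
        mv "$dest.part" "$dest"
    else
        rm -f "$dest.part"
        echo "download failed: $url" >&2
        return 1
    fi
}

# Example usage with the JAR from the step above (left commented out):
#   download_jar \
#     http://central.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar \
#     gobblin-dist/lib/aws-java-sdk-1.7.4.jar
```

Because the function returns a non-zero status on failure, it can be chained with `&&` so later setup steps only run once the JAR is actually in place.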

Running Gobblin on EC2

AWS IAM Roles
