Poor Man's Data Pipeline

A minimal way to get an extremely robust, scalable, and cheap data pipeline up and running.

Motivation

Most data pipelines require significant infrastructure to get up and running. The intent of Poor Man's Data Pipeline is to use a handful of "serverless" components to build a pipeline that has very few points of failure while still scaling to large volumes at low cost.

How it works

The pipeline works by sending requests to an Amazon Elastic Load Balancer with access logging enabled. The ELB writes its access logs to S3, where they are parsed and aggregated by simple Lambda functions.
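
For reference, a classic ELB access log line looks roughly like the following (the values are illustrative, and the exact fields depend on your listener configuration):

2016-01-01T23:59:59.123456Z my-elb 203.0.113.10:54321 - -1 -1 -1 503 0 0 0 "GET https://tracker.example.com:443/track?event=click HTTP/1.1" "Mozilla/5.0" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2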

Components

This project is still a work in progress, but the components are listed below. A simple way to see how they're wired together is to look at the parse_elb_log_lambda.py file, which is configured to be called by an AWS Lambda function; a rough sketch of the same flow follows the list.

  • Line parser: This is responsible for parsing each line of an access log file pulled from S3. This is what you will need to change to match your own log format and the fields you want to extract.
  • File parser: This just runs the line parser across each line of the log file. There are a few options here for testing, but generally you'll want to use the S3Parser class.
  • Summary writer: This takes the result of the parse and exports it. At the moment there is only a simple writer back to S3, but one could add functionality to write to a database or send the summary to another service.
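
As a rough sketch of how these three pieces might fit together (the function names below are illustrative rather than the actual classes in parse_elb_log_lambda.py, and the field positions assume the classic ELB access log format):

import shlex
from collections import Counter
from urllib.parse import urlparse

def parse_line(line):
    # Classic ELB access log lines are space separated, with the request
    # and user agent quoted; shlex.split handles the quoting.
    fields = shlex.split(line)
    date = fields[0][:10]          # timestamp -> YYYY-MM-DD
    url = fields[11].split()[1]    # "METHOD url HTTP/1.1" -> url
    return date, urlparse(url).path

def parse_file(lines):
    # The "file parser": run the line parser over every line and
    # count occurrences of each (date, path) pair.
    return Counter(parse_line(line) for line in lines)

def write_summary(counts):
    # The "summary writer": export the aggregate, e.g. back to S3,
    # to a database, or to another service.
    for (date, path), count in sorted(counts.items()):
        print(date, path, count)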

How to get it working

You will need to do two things.

  1. Set up an Elastic Load Balancer and enable access logging. You don't need to connect any instances to it; it will still log every request, though responses from the ELB will have a 503 status code. If you'd rather script this step than click through the console, a boto3 sketch appears at the end of this section.

  2. Set up an AWS Lambda function to parse the resulting access logs. The default code does a simple count grouped by date and path and then uploads the summary back to the original bucket. You can deploy it by creating a zip archive and uploading it through the AWS console; a rough sketch of the handler's shape follows the packaging commands below.

cd pmdp
zip -r lambda.zip *
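
The actual handler lives in parse_elb_log_lambda.py; purely to illustrate the overall shape, a Lambda wired to S3 "object created" notifications might look something like this (parse_line is the hypothetical helper from the component sketch above, and the summary key is a placeholder):

import boto3
from collections import Counter

s3 = boto3.client('s3')

def handler(event, context):
    # Each S3 notification names the bucket and key of the new ELB log file.
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    # parse_line: hypothetical helper sketched in the Components section above.
    counts = Counter(parse_line(line) for line in body.splitlines())

    # Write a simple CSV summary back to the same bucket.
    summary = '\n'.join('%s,%s,%d' % (d, p, n) for (d, p), n in sorted(counts.items()))
    s3.put_object(Bucket=bucket, Key='summaries/' + key + '.csv', Body=summary.encode('utf-8'))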

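If you'd rather script step 1 than use the console, something like the following boto3 call can enable access logging on a classic ELB (the load balancer and bucket names are placeholders, and the bucket also needs a policy that lets the regional ELB account write to it):

import boto3

elb = boto3.client('elb')  # classic ELB API

elb.modify_load_balancer_attributes(
    LoadBalancerName='pmdp-tracker',             # placeholder name
    LoadBalancerAttributes={
        'AccessLog': {
            'Enabled': True,
            'S3BucketName': 'pmdp-access-logs',  # placeholder bucket
            'S3BucketPrefix': 'raw',
            'EmitInterval': 5,                   # minutes between log deliveries (5 or 60)
        }
    },
)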
