Back to Index

Next: Data Ingestion


Introduction to Data Engineering

Data Engineering is the design and development of systems for collecting, storing and analyzing data at scale.

Architecture

During the course we will replicate the following architecture:

architecture diagram

  • New York City's Taxi and Limousine Commission (TLC) Trip Record Data: the dataset we will use during the course.
  • Spark: analytics engine for large-scale data processing (distributed processing).
  • Google BigQuery: serverless data warehouse (central repository of integrated data from one or more disparate sources).
  • Airflow: workflow management platform for data engineering pipelines. In other words, a pipeline orchestration tool.
  • Kafka: unified, high-throughput, low-latency platform for handling real-time data feeds (streaming).

Data pipelines

A data pipeline is a service that receives data as input and produces processed data as output. For example, reading a CSV file, transforming the data somehow and storing it as a table in a PostgreSQL database.

data pipeline
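A minimal sketch of such a pipeline in Python could look like the code below. The file name, table name and transformation are made up for illustration, and the connection string assumes a local Postgres database like the one we will run later in this lesson.

import pandas as pd
from sqlalchemy import create_engine

# input: read a CSV file
df = pd.read_csv('data.csv')

# transform: as an example, keep only trips with at least one passenger
df = df[df['passenger_count'] > 0]

# output: store the result as a table in a PostgreSQL database
engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')
df.to_sql(name='example_table', con=engine, if_exists='replace')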

Back to the top

Docker and Postgres

Docker basic concepts

(Video source)

Docker is containerization software that allows us to isolate software in a manner similar to virtual machines, but in a much leaner way.

A Docker image is a snapshot of a container that we can define to run our software, or in this case our data pipelines. By exporting our Docker images to Cloud providers such as Amazon Web Services or Google Cloud Platform we can run our containers there.

Docker provides the following advantages:

  • Reproducibility
  • Local experimentation
  • Integration tests (CI/CD)
  • Running pipelines on the cloud (AWS Batch, Kubernetes jobs)
  • Spark (analytics engine for large-scale data processing)
  • Serverless (AWS Lambda, Google functions)

Docker containers are stateless: any changes done inside a container will NOT be saved when the container is killed and started again. This is an advantage because it allows us to restore any container to its initial state in a reproducible manner, but you will have to store data elsewhere if you need to persist it; a common way to do so is with volumes.

Note: you can learn more about Docker and how to set it up on a Mac in this link. You may also be interested in a Docker reference cheatsheet.

Creating a custom pipeline with Docker

(Video source)

Let's create an example pipeline: a dummy pipeline.py Python script that receives an argument and prints it.

import sys
import pandas # we don't need this but it's useful for the example

# print arguments
print(sys.argv)

# argument 0 is the name of the file
# argument 1 contains the actual first argument we care about
day = sys.argv[1]

# cool pandas stuff goes here

# print a sentence with the argument
print(f'job finished successfully for day = {day}')

We can run this script with python pipeline.py <some_number> and it should print 2 lines:

  • ['pipeline.py', '<some_number>']
  • job finished successfully for day = <some_number>

Let's containerize it by creating a Docker image. Create the following Dockerfile:

# base Docker image that we will build on
FROM python:3.9.1

# set up our image by installing prerequisites; pandas in this case
RUN pip install pandas

# set up the working directory inside the container
WORKDIR /app
# copy the script to the container. 1st name is source file, 2nd is destination
COPY pipeline.py pipeline.py

# define what to do first when the container runs
# in this example, we will just run the script
ENTRYPOINT ["python", "pipeline.py"]

Let's build the image:

docker build -t test:pandas .
  • The image name will be test and its tag will be pandas. If the tag isn't specified it will default to latest.

We can now run the container and pass an argument to it, so that our pipeline will receive it:

docker run -it test:pandas some_number

You should get the same output you did when you ran the pipeline script by itself.

Note: these instructions assume that pipeline.py and Dockerfile are in the same directory. The Docker commands should also be run from that same directory.

Running Postgres in a container

(Video source)

In later parts of the course we will use Airflow, which uses PostgreSQL internally. For simpler tests we can use PostgreSQL (or just Postgres) directly.

You can run a containerized version of Postgres that doesn't require any installation steps. You only need to provide a few environment variables to it as well as a volume for storing data.

Create a folder anywhere you'd like for Postgres to store data in. We will use the example folder ny_taxi_postgres_data. Here's how to run the container:

docker run -it \
    -e POSTGRES_USER="root" \
    -e POSTGRES_PASSWORD="root" \
    -e POSTGRES_DB="ny_taxi" \
    -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
    -p 5432:5432 \
    postgres:13
  • The container needs 3 environment variables:
    • POSTGRES_USER is the username for logging into the database. We chose root.
    • POSTGRES_PASSWORD is the password for the database. We chose root.
      • IMPORTANT: These values are only meant for testing. Please change them for any serious project.
    • POSTGRES_DB is the name that we will give the database. We chose ny_taxi.
  • -v points to the volume directory. The colon : separates the first part (path to the folder in the host computer) from the second part (path to the folder inside the container).
    • Path names must be absolute. If you're in a UNIX-like system, you can use pwd to print your local folder as a shortcut; this example should work with both bash and zsh shells, but fish will require you to remove the $.
    • This command will only work if you run it from a directory which contains the ny_taxi_postgres_data subdirectory you created above.
  • The -p is for port mapping. We map the default Postgres port to the same port in the host.
  • The last argument is the image name and tag. We run the official postgres image on its version 13.

Once the container is running, we can log into our database with pgcli with the following command:

pgcli -h localhost -p 5432 -u root -d ny_taxi
  • -h is the host. Since we're running locally we can use localhost.
  • -p is the port.
  • -u is the username.
  • -d is the database name.
  • The password is not provided; it will be requested after running the command.

Ingesting data to Postgres with Python

(Video source)

We will now create a Jupyter Notebook called upload-data.ipynb which we will use to read a CSV file and export it to Postgres.

We will use data from the NYC TLC Trip Record Data website. Specifically, we will use the Yellow taxi trip records CSV file for January 2021. A dictionary to understand each field is available here.

Note: knowledge of Jupyter Notebook, Python environment management and Pandas is assumed in these notes. Please check this link for a Pandas cheatsheet and this link for a Conda cheatsheet for Python environment management.

Check the completed upload-data.ipynb in this link for a detailed guide. Feel free to copy the file to your work directory; in the same directory you will need to have the CSV file linked above and the ny_taxi_postgres_data subdirectory.
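The gist of the notebook is roughly the following sketch: read the CSV in chunks with Pandas and write each chunk to a table in Postgres through SQLAlchemy. The chunk size and exact steps here are illustrative; the linked notebook is the reference.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://root:root@localhost:5432/ny_taxi')

# the file is large, so read it in chunks instead of all at once
df_iter = pd.read_csv('yellow_tripdata_2021-01.csv', iterator=True, chunksize=100000)

df = next(df_iter)

# the pick up and drop off columns are read as text; convert them to timestamps
df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

# create the table with the column definitions only (no rows yet)
df.head(n=0).to_sql(name='yellow_taxi_data', con=engine, if_exists='replace')

# insert the first chunk, then the remaining ones
df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')

for df in df_iter:
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
    df.to_sql(name='yellow_taxi_data', con=engine, if_exists='append')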

Connecting pgAdmin and Postgres with Docker networking

(Video source)

pgcli is a handy tool but it's cumbersome to use. pgAdmin is a web-based tool that makes it more convenient to access and manage our databases. It's possible to run pgAdmin as a container along with the Postgres container, but both containers will have to be in the same virtual network so that they can find each other.

Let's create a virtual Docker network called pg-network:

docker network create pg-network

You can remove the network later with the command docker network rm pg-network . You can look at the existing networks with docker network ls .

We will now re-run our Postgres container with the added network name and a container name, so that the pgAdmin container can find it (we'll use pg-database for the container name):

docker run -it \
    -e POSTGRES_USER="root" \
    -e POSTGRES_PASSWORD="root" \
    -e POSTGRES_DB="ny_taxi" \
    -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
    -p 5432:5432 \
    --network=pg-network \
    --name pg-database \
    postgres:13

We will now run the pgAdmin container on another terminal:

docker run -it \
    -e PGADMIN_DEFAULT_EMAIL="admin@admin.com" \
    -e PGADMIN_DEFAULT_PASSWORD="root" \
    -p 8080:80 \
    --network=pg-network \
    --name pgadmin \
    dpage/pgadmin4
  • The container needs 2 environment variables: a login email and a password. We use admin@admin.com and root in this example.
  • IMPORTANT: these are example values for testing and should never be used in production. Change them accordingly when needed.
  • pgAdmin is a web app and its default port is 80; we map it to 8080 in our localhost to avoid any possible conflicts.
  • Just like with the Postgres container, we specify a network and a name. However, the name in this example isn't really necessary because there won't be any containers trying to access this particular container.
  • The actual image name is dpage/pgadmin4 .

You should now be able to load pgAdmin on a web browser by browsing to localhost:8080. Use the same email and password you used for running the container to log in.

Right-click on Servers on the left sidebar and select Create > Server...

steps

Under General give the Server a name and under Connection add the same host name (the container name, pg-database), user and password you used when running the container.

steps steps

Click on Save. You should now be connected to the database.

We will explore using pgAdmin in later lessons.

Using the ingestion script with Docker

(Video source)

We will now export the Jupyter notebook file to a regular Python script and use Docker to run it.

Exporting and testing the script

You can export the ipynb file to py with this command:

jupyter nbconvert --to=script upload-data.ipynb

Clean up the script by removing everything we don't need. We will also rename it to ingest_data.py and add a few modifications:

  • We will use argparse to handle the following command line arguments:
    • Username
    • Password
    • Host
    • Port
    • Database name
    • Table name
    • URL for the CSV file
  • The engine we created for connecting to Postgres will be tweaked so that we pass the parameters and build the URL from them, like this:
    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
  • We will also download the CSV using the provided URL argument.
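A skeleton of the resulting ingest_data.py could look like the sketch below; it is illustrative only (the completed script linked below is the reference) and it assumes wget is available for downloading the file.

import argparse
import os

import pandas as pd
from sqlalchemy import create_engine


def main(params):
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    table_name = params.table_name
    url = params.url
    csv_name = 'output.csv'

    # download the CSV file (assumes wget is installed)
    os.system(f'wget {url} -O {csv_name}')

    # build the connection URL from the parameters and write the data in chunks
    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')
    df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100000)
    for i, df in enumerate(df_iter):
        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
        # replace any pre-existing table on the first chunk, append afterwards
        df.to_sql(name=table_name, con=engine, if_exists='replace' if i == 0 else 'append')


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Ingest CSV data to Postgres')
    parser.add_argument('--user', help='username for Postgres')
    parser.add_argument('--password', help='password for Postgres')
    parser.add_argument('--host', help='host for Postgres')
    parser.add_argument('--port', help='port for Postgres')
    parser.add_argument('--db', help='database name for Postgres')
    parser.add_argument('--table_name', help='name of the table to write the results to')
    parser.add_argument('--url', help='URL of the CSV file')
    main(parser.parse_args())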

You can check the completed ingest_data.py script in this link.

In order to test the script we will have to drop the table we previously created. In pgAdmin, in the sidebar navigate to Servers > Docker localhost > Databases > ny_taxi > Schemas > public > Tables > yellow_taxi_data, right click on yellow_taxi_data and select Query tool. Run the following command:

DROP TABLE yellow_taxi_data;

We are now ready to test the script with the following command:

python ingest_data.py \
    --user=root \
    --password=root \
    --host=localhost \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --url="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
  • Note that we've changed the table name from yellow_taxi_data to yellow_taxi_trips.

Back in pgAdmin, refresh the Tables and check that yellow_taxi_trips was created. You can also run a SQL query to check the contents:

SELECT
    COUNT(1)
FROM
    yellow_taxi_trips;
  • This query should return 1,369,765 rows.

Dockerizing the script

Let's modify the Dockerfile we created before to include our ingest_data.py script and create a new image:

FROM python:3.9.1

# We need to install wget to download the csv file
RUN apt-get update && apt-get install -y wget
# psycopg2 is a postgres db adapter for python: sqlalchemy needs it
RUN pip install pandas sqlalchemy psycopg2

WORKDIR /app
COPY ingest_data.py ingest_data.py 

ENTRYPOINT [ "python", "ingest_data.py" ]

Build the image:

docker build -t taxi_ingest:v001 .

And run it:

docker run -it \
    --network=pg-network \
    taxi_ingest:v001 \
    --user=root \
    --password=root \
    --host=pg-database \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --url="https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2021-01.csv"
  • We need to provide the network for Docker to find the Postgres container. It goes before the name of the image.
  • Since Postgres is running on a separate container, the host argument will have to point to the container name of Postgres.
  • You can drop the table in pgAdmin beforehand if you want, but the script will automatically replace the pre-existing table.

Running Postgres and pgAdmin with Docker-compose

(Video source)

docker-compose allows us to launch multiple containers using a single configuration file, so that we don't have to run multiple complex docker run commands separately.

Docker compose makes use of YAML files. Here's the docker-compose.yaml file for running the Postgres and pgAdmin containers:

services:
  pgdatabase:
    image: postgres:13
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=admin@admin.com
      - PGADMIN_DEFAULT_PASSWORD=root
    volumes:
      - "./data_pgadmin:/var/lib/pgadmin"
    ports:
      - "8080:80"
  • We don't have to specify a network because docker-compose takes care of it: every single container (or "service", as the file states) will run within the same network and will be able to find each other according to their names (pgdatabase and pgadmin in this example).
  • We've added a volume for pgAdmin to save its settings, so that you don't have to keep re-creating the connection to Postgres every time you rerun the container. Make sure you create a data_pgadmin directory in your work folder where you run docker-compose from.
  • All other details from the docker run commands (environment variables, volumes and ports) are mentioned accordingly in the file following YAML syntax.

We can now run Docker compose by running the following command from the same directory where docker-compose.yaml is found. Make sure that all previous containers aren't running anymore:

docker-compose up

Note: this command assumes that the ny_taxi_postgres_data folder used for mounting the volume is in the same directory as docker-compose.yaml.

Since the settings for pgAdmin were stored within the container and we have killed the previous one, you will have to re-create the connection by following the steps in this section.

You will have to press Ctrl+C in order to shut down the containers. The proper way of shutting them down is with this command:

docker-compose down

And if you want to run the containers again in the background rather than in the foreground (thus freeing up your terminal), you can run them in detached mode:

docker-compose up -d

If you want to re-run the dockerized ingest script when you run Postgres and pgAdmin with docker-compose, you will have to find the name of the virtual network that Docker compose created for the containers. You can use the command docker network ls to find it and then change the docker run command for the dockerized script to include the network name.

SQL refresher

Below are a series of SQL query examples to remember how SQL works. For this example we'll assume that we're working with 2 tables named trips (list of all yellow taxi trips of NYC for January 2021) and zones (list of zone IDs for pick ups and drop offs).

Check the homework for the session to learn about the zones table.

For a more detailed look into SQL, check out this article.

SELECT
    *
FROM
    trips
LIMIT 100;
  • Selects all rows in the trips table. If there are more than 100 rows, select only the first 100.
SELECT
    *
FROM
    trips t,
    zones zpu,
    zones zdo
WHERE
    t."PULocationID" = zpu."LocationID" AND
    t."DOLocationID" = zdo."LocationID"
LIMIT 100;
  • Selects all rows in the trips table. If there are more than 100 rows, select only the first 100.
  • We give aliases to the trips and zones tables for easier access.
  • We join each trip's PULocationID and DOLocationID to the zones table's LocationID, so that the actual zone info for pick ups and drop offs is available alongside each trip.
  • We use double quotes ("") for the column names because in Postgres we need to use them if the column names contain capital letters.
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", '/', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", '/', zdo."Zone") AS "dropoff_loc"
FROM
    trips t,
    zones zpu,
    zones zdo
WHERE
    t."PULocationID" = zpu."LocationID" AND
    t."DOLocationID" = zdo."LocationID"
LIMIT 100;
  • Same as previous but instead of the complete rows we only display specific columns.
  • We make use of joins (implicit joins in this case) to display combined info as a single column.
    • The new "virtual" column pickup_loc contains the values of both Borough and Zone columns of the zones table, separated by a slash (/).
    • Same for dropoff_loc.
  • More specifically this is an inner join, because we only select the rows that overlap between the 2 tables.
  • Learn more about SQL joins here and here.
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", '/', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", '/', zdo."Zone") AS "dropoff_loc"
FROM
    trips t JOIN zones zpu
        ON t."PULocationID" = zpu."LocationID"
    JOIN zones zdo
        ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
  • Exactly the same statement as before but rewritten using explicit JOIN keywords.
    • Explicit inner joins are preferred over implicit inner joins.
  • The JOIN ... ON clause is used after the FROM statement rather than as a WHERE condition; the WHERE clause from the previous query is no longer needed.
    SELECT whatever_columns FROM table_1 JOIN table_2_with_a_matching_column ON column_from_1=column_from_2
  • You can also use the keyword INNER JOIN for clarity.
  • Learn more about SQL joins here and here.
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    trips t
WHERE
    "PULocationID" is NULL
LIMIT 100;
  • Selects rows from the trips table whose pick up location is null and displays specific columns.
  • If you have not modified the original tables, this query should return an empty list.
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    "PULocationID",
    "DOLocationID"
FROM
    trips t
WHERE
    "DOLocationID" NOT IN (
        SELECT "LocationID" FROM zones
    )
LIMIT 100;
  • Selects rows from the trips table whose drop off location ID does not appear in the zones table.
  • If you did not modify any rows in the original datasets, the query would return an empty list.
DELETE FROM zones WHERE "LocationID" = 142;
  • Deletes all rows in the zones table with LocationID of 142.
  • If we were to run this query and then run the previous query, we would get a list of rows with DOLocationID of 142.
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    total_amount,
    CONCAT(zpu."Borough", '/', zpu."Zone") AS "pickup_loc",
    CONCAT(zdo."Borough", '/', zdo."Zone") AS "dropoff_loc"
FROM
    trips t LEFT JOIN zones zpu
        ON t."PULocationID" = zpu."LocationID"
    LEFT JOIN zones zdo
        ON t."DOLocationID" = zdo."LocationID"
LIMIT 100;
  • Similar to the join query from before but we use a left join instead.
  • Left joins show all rows from the "left" table of the statement but only the rows from the "right" table that match the "left" one, thus the name.
  • This join is useful if we deleted one of the LocationID rows like before. The inner join would omit some rows from the trips table, but this query will show all rows. However, since one ID is missing, the "virtual" columns we defined to transform location ID's to actual names will appear with empty strings if the query cannot find the location ID.
  • Learn more about SQL joins here and here.
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    DATE_TRUNC('DAY', tpep_pickup_datetime),
    total_amount
FROM
    trips t
LIMIT 100;
  • Selects all rows from the trips table but displays specific columns.
  • DATE_TRUNC is a function that truncates a timestamp. When using DAY as a parameter, it removes any smaller values (hours, minutes, seconds) and displays them as 00:00:00 instead.
SELECT
    tpep_pickup_datetime,
    tpep_dropoff_datetime,
    CAST(tpep_pickup_datetime AS DATE) as "day",
    total_amount
FROM
    trips t
LIMIT 100;
  • Very similar to the previous query, but instead it casts the TIMESTAMP type to DATE, so that the hours:minutes:seconds info is completely omitted rather than shown as 00:00:00. The column will be displayed under the name day.
SELECT
    CAST(tpep_pickup_datetime AS DATE) as "day",
    COUNT(1)
FROM
    trips t
GROUP BY
    CAST(tpep_pickup_datetime AS DATE)
ORDER BY "day" ASC;
  • Counts the number of records in the trips table grouped by day.
  • We remove the limit of 100 records because we do not want to restrict the amount of info on screen.
  • Grouping does not guarantee order, so we enforce that the rows will be displayed in ascending order from earliest to latest day.
SELECT
    CAST(tpep_pickup_datetime AS DATE) as "day",
    COUNT(1) as "count",
    MAX(total_amount),
    MAX(passenger_count)
FROM
    trips t
GROUP BY
    CAST(tpep_pickup_datetime AS DATE)
ORDER BY "count" DESC;
  • Similar to the previous query but orders the rows by count and displays them in descending order, so that the day with the highest amount of trips is shown first.
  • We also show the maximum amount that a driver earned in a trip for that day and the maximum passenger count on a single trip for that day.
SELECT
    CAST(tpep_pickup_datetime AS DATE) as "day",
    "DOLocationID",
    COUNT(1) as "count",
    MAX(total_amount),
    MAX(passenger_count)
FROM
    trips t
GROUP BY
    1, 2
ORDER BY "count" DESC;
  • Similar to previous but we also include the drop off location column and we group by it as well, so that each row contains the amount of trips for that location by day.
  • Instead of having to repeat the same line in both the SELECT and GROUP BY parts, we can simply indicate the arguments we use after the SELECT keyword by order number.
    • SQL is 1-indexed. The first argument is 1, not 0.
SELECT
    CAST(tpep_pickup_datetime AS DATE) as "day",
    "DOLocationID",
    COUNT(1) as "count",
    MAX(total_amount),
    MAX(passenger_count)
FROM
    trips t
GROUP BY
    1, 2
ORDER BY
    "day" ASC,
    "DOLocationID" ASC;
  • Similar to the previous query, but now ordered first by day and then by drop off location ID, both in ascending order.

As a final note, SQL commands can be categorized into the following categories:

  • DDL: Data Definition Language.
    • Define the database schema (create, modify, destroy)
    • CREATE, DROP, ALTER, TRUNCATE, COMMENT, RENAME
  • DQL: Data Query Language.
    • Perform queries on the data within schema objects. Get data from the database and impose order upon it.
    • SELECT
  • DML: Data Manipulation Language.
    • Manipulates data present in the database.
    • INSERT, UPDATE, DELETE, LOCK...
  • DCL: Data Control Language.
    • Rights, permissions and other controls of the database system.
    • Usually grouped with DML commands.
    • GRANT, REVOKE
  • TCL: Transaction Control Language.
    • Transactions within the database.
    • Not a universally considered category.
    • COMMIT, ROLLBACK, SAVEPOINT, SET TRANSACTION

Back to the top

Terraform and Google Cloud Platform

(Video source)

Terraform is an infrastructure as code tool that allows us to provision infrastructure resources as code, thus making it possible to handle infrastructure as an additional software component and take advantage of tools such as version control. It also allows us to bypass the cloud vendor GUIs.

During this course we will use Google Cloud Platform (GCP) as our cloud services provider.

GCP initial setup

(Video source)

GCP is organized around projects. You may create a project and access all available GCP resources and services from the project dashboard.

We will now create a project and a service account, and we will download the authentication keys to our computer. A service account is like a user account but for apps and workloads; you may authorize or limit what resources are available to your apps with service accounts.

You can jump to the next section if you already know how to do this.

Please follow these steps:

  1. Create an account on GCP. You should receive $300 in credit when signing up on GCP for the first time with an account.
  2. Setup a new project and write down the Project ID.
    1. From the GCP Dashboard, click on the drop down menu next to the Google Cloud Platform title to show the project list and click on New project.
    2. Give the project a name. We will use dtc-de in this example. You can use the autogenerated Project ID (this ID must be unique to all of GCP, not just your account). Leave the organization as No organization. Click on Create.
    3. Back on the dashboard, make sure that your project is selected. Click on the previous drop down menu to select it otherwise.
  3. Setup a service account for this project and download the JSON authentication key files.
    1. IAM & Admin > Service accounts > Create service account
    2. Provide a service account name. We will use dtc-de-user. Leave all other fields with the default values. Click on Create and continue.
    3. Grant the Viewer role (Basic > Viewer) to the service account and click on Continue
    4. There is no need to grant users access to this service account at the moment. Click on Done.
    5. With the service account created, click on the 3 dots below Actions and select Manage keys.
    6. Add key > Create new key. Select JSON and click Create. The files will be downloaded to your computer. Save them to a folder and write down the path.
  4. Download the GCP SDK for local setup. Follow the instructions to install and connect to your account and project.
  5. Set the environment variable to point to the auth keys.
    1. The environment variable name is GOOGLE_APPLICATION_CREDENTIALS
    2. The value for the variable is the path to the json authentication file you downloaded previously.
    3. Check how to assign environment variables in your system and shell. In bash, the command should be:
      export GOOGLE_APPLICATION_CREDENTIALS="<path/to/authkeys>.json"
    4. Refresh the token and verify the authentication with the GCP SDK:
      gcloud auth application-default login

You should now be ready to work with GCP.

GCP setup for access

(Video source)

In the following chapters we will set up a Data Lake on Google Cloud Storage and a Data Warehouse in BigQuery. We will explore these concepts in future lessons, but in short: a Data Lake is where we would usually store data and a Data Warehouse provides a more structured way to access this data.

We need to set up access first by assigning the Storage Admin, Storage Object Admin, BigQuery Admin and Viewer IAM roles to the Service Account, and then enabling the iam and iamcredentials APIs for our project.

Please follow these steps:

  1. Assign the following IAM Roles to the Service Account: Storage Admin, Storage Object Admin, BigQuery Admin and Viewer.
    1. On the GCP Project dashboard, go to IAM & Admin > IAM
    2. Select the previously created Service Account and edit the permissions by clicking on the pencil shaped icon on the left.
    3. Add the following roles and click on Save afterwards:
      • Storage Admin: for creating and managing buckets.
      • Storage Object Admin: for creating and managing objects within the buckets.
      • BigQuery Admin: for managing BigQuery resources and data.
      • Viewer should already be present as a role.
  2. Enable the following APIs for the project (these are needed so that Terraform can interact with GCP): the Identity and Access Management (IAM) API and the IAM Service Account Credentials API.
  3. Make sure that the GOOGLE_APPLICATION_CREDENTIALS environment variable is set.

Terraform basics

(Video source)

There are 2 important components to Terraform: the code files and Terraform commands.

The set of files used to describe infrastructure in Terraform is known as a Terraform configuration. Terraform configuration files end in .tf for files written in the Terraform language or .tf.json for JSON files. A Terraform configuration must be in its own working directory; you cannot have 2 or more separate configurations in the same folder.

Here's a basic main.tf file written in the Terraform language with all of the necessary info to describe basic infrastructure:

terraform {
  required_providers {
    google = {
      source = "hashicorp/google"
      version = "3.5.0"
    }
  }
}

provider "google" {
  credentials = file("<NAME>.json")

  project = "<PROJECT_ID>"
  region  = "us-central1"
  zone    = "us-central1-c"
}

resource "google_compute_network" "vpc_network" {
  name = "terraform-network"
}
  • Terraform divides information into blocks, which are defined within braces ({}), similar to Java or C++. However, unlike those languages, statements do not end with a semicolon (;); line breaks are used instead.
  • By convention, arguments with single-line values in the same nesting level have their equal signs (=) aligned for easier reading.
  • There are 3 main blocks: terraform, provider and resource. There must only be a single terraform block but there may be multiple provider and resource blocks.
  • The terraform block contains settings:
    • The required_providers sub-block specifies the providers required by the configuration. In this example there's only a single provider which we've called google.
      • A provider is a plugin that Terraform uses to create and manage resources.
      • Each provider needs a source in order to install the right plugin. By default the official Terraform Registry (registry.terraform.io) is used, in a similar way to Docker Hub for Docker images.
        • hashicorp/google is short for registry.terraform.io/hashicorp/google .
      • Optionally, a provider can have an enforced version. If this is not specified the latest version will be used by default, which could introduce breaking changes in some rare cases.
    • We'll see other settings to use in this block later.
  • The provider block configures a specific provider. Since we only have a single provider, there's only a single provider block for the google provider.
    • The contents of a provider block are provider-specific. The contents in this example are meant for GCP but may be different for AWS or Azure.
    • Some of the variables seen in this example, such as credentials or zone, can be provided by other means which we'll cover later.
  • The resource blocks define the actual components of our infrastructure. In this example we have a single resource.
    • resource blocks have 2 strings before the block: the resource type and the resource name. Together they create the resource ID in the shape of type.name.
    • About resource types:
      • The first prefix of the resource type maps to the name of the provider. For example, the resource type google_compute_network has the prefix google and thus maps to the provider google.
      • The resource types are defined in the Terraform documentation and refer to resources that cloud providers offer. In our example google_compute_network (Terraform documentation link) refers to GCP's Virtual Private Cloud service.
    • Resource names are the internal names that we use in our Terraform configurations to refer to each resource and have no impact on the actual infrastructure.
    • The contents of a resource block are specific to the resource type. Check the Terraform docs to see a list of resource types by provider.
      • In this example, the google_compute_network resource type has a single mandatory argument called name, which is the name that the resource will have within GCP's infrastructure.
        • Do not confuse the resource name with the name argument!

Besides these 3 blocks, there are additional available blocks:

  • Input variables block types are useful for customizing aspects of other blocks without altering the other blocks' source code. They are often referred to as simply variables. They are passed at runtime.
    variable "region" {
        description = "Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations"
        default = "europe-west6"
        type = string
    }
    • Description:
      • An input variable block starts with the type variable followed by a name of our choosing.
      • The block may contain a number of fields. In this example we use the fields description, type and default.
      • description contains a simple description for documentation purposes.
      • type specifies the accepted value types for the variable
      • If the default field is defined, the variable becomes optional because a default value is already provided by this field. Otherwise, a value must be provided when running the Terraform configuration.
      • For additional fields, check the Terraform docs.
    • Variables must be accessed with the keyword var. and then the name of the variable.
    • In our main.tf file above, we could access this variable inside the google provider block with this line:
      region = var.region
  • Local values block types behave more like constants.
    locals {
        region  = "us-central1"
        zone    = "us-central1-c"
    }
    • Description:
      • Local values may be grouped in one or more blocks of type locals. Local values are often grouped according to usage.
      • Local values are simpler to declare than input variables because they are only a key-value pair.
    • Local values must be accessed with the word local (mind the lack of s at the end!).
      region = local.region
      zone = local.zone

With a configuration in place, you are now ready to create your infrastructure. There are a number of commands to run in order:

  • terraform init : initialize your work directory by downloading the necessary providers/plugins.
  • terraform fmt (optional): formats your configuration files so that the format is consistent.
  • terraform validate (optional): returns a success message if the configuration is valid and no errors are apparent.
  • terraform plan : creates a preview of the changes to be applied against a remote state, allowing you to review the changes before applying them.
  • terraform apply : applies the changes to the infrastructure.
  • terraform destroy : removes your stack from the infrastructure.

Creating GCP infrastructure with Terraform

(Video source)

We will now create a new main.tf file as well as an auxiliary variables.tf file with all the blocks we will need for our project.

The infrastructure we will need consists of a Cloud Storage Bucket (google_storage_bucket) for our Data Lake and a BigQuery Dataset (google_bigquery_dataset).

In main.tf we will configure the terraform block as follows:

terraform {
  required_version = ">= 1.0"
  backend "local" {}
  required_providers {
    google = {
      source  = "hashicorp/google"
    }
  }
}
  • The required_version field states the minimum Terraform version to be used.
  • The backend field states where we'd like to store the state of the infrastructure. local means that we'll store it locally in our computers. Alternatively, you could store the state online.

The provider will not make use of the credentials field because when we set up GCP access we already created a GOOGLE_APPLICATION_CREDENTIALS env-var which Terraform can read in order to get our authentication keys.

In the variables.tf we will store variables that may change depending on your needs and location. The ones to note are:

  • region may vary depending on your geographical location; change it according to your needs.
  • BQ_DATASET has the name of the dataset for BigQuery. You may leave it as it is or change it to fit your needs.
  • project is the Project ID of your project in GCP. Since the ID is unique, it is good practice to have Terraform ask for it every time in case the same code is applied on different projects.
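As an illustration, a variables.tf along these lines would cover the three variables above; the defaults shown here are examples, so refer to the variables.tf linked below for the actual values.

variable "project" {
  description = "Your GCP Project ID"
  # no default: Terraform will ask for a value every time it runs
}

variable "region" {
  description = "Region for GCP resources. Choose as per your location: https://cloud.google.com/about/locations"
  default = "europe-west6"
  type = string
}

variable "BQ_DATASET" {
  description = "BigQuery dataset that raw data will be written to"
  type = string
  default = "trips_data_all"
}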

You may access main.tf from this link and variables.tf from this link. Take a look at them to understand the details of the implementation. Copy them to a new folder within your work directory so that the subfolder only contains the Terraform configuration files. Now run the following commands:

terraform init

This will download the necessary plugins to connect to GCP and store them in ./.terraform. Now let's plan the infrastructure:

terraform plan

Terraform will ask for your Project ID. Type it and press enter to let Terraform access GCP and figure out what to do. The infrastructure plan will be printed on screen with all the planned changes marked with a + sign next to them.

Let's apply the changes:

terraform apply

You will need to confirm this step by typing yes when prompted. This will create all the necessary components in the infrastructure and generate a terraform.tfstate file with the current state of the infrastructure.

After you've successfully created the infrastructure, you may destroy it so that it doesn't consume credit unnecessarily:

terraform destroy

Once again, you will have to confirm this step by typing yes when prompted. This will remove your complete stack from the cloud, so only use it when you're 100% sure of it.

Back to the top

Extra content

Setting up a development environment in a Google Cloud VM

If you cannot set up a local development environment, you may use part of the $300 credits of GCP to create a Cloud VM and access it via SSH to set up the environment there.

Follow the instructions in this video.

Port mapping and networks in Docker

If you're having issues with Docker and networking (especially if you already have Postgres running locally on your host computer), a video guide is also available.

Back to the top

Back to Index

Next: Data Ingestion