In this hands-on workshop, we’ll learn how to build data ingestion pipelines.
We’ll cover the following steps:
- Extracting data from APIs or files
- Normalizing and loading data
- Incremental loading
By the end of this workshop, you’ll be able to write data pipelines like a senior data engineer: quickly and concisely, in a way that is scalable and self-maintaining.
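As a taste of what we will build, here is a minimal sketch of a dlt pipeline that takes rows from a Python generator and loads them into DuckDB. The pipeline name, dataset name, and sample rows are illustrative placeholders, not part of the workshop materials.

```python
import dlt

# A plain generator is enough as a data source: dlt iterates it lazily,
# infers the schema, and normalizes nested fields automatically.
def people():
    yield {"id": 1, "name": "Anna", "age": 30}
    yield {"id": 2, "name": "Bob", "age": 42}

# Pipeline and dataset names here are illustrative placeholders.
pipeline = dlt.pipeline(
    pipeline_name="workshop_demo",
    destination="duckdb",
    dataset_name="raw_data",
)

# Load the rows into a "people" table; dlt handles typing and table creation.
load_info = pipeline.run(people(), table_name="people")
print(load_info)
```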
If you are not taking the course and only want to attend the workshop, sign up here: https://lu.ma/wupfy6dd
- Website and community: visit our docs and discuss on our Slack (link at the top of the docs).
- Course Colab: Notebook.
- dltHub community Slack.
Welcome to the DataTalksClub Data Engineering Zoomcamp data ingestion workshop.
- My name is Adrian, and I have worked in the data field since 2012.
- I have built many data warehouses, some data lakes, and a few data teams.
- Ten years into my career I started working on dlt (“data load tool”), an open source library that enables data engineers to build pipelines faster and better.
- I started working on dlt because data engineering is one of the few areas of software engineering where we do not have developer tools to do our work.
- Building better pipelines requires more code reuse - we cannot all build perfect pipelines from scratch every time.
- And so dlt was born: a library that automates the tedious parts of data ingestion - loading, schema management, data type detection, scalability, self-healing, scalable extraction… you get the idea - essentially a data engineer’s “one-stop shop” for best-practice data pipelining.
- Thanks to its ease of use, dlt enables even non-specialists to:
- Build pipelines 5-10x faster than without it
- Build self-healing, self-maintaining pipelines with all the best practices of data engineering. Automating schema changes removes the bulk of maintenance effort.
- Govern your pipelines with schema evolution alerts and data contracts.
- and generally develop pipelines like a senior, commercial data engineer.
You can find the course file here. The course has three parts:
- Extraction section: here we will learn about scalable extraction (see the sketch after this list)
- Normalisation section: here we will learn how to prepare data for loading
- Loading section: here we will learn about incremental loading modes
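As a preview of the extraction section, here is a minimal sketch of scalable extraction: a generator streams rows from a paginated API one page at a time, so memory use stays flat regardless of how much data the source holds. The endpoint URL and paging parameters are made-up placeholders.

```python
import dlt
import requests

# Made-up endpoint used only for illustration.
BASE_URL = "https://example.com/api/people"

def paginated_people():
    """Yield rows page by page instead of materializing the whole dataset."""
    page = 1
    while True:
        response = requests.get(BASE_URL, params={"page": page})
        response.raise_for_status()
        rows = response.json()
        if not rows:
            break
        yield from rows
        page += 1

pipeline = dlt.pipeline(
    pipeline_name="extraction_demo",
    destination="duckdb",
    dataset_name="raw_data",
)

# dlt consumes the generator lazily and builds the schema as rows arrive.
pipeline.run(paginated_people(), table_name="people")
```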
The linked Colab notebook offers a few exercises to practice what you learned today.
Question 1: What is the sum of the outputs of the generator for limit = 5?
- A: 10.23433234744176
- B: 7.892332347441762
- C: 8.382332347441762
- D: 9.123332347441762
Question 2: What is the 13th number yielded by the generator?
- A: 4.236551275463989
- B: 3.605551275463989
- C: 2.345551275463989
- D: 5.678551275463989
Question 3: Append the 2 generators. After correctly appending the data, calculate the sum of all ages of people.
- A: 353
- B: 365
- C: 378
- D: 390
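For Question 3, an append load can be sketched with dlt's write_disposition="append": rows from the second generator are simply added after the first, so people present in both end up in the table twice. The two generators below are placeholders that only mimic the shape of the ones in the notebook.

```python
import dlt

# Placeholder generators - the real ones live in the course notebook.
def people_1():
    for i in range(1, 6):
        yield {"ID": i, "Name": f"Person_{i}", "Age": 25 + i}

def people_2():
    for i in range(3, 9):
        yield {"ID": i, "Name": f"Person_{i}", "Age": 30 + i}

pipeline = dlt.pipeline(
    pipeline_name="append_demo",
    destination="duckdb",
    dataset_name="quiz",
)

# Two append loads into the same table: nothing is deduplicated.
pipeline.run(people_1(), table_name="people", write_disposition="append")
pipeline.run(people_2(), table_name="people", write_disposition="append")

# Sum the ages of everyone loaded (dlt lowercases column names).
with pipeline.sql_client() as client:
    print(client.execute_sql("SELECT SUM(age) FROM people"))
```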
Question 4: Merge the 2 generators using the ID column. Calculate the sum of ages of all the people loaded as described above.
- A: 205
- B: 213
- C: 221
- D: 230
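For Question 4, a merge load deduplicates on a key: loading the second generator with write_disposition="merge" and primary_key="ID" overwrites rows that share an ID, so overlapping people are counted only once. Again, the generators are placeholders mirroring the notebook's.

```python
import dlt

# Placeholder generators with the same shape as in the previous sketch.
def people_1():
    for i in range(1, 6):
        yield {"ID": i, "Name": f"Person_{i}", "Age": 25 + i}

def people_2():
    for i in range(3, 9):
        yield {"ID": i, "Name": f"Person_{i}", "Age": 30 + i}

pipeline = dlt.pipeline(
    pipeline_name="merge_demo",
    destination="duckdb",
    dataset_name="quiz",
)

# First load replaces the table, the second merges on the ID column:
# rows from the second load overwrite rows that share an ID.
pipeline.run(people_1(), table_name="people", write_disposition="replace", primary_key="ID")
pipeline.run(people_2(), table_name="people", write_disposition="merge", primary_key="ID")

# Sum of ages after merging - overlapping IDs are counted once.
with pipeline.sql_client() as client:
    print(client.execute_sql("SELECT SUM(age) FROM people"))
```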
As you are learning the various concepts of data engineering, consider creating a portfolio project that will further your own knowledge.
By demonstrating the ability to deliver end-to-end, you will have an easier time finding your first role. This will help regardless of whether your hiring manager reviews your project, largely because you will have a better understanding and will be able to talk the talk.
Here are some example projects that others have built with dlt:
- Serverless dlt-dbt on cloud functions: Article
- Bird finder: Part 1, Part 2
- Event ingestion on GCP: Article and repo
- Event ingestion on AWS: Article and repo
- Or see one of the many demos created by our working students: Hacker News, GA4 events, an e-commerce demo, Google Sheets, MotherDuck, MongoDB + Holistics, Deepnote, Prefect, PowerBI vs GoodData vs Metabase, Dagster, ingesting events via GCP webhooks, SAP to Snowflake replication, reading emails and sending a summary to Slack with AI and Kestra, Mode + dlt capabilities, dbt on cloud functions
If you create a personal project, consider submitting it to our blog - we will be happy to showcase it. Just drop us a line in the dlt Slack.
And don't forget, if you like dlt:
- Give us a GitHub Star!
- Join our Slack community