[ETL-616] Implement Great Expectations to run on parquet data #139

rxu17 · 2024-09-06T02:39:58Z

Purpose:

This draft PR adds the Great Expectations (GX) on Parquet Glue jobs to the Recover ETL workflow. When the JSON to Parquet workflow finishes, one GX job runs per data type.

Currently, expectations are defined only for the fitbitdailydata and healthkitv2workouts datasets. The GX jobs for all other data types will error out.
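To make the "error out" behavior concrete, here is a minimal sketch of a per-data-type lookup that fails when no expectations exist. Only the two data type names come from this PR. The dictionary layout, the column names, and the function name `get_expectations_for` are hypothetical, and the `expectation_type`/`kwargs` shape is assumed to match how GX 0.x releases serialize expectations.

```python
# Hypothetical in-memory stand-in for the validation suite; the real file
# layout used by this PR may differ.
SAMPLE_SUITE = {
    "fitbitdailydata": [
        {
            "expectation_type": "expect_column_values_to_be_between",
            # Column name and bounds are illustrative assumptions.
            "kwargs": {"column": "Calories", "min_value": 0, "max_value": 20000},
        }
    ],
    "healthkitv2workouts": [
        {
            "expectation_type": "expect_column_to_exist",
            # Column name is an illustrative assumption.
            "kwargs": {"column": "HealthKitWorkoutKey"},
        }
    ],
}


def get_expectations_for(data_type: str, suite: dict = SAMPLE_SUITE) -> list:
    """Return the expectation configs for a data type, or fail loudly."""
    if data_type not in suite:
        # Mirrors the described behavior: jobs for unsupported types error out.
        raise ValueError(f"No expectations defined for data type: {data_type}")
    return suite[data_type]
```

Raising early keeps an unsupported data type from silently producing an empty report.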

Changes:

Highlights of the big changes:

New Code:

run_great_expectations_on_parquet.py : This is the script run by the GX on Parquet jobs. It contains some workarounds, notably in the add_data_docs_sites and add_validation_results_to_store functions. Because we use an EphemeralDataContext context object, these functions let us add validation results to the validation store and have the data docs (our GX report) render them without having to create checkpoints, a GX config file (which would likely force us to conform to a specific GX repo structure), etc. If we prefer to switch to a more persistent data context object like FileDataContext, that could be explored further in this ticket.
data values validation suite: This is where we will manually add our expectations in the future. I find this the easiest approach because we already have to look through the expectations gallery to find the expectation we want to add, and it gives us a list to validate our outputted set of expectations against if we want to.
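As a rough illustration of validating the outputted expectations against the manually maintained list, the sketch below reports which expected entries are missing from what a run produced. The helper names and the sample column names are hypothetical, and the `expectation_type`/`kwargs` keys are an assumption about the serialized format, not something confirmed by this PR.

```python
def expectation_keys(configs: list) -> set:
    """Reduce expectation configs to (expectation_type, column) pairs."""
    return {
        (config["expectation_type"], config.get("kwargs", {}).get("column"))
        for config in configs
    }


def missing_expectations(expected: list, produced: list) -> set:
    """Return the pairs listed in the suite but absent from the produced set."""
    return expectation_keys(expected) - expectation_keys(produced)


# Illustrative data only; column names are assumptions.
expected = [
    {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "Calories"}},
    {"expectation_type": "expect_column_values_to_not_be_null", "kwargs": {"column": "Date"}},
]
produced = [
    {"expectation_type": "expect_column_to_exist", "kwargs": {"column": "Calories"}},
]
missing = missing_expectations(expected, produced)
```

An empty result would mean every manually listed expectation showed up in the output.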

Changes to old Code:

Updated Glue workflow triggers : Because a trigger is limited to 50 jobs, I split the previous trigger into two: one for the CompareParquet jobs and one for the new GX on Parquet jobs.
Pinned dependency urllib3<2 as part of the additional Python modules: There is a compatibility issue between boto3 and urllib3 2.0, so we need to pin urllib3<2.
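The trigger split can be sketched as a small planning function: one trigger per job set, with a further split if a single set ever exceeds the limit. The 50-job limit is the figure cited above. The trigger naming scheme, the job names, and the count of 30 data types are placeholders, not values from this PR.

```python
MAX_JOBS_PER_TRIGGER = 50  # limit cited in the PR description


def plan_triggers(job_sets: dict, limit: int = MAX_JOBS_PER_TRIGGER) -> dict:
    """Map hypothetical trigger names to job lists, one trigger per job set.

    A job set larger than `limit` is split into numbered chunks.
    """
    triggers = {}
    for set_name, jobs in job_sets.items():
        chunks = [jobs[i:i + limit] for i in range(0, len(jobs), limit)]
        for index, chunk in enumerate(chunks, start=1):
            # Only number the triggers when a set needed more than one.
            suffix = f"-{index}" if len(chunks) > 1 else ""
            triggers[f"{set_name}-trigger{suffix}"] = chunk
    return triggers


# Placeholder data type names; the real list lives in the ETL config.
data_types = [f"datatype{n}" for n in range(30)]
plan = plan_triggers({
    "CompareParquet": [f"compare-{d}" for d in data_types],
    "GXOnParquet": [f"gx-{d}" for d in data_types],
})
for name, jobs in plan.items():
    print(name, len(jobs))
```

With these placeholder inputs each trigger holds 30 jobs, whereas a single combined trigger would hold 60 and exceed the limit.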

Tests:

Integration testing in AWS (currently running)
Unit tests
Tests that the script can produce reports from the sample validation suite in the shareable artifacts bucket

Viewable reports at (with AWS VPN turned on):

Sample screenshots of a report:

EDIT: Also added the tests to our CI/CD, since previously not all of them ran automatically each time.

sonarcloud · 2024-09-12T17:48:29Z

Quality Gate passed

Issues
42 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

rxu17 added 12 commits September 4, 2024 13:59

initial commit for testing

ac378fd

update sample expectations

35c8422

add two data types

f42717e

correct to fitbitdailydata

5483162

fix expectation

4610016

add complete script

c4b85f4

initial cf config and template

9d1a0d5

correct formatting, refactor triggers

6ae0e24

fix job name

26e31e9

refactor gx code, add tests, adjust gx version

8e45790

refactor gx code, add tests, adjust gx version

2ad2f61

make consistent naming

3cd0422

rxu17 temporarily deployed to develop September 6, 2024 02:47 — with GitHub Actions Inactive

remove hardcoded args

2dae40e

rxu17 temporarily deployed to develop September 6, 2024 02:50 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 6, 2024 02:53 — with GitHub Actions Inactive

rxu17 had a problem deploying to develop September 6, 2024 02:53 — with GitHub Actions Failure

add integration tests, remove null rows code, add dep for urllib3<2

6a07092

rxu17 temporarily deployed to develop September 6, 2024 04:18 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 6, 2024 04:21 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 11, 2024 19:46 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 11, 2024 19:49 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 11, 2024 19:56 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 11, 2024 19:58 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 11, 2024 19:59 — with GitHub Actions Inactive

add gx glue version as var in config

0ec67b1

rxu17 temporarily deployed to develop September 12, 2024 06:52 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 12, 2024 06:55 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 12, 2024 07:01 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 12, 2024 07:03 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 12, 2024 07:04 — with GitHub Actions Inactive

merge conflicts

4a2c9f9

rxu17 temporarily deployed to develop September 12, 2024 17:48 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 12, 2024 17:51 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 12, 2024 17:57 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 12, 2024 18:00 — with GitHub Actions Inactive

rxu17 temporarily deployed to develop September 12, 2024 18:01 — with GitHub Actions Inactive

rxu17 merged commit 30d1873 into main Sep 13, 2024
18 checks passed

rxu17 deleted the etl-616 branch September 13, 2024 08:02
