
[ETL-409] Create python comparison script #42

Merged
merged 19 commits into from
Apr 27, 2023

Conversation


@rxu17 rxu17 commented Apr 14, 2023

Hello! The purpose of this script is to compare the parquet datasets by data type in the established "main" namespace and the new "staging" namespace of the processed data (post-ETL) bucket.

This is just a draft PR because the tests are still in progress, need some troubleshooting, and should be grouped into classes for better organization, but most, if not all, of the code and functionality is complete. I would love feedback and organizational ideas. For example, I feel some of the functions could be moved into a common utils module, but if they are never reused in this repo, or if they are later refactored, that move might make things harder. This PR also unfortunately got pretty big because there were a bunch of edge cases and comparison-flow logic to consider when comparing two datasets.

This code contains the following main functionality:

  • Compares the data type folders present in each namespace

Does the following comparisons by dataset data type (e.g. dataset_fitbitactivitylogs), ONLY if the datasets meet the following conditions:

  • Neither dataset is empty
  • The datasets have columns in common, since a chunk of the comparisons relies on shared columns

Comparisons done:

  • Compares the two datasets' column data types
  • Compares the two datasets' row counts
  • Compares the two datasets' column names
  • Compares the two datasets' column values (once they have the same dimensions)
  • Compares the two datasets' row values (once they have the same dimensions)
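As a pandas-only illustration of the checks listed above (the function and result-key names here are mine, not the script's):

```python
import pandas as pd

def compare_datasets(staging: pd.DataFrame, main: pd.DataFrame) -> dict:
    """Illustrative sketch of the per-dataset comparisons listed above."""
    common = staging.columns.intersection(main.columns)
    result = {
        # column data types (for the columns both datasets share)
        "dtype_diffs": {
            col: (str(staging[col].dtype), str(main[col].dtype))
            for col in common
            if staging[col].dtype != main[col].dtype
        },
        # row counts
        "row_count_diff": len(staging) - len(main),
        # column names
        "cols_only_in_main": main.columns.difference(staging.columns).tolist(),
        "cols_only_in_staging": staging.columns.difference(main.columns).tolist(),
    }
    # column/row values are only compared once the dimensions match
    if staging.shape == main.shape and list(staging.columns) == list(main.columns):
        result["values_equal"] = staging.reset_index(drop=True).equals(
            main.reset_index(drop=True)
        )
    return result
```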

I added some comments about my thinking on certain code. Any Glue-related code will be added as part of ETL-406.

@rxu17 rxu17 changed the title [ETL-409] Create python comparison script [ETL-409] Create python comparison script (WIP) Apr 14, 2023

@thomasyu888 thomasyu888 left a comment


Excellent work! Please see comments.

Comment on lines 64 to 78
import pandas as pd


def get_duplicated_columns(dataset: pd.DataFrame) -> list:
    """Gets a list of duplicated column labels in a dataframe"""
    return dataset.columns[dataset.columns.duplicated()].tolist()


def has_common_cols(staging_dataset: pd.DataFrame, main_dataset: pd.DataFrame) -> bool:
    """Checks whether the two dataframes share any columns"""
    common_cols = staging_dataset.columns.intersection(main_dataset.columns).tolist()
    return common_cols != []


def get_missing_cols(staging_dataset: pd.DataFrame, main_dataset: pd.DataFrame) -> list:
    """Gets the list of columns present in main but missing from staging"""
    missing_cols = main_dataset.columns.difference(staging_dataset.columns).tolist()
    return missing_cols
Member:

I would say that if we were doing the data cleaning, this could be represented with something like https://aws.amazon.com/glue/features/databrew/. Also, since there are only three functions, we don't need to think too heavily about this here, but as data quality is important to us, we may want to look into exploring tools like:

That said, those tools may be too complicated for the current task at hand.

Contributor Author:

Just yesterday, I found that datacompy.Compare has functions that handle some of the above (e.g. it gets the columns unique to one dataset here: https://github.com/capitalone/datacompy/blob/develop/datacompy/core.py#L220-L230). It's marked as a TODO because this PR has blown up so much already, and making those changes is a bit more involved.

return compare.report()


def add_additional_msg_to_comparison_report(
@rxu17 rxu17 Apr 17, 2023


All the string formatting messages scattered about are not pretty or ideal, but I tried to group and limit them to just this function and compare_datasets_by_data_type. That said, I am happy to reorganize; I have been staring at them for too long...

Member:

I think this is great!

@rxu17 rxu17 requested a review from thomasyu888 April 17, 2023 22:08

rxu17 commented Apr 20, 2023

Added functionality to save, in the same area as the high-level csv:

  • csv files of all of the rows that differ between the staging and main datasets
  • csv files of the column values that differ
  • csv files of the duplicate rows in each dataset

I am going to convert this PR from a draft into a final one. Going to make a couple more test datasets first...

You can view an example run at:

  • recover-dev-processed-data/etl-409/comparison_result/dataset_healthkitv2samples/parquet_compare.txt
  • recover-dev-processed-data/etl-409/comparison_result/dataset_healthkitv2samples/all_diff_main_rows.csv
  • recover-dev-processed-data/etl-409/comparison_result/dataset_healthkitv2samples/all_dup_staging_rows.csv
  • recover-dev-processed-data/etl-409/comparison_result/dataset_healthkitv2samples/all_dup_main_rows.csv
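The row-level outputs listed above could be produced with plain pandas along these lines (a sketch: save_row_diffs and the local out_prefix paths are illustrative; the real script writes to S3):

```python
import pandas as pd

def save_row_diffs(staging: pd.DataFrame, main: pd.DataFrame, out_prefix: str) -> None:
    """Sketch: write out duplicate rows and main-only rows, mirroring the csv files above."""
    # rows duplicated within each dataset (keep=False marks every copy)
    staging[staging.duplicated(keep=False)].to_csv(
        f"{out_prefix}/all_dup_staging_rows.csv", index=False
    )
    main[main.duplicated(keep=False)].to_csv(
        f"{out_prefix}/all_dup_main_rows.csv", index=False
    )
    # rows present in main but absent from staging: left merge on all shared
    # columns, then keep the rows the merge indicator flags as "left_only"
    merged = main.merge(staging, how="left", indicator=True)
    merged[merged["_merge"] == "left_only"].drop(columns="_merge").to_csv(
        f"{out_prefix}/all_diff_main_rows.csv", index=False
    )
```

Note this compares full-row equality only; with duplicate matches the merge can repeat rows, which is acceptable for a diagnostic dump like this.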

@rxu17 rxu17 marked this pull request as ready for review April 20, 2023 21:25
@rxu17 rxu17 requested a review from a team as a code owner April 20, 2023 21:25
@rxu17 rxu17 changed the title [ETL-409] Create python comparison script (WIP) [ETL-409] Create python comparison script Apr 20, 2023
@thomasyu888 thomasyu888 left a comment


🦖 LGTM! Tremendous effort!


rxu17 commented Apr 24, 2023

I am running into numpy.core._exceptions._ArrayMemoryError: Unable to allocate 4.43 GiB for an array with shape (71, 8378733) and data type object with the healthkitv2samples data type, so this needs further investigation.

I also ran into a Command Failed due to Out of Memory error when I tried to bypass the datasets that hit the above exception and just went dataset by dataset, saving their comparisons...

This is after running the glue job on all of the datasets, even after including Phil's updates with the dropped duplicates and deleted samples. To avoid blowing up this PR more than needed: if everyone is okay with the content of this ticket (it does work on small datasets like the ones for ETL-409, and the scope of this ticket is just parquet comparison), I will merge it and then create a separate ticket to look into the memory issues.

@thomasyu888

@rxu17 I am ok with merging this and creating a separate ticket to track other work.

@rxu17 rxu17 merged commit 0ab9b99 into main Apr 27, 2023
@rxu17 rxu17 deleted the etl-409 branch April 27, 2023 19:31
3 participants