
[ETL-409] Create python comparison script #42

Merged
merged 19 commits into from
Apr 27, 2023

Conversation


@rxu17 rxu17 commented Apr 14, 2023

Hello! The purpose of this script is to compare the parquet datasets by data type in the established "main" namespace and the new "staging" namespace of the processed data (post-ETL) bucket.

This is just a draft PR because the tests are still in progress, need some troubleshooting, and should be grouped into classes for better organization, but most, if not all, of the code and functionality is complete. I would love feedback and organizational ideas. For example, I feel some of the functions could be moved into a common utils module, but if they are never reused in this repo, or if they are later refactored, that move might make things harder. This PR also unfortunately got pretty big because there were a bunch of edge cases and comparison-flow logic to consider when comparing two datasets.

This code contains the following main functionality:

  • Compares the data type folders present in each namespace

Does the following comparisons by dataset data type (e.g. dataset_fitbitactivitylogs), ONLY if the datasets meet the following conditions:

  • Neither dataset is empty
  • The datasets have columns in common, since a chunk of the comparisons relies on shared columns

Comparisons done:

  • Compares the two datasets' column data types
  • Compares the two datasets' row counts
  • Compares the two datasets' column names
  • Compares the two datasets' column values (once they have the same dimensions)
  • Compares the two datasets' row values (once they have the same dimensions)
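As a pandas-only illustration of the checks listed above (the function and result-key names here are mine, not the script's):

```python
import pandas as pd

def compare_datasets(staging: pd.DataFrame, main: pd.DataFrame) -> dict:
    """Illustrative sketch of the per-dataset comparisons listed above."""
    common = staging.columns.intersection(main.columns)
    result = {
        # column data types (for the columns both datasets share)
        "dtype_diffs": {
            col: (str(staging[col].dtype), str(main[col].dtype))
            for col in common
            if staging[col].dtype != main[col].dtype
        },
        # row counts
        "row_count_diff": len(staging) - len(main),
        # column names
        "cols_only_in_main": main.columns.difference(staging.columns).tolist(),
        "cols_only_in_staging": staging.columns.difference(main.columns).tolist(),
    }
    # column/row values are only compared once the dimensions match
    if staging.shape == main.shape and list(staging.columns) == list(main.columns):
        result["values_equal"] = staging.reset_index(drop=True).equals(
            main.reset_index(drop=True)
        )
    return result
```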

I added some comments about my thinking on certain code. Any Glue-related code will be added as part of ETL-406.

@rxu17 rxu17 changed the title [ETL-409] Create python comparison script [ETL-409] Create python comparison script (WIP) Apr 14, 2023

@thomasyu888 thomasyu888 left a comment


Excellent work! Please see comments.

Comment on lines 64 to 78
import pandas as pd


def get_duplicated_columns(dataset: pd.DataFrame) -> list:
    """Gets a list of duplicated column labels in a dataframe"""
    return dataset.columns[dataset.columns.duplicated()].tolist()


def has_common_cols(staging_dataset: pd.DataFrame, main_dataset: pd.DataFrame) -> bool:
    """Checks whether the two dataframes share any columns"""
    common_cols = staging_dataset.columns.intersection(main_dataset.columns).tolist()
    return common_cols != []


def get_missing_cols(staging_dataset: pd.DataFrame, main_dataset: pd.DataFrame) -> list:
    """Gets the list of columns present in main but missing from staging"""
    missing_cols = main_dataset.columns.difference(staging_dataset.columns).tolist()
    return missing_cols
Member:

I would say that if we were doing the data cleaning, this could be represented with something like https://aws.amazon.com/glue/features/databrew/. Also, since there are only three functions, we don't need to think too heavily about this here, but as data quality is important to us, we may want to look into exploring tools like:

That said, those tools may be too complicated for the current task at hand.

Contributor Author:

Just yesterday, I found that datacompy.Compare has functions that handle some of the above (e.g. it gets the columns unique to one dataset here: https://github.com/capitalone/datacompy/blob/develop/datacompy/core.py#L220-L230). It's marked as a TODO because this PR has blown up so much already, and making those changes is a bit more involved.

return compare.report()


def add_additional_msg_to_comparison_report(
@rxu17 rxu17 Apr 17, 2023


All the string formatting messages scattered about are not pretty or ideal, but I tried to group and limit them to just this function and compare_datasets_by_data_type. That said, I am happy to reorganize; I have been staring at them for too long...

Member:

I think this is great!

@rxu17 rxu17 requested a review from thomasyu888 April 17, 2023 22:08

rxu17 commented Apr 20, 2023

Added functionality to save, in the same area as the high-level csv:

  • csv files of all of the rows that differ between the staging and main datasets
  • csv files of the column values that differ
  • csv files of the duplicate rows in each dataset

I am going to convert this PR from a draft into a final one. Going to make a couple more test datasets first...

You can view an example run at:

  • recover-dev-processed-data/etl-409/comparison_result/dataset_healthkitv2samples/parquet_compare.txt
  • recover-dev-processed-data/etl-409/comparison_result/dataset_healthkitv2samples/all_diff_main_rows.csv
  • recover-dev-processed-data/etl-409/comparison_result/dataset_healthkitv2samples/all_dup_staging_rows.csv
  • recover-dev-processed-data/etl-409/comparison_result/dataset_healthkitv2samples/all_dup_main_rows.csv
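The row-level outputs listed above could be produced with plain pandas along these lines (a sketch: save_row_diffs and the local out_prefix paths are illustrative; the real script writes to S3):

```python
import pandas as pd

def save_row_diffs(staging: pd.DataFrame, main: pd.DataFrame, out_prefix: str) -> None:
    """Sketch: write out duplicate rows and main-only rows, mirroring the csv files above."""
    # rows duplicated within each dataset (keep=False marks every copy)
    staging[staging.duplicated(keep=False)].to_csv(
        f"{out_prefix}/all_dup_staging_rows.csv", index=False
    )
    main[main.duplicated(keep=False)].to_csv(
        f"{out_prefix}/all_dup_main_rows.csv", index=False
    )
    # rows present in main but absent from staging: left merge on all shared
    # columns, then keep the rows the merge indicator flags as "left_only"
    merged = main.merge(staging, how="left", indicator=True)
    merged[merged["_merge"] == "left_only"].drop(columns="_merge").to_csv(
        f"{out_prefix}/all_diff_main_rows.csv", index=False
    )
```

Note this compares full-row equality only; with duplicate matches the merge can repeat rows, which is acceptable for a diagnostic dump like this.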

@rxu17 rxu17 marked this pull request as ready for review April 20, 2023 21:25
@rxu17 rxu17 requested a review from a team as a code owner April 20, 2023 21:25
@rxu17 rxu17 changed the title [ETL-409] Create python comparison script (WIP) [ETL-409] Create python comparison script Apr 20, 2023
@thomasyu888 thomasyu888 left a comment


🦖 LGTM! Tremendous effort!


rxu17 commented Apr 24, 2023

I am running into numpy.core._exceptions._ArrayMemoryError: Unable to allocate 4.43 GiB for an array with shape (71, 8378733) and data type object with the healthkitv2samples data type, so this needs further investigation.

I also ran into a Command Failed due to Out of Memory error when I tried to bypass the datasets that hit the above exception and just went dataset by dataset, saving their comparisons...

This is after running the glue job on all of the datasets, even after including Phil's updates with the dropped duplicates and deleted samples. To avoid blowing up this PR more than needed: if everyone is okay with the content of this ticket (it does work on small datasets like the ones for ETL-409, and the scope of this ticket is just parquet comparison), I will merge it and then create a separate ticket to look into the memory issues.

@thomasyu888

@rxu17 I am ok with merging this and creating a separate ticket to track other work.

@rxu17 rxu17 merged commit 0ab9b99 into main Apr 27, 2023
@rxu17 rxu17 deleted the etl-409 branch April 27, 2023 19:31
3 participants