[ETL-409] Create python comparison script #42
Conversation
…d for getting folders from s3, add add. test coverage
Excellent work! Please see comments.
import pandas as pd


def get_duplicated_columns(dataset: pd.DataFrame) -> list:
    """Gets a list of duplicated columns in a dataframe"""
    return dataset.columns[dataset.columns.duplicated()].tolist()


def has_common_cols(staging_dataset: pd.DataFrame, main_dataset: pd.DataFrame) -> bool:
    """Checks whether two dataframes have any columns in common"""
    common_cols = staging_dataset.columns.intersection(main_dataset.columns).tolist()
    return common_cols != []


def get_missing_cols(staging_dataset: pd.DataFrame, main_dataset: pd.DataFrame) -> list:
    """Gets the list of columns present in main but not in staging"""
    missing_cols = main_dataset.columns.difference(staging_dataset.columns).tolist()
    return missing_cols
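A quick usage sketch of these helpers on toy frames (the sample columns and values are made up for illustration):

staging = pd.DataFrame({"id": [1], "steps": [100]})
main = pd.DataFrame({"id": [1], "steps": [100], "calories": [30]})

print(get_duplicated_columns(main))    # [] -- no duplicated column names
print(has_common_cols(staging, main))  # True -- "id" and "steps" overlap
print(get_missing_cols(staging, main)) # ['calories'] -- in main, not staging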
I would say if we were doing the data cleaning, this could be represented with something like: https://aws.amazon.com/glue/features/databrew/. Also, since there are only 3 functions, we don't need to think too heavily about this here, but since data quality is important to us, we may want to look into exploring tools like:
- Cerberus: https://docs.python-cerberus.org/en/stable/
- Great Expectations: https://docs.greatexpectations.io/docs/tutorials/quickstart/
That said, the tools above may be too complicated for the current task at hand. For context, a minimal sketch of the row-level schema validation Cerberus provides follows; the field names and rules are hypothetical.
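from cerberus import Validator

# Hypothetical schema for one record of an activity-log dataset;
# the field names and constraints are made up for illustration.
schema = {
    "ParticipantIdentifier": {"type": "string", "required": True},
    "StartDate": {"type": "string", "required": True},
    "Steps": {"type": "integer", "min": 0},
}

validator = Validator(schema)
record = {"ParticipantIdentifier": "MTB-0001", "StartDate": "2023-01-01", "Steps": 1200}
if not validator.validate(record):
    print(validator.errors)  # dict mapping each field to its list of violations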
Just yesterday, I found that datacompy.Compare has functions that handle some of the above (e.g. it gets the columns unique to one dataset but not the other, here: https://github.com/capitalone/datacompy/blob/develop/datacompy/core.py#L220-L230). It's marked as a TODO because I feel like this PR has already blown up so much, and making those changes would be a bit more involved.
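For reference, a hedged sketch of how datacompy surfaces those column differences; the sample frames here are made up:

import datacompy
import pandas as pd

# Toy frames standing in for the staging and main parquet datasets
staging = pd.DataFrame({"id": [1, 2], "steps": [100, 200]})
main = pd.DataFrame({"id": [1, 2], "steps": [100, 200], "calories": [30, 60]})

compare = datacompy.Compare(
    staging, main, join_columns="id", df1_name="staging", df2_name="main"
)
print(compare.df1_unq_columns())  # columns only in staging
print(compare.df2_unq_columns())  # columns only in main -> {'calories'}
print(compare.report())           # the full human-readable comparison report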
…ort for edge scenarios like no data types in common
…unc for input args, update syntax and make func params more robust, clean up string formatting, move s3 file path def to function
…are_datasets_by_data_type
return compare.report()


def add_additional_msg_to_comparison_report(
All the string-formatting messages scattered about are not pretty/ideal, but I tried to group/limit them to just this function and compare_datasets_by_data_type. That being said, happy to re-org... I have been staring at them for too long...
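A rough sketch of what a helper like this might look like; only the function name comes from the diff above, so the parameters and message layout are assumptions, not the PR's actual implementation:

def add_additional_msg_to_comparison_report(
    comparison_report: str, additional_msgs: list, msg_type: str
) -> str:
    """Appends extra context (e.g., duplicate or missing columns) to the
    datacompy report string so all output lands in one place."""
    # Hypothetical layout: a header per message type, one line per message
    header = f"\n\n{msg_type} messages:\n"
    return comparison_report + header + "\n".join(additional_msgs)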
I think this is great!
Added functionality to:
I am going to convert this PR from a draft into a final one. Going to make a couple more test datasets first... You can view an example run at:
🦖 LGTM! Tremendous effort!
…n func, remove unused lib
I am running into memory errors. This is after running the glue job on all of the data sets, even after including Phil's updates with the drop duplicates and deleted samples. To avoid blowing up this PR more than needed, if everyone is okay with the content of this ticket (it does work on small datasets like the ones for ETL-409, and the scope of this ticket is just for parquet comparison), I will merge it, and then create a separate ticket to look into the memory issues.
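One possible direction for that follow-up ticket (purely a sketch, not what this PR does): stream the parquet in record batches with pyarrow instead of loading whole datasets; the file path below is hypothetical.

import pyarrow.parquet as pq

# iter_batches streams the file in chunks so peak memory stays bounded
parquet_file = pq.ParquetFile("dataset_fitbitactivitylogs.parquet")
for batch in parquet_file.iter_batches(batch_size=100_000):
    chunk = batch.to_pandas()
    # ...run the per-chunk comparison/aggregation here, then drop the chunk...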
@rxu17 I am ok with merging this and creating a separate ticket to track other work.
Hello! The purpose of this script is to compare the parquet datasets by data type in the established "main" namespace and the new "staging" namespace of the processed data (post-ETL) bucket.
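For orientation, a hedged sketch of the kind of read this implies; the bucket name and prefix layout are made up, and pandas only resolves s3:// paths when s3fs is installed:

import pandas as pd

staging_dataset = pd.read_parquet(
    "s3://processed-data-bucket/staging/parquet/dataset_fitbitactivitylogs/"
)
main_dataset = pd.read_parquet(
    "s3://processed-data-bucket/main/parquet/dataset_fitbitactivitylogs/"
)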
This is just a draft PR because the tests are still in progress / need some troubleshooting / should get grouped into classes for better organization, but most/all of the code and functionality is complete. However, I would love to get some feedback/organizational ideas (for example: I feel like some of the functions could just be moved into a common utils module, but if they never get used again in this repo, or if they are ever refactored, that might be harder to undo later). This PR also unfortunately got pretty big because there were a bunch of edge cases and flow logic to consider when comparing two datasets.
This code contains the following main functionality:
Does the following comparisons by dataset data type (e.g. dataset_fitbitactivitylogs), ONLY if the datasets meet the following conditions:
Comparisons done:
Added some comments about thoughts on certain code. Also, any Glue-related code will be added as part of ETL-406.