[ETL-616] Implement Great Expectations to run on parquet data #139
Purpose:
This draft PR adds the Great Expectations (GX) Parquet Glue jobs to the Recover ETL workflow. When the JSON to Parquet workflow finishes, it runs the GX job for each data type.
Currently this only supports running GX with expectations for the fitbitdailydata and healthkitv2workouts datasets. The jobs for all other data types will error out.
Changes:
Highlights big changes
New Code:
add_data_docs_sites and add_validation_results_to_store functions. These let us add validation results to the validation store and have the data docs (our GX report) render them. Because we use an EphemeralDataContext context object, these functions let us do this without having to create checkpoints, a GX config file (which would likely require us to conform to a specific GX repo structure), etc. If we prefer to switch to a more persistent data context object like FileDataContext, that could be explored further in this ticket.
Changes to old Code:
boto3 and urllib3 2.0. We need to pin urllib3<2
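For reviewers unfamiliar with how data docs sites are configured, here is a minimal sketch of the kind of site configuration a function like add_data_docs_sites might register on the context. It is not the code from this PR. The helper name build_s3_data_docs_site, the "s3_site" key, and the bucket and prefix values are hypothetical; the class_name values follow the site configuration layout used by GX 0.x releases, which may differ in other versions. How the resulting dict gets attached to an EphemeralDataContext depends on the GX version and is deliberately left out.

```python
from typing import Any, Dict


def build_s3_data_docs_site(bucket: str, prefix: str) -> Dict[str, Any]:
    """Build a data docs site config that renders reports to S3.

    Hypothetical helper for illustration only: the function name,
    its arguments, and the "s3_site" key are assumptions and are
    not taken from this PR.
    """
    if not bucket:
        raise ValueError("bucket must be a non-empty string")
    return {
        "s3_site": {
            # Renders validation results into HTML pages.
            "class_name": "SiteBuilder",
            # Where the rendered pages are written.
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": bucket,
                "prefix": prefix,
            },
            # Builds the index page that links to each report.
            "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
        }
    }


if __name__ == "__main__":
    # Placeholder bucket and prefix, not the project's real values.
    site = build_s3_data_docs_site("example-bucket", "example-namespace/reports/")
    print(site["s3_site"]["store_backend"])
```

Keeping the configuration in a plain dict like this is one way to avoid a GX config file on disk, which matches the motivation given for using an ephemeral context.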
Tests:
Viewable reports at (with AWS VPN turned on):
Sample screenshots of a report:
EDIT: Also added the tests to our CI/CD, since previously not all of them were run automatically each time.