[ETL-611] Raw sync lambda #141
        return
    elif key_format == "input" and len(key_components) == 3:
        cohort = key_components[1]
        result[cohort].append(key)
Does this function not return anything? Are you taking advantage of the mutable dict that you pass in through memory? I feel like that adds vulnerability in the code
The function modifies the result dict in place.
"I feel like that adds vulnerability in the code"
How so? I could make a copy of the dict, update the copy, and then return -- but what's the benefit?
Actually, I did think of a small benefit. Returning a copy makes the variable update in list_s3_objects more explicit, and creating a shallow copy within the function can be done without any additional imports and with just one extra line of code.
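A minimal sketch of the copy-and-return approach discussed above. The function name add_key_to_cohort and the three-part key layout are assumptions made for illustration, not the code in this PR. Note that a shallow copy creates a new dict but still shares the list values, so this sketch builds a new list instead of appending to the existing one.

```python
def add_key_to_cohort(key, key_format, result):
    """Return a shallow copy of `result` with `key` filed under its cohort.

    Hypothetical helper: the name and the "<namespace>/<cohort>/<file>"
    key layout are assumptions for this example.
    """
    updated = result.copy()  # new dict object; the list values are still shared
    key_components = key.split("/")
    if key_format == "input" and len(key_components) == 3:
        cohort = key_components[1]
        # Build a new list rather than appending in place, so the
        # caller's original lists are left untouched.
        updated[cohort] = updated.get(cohort, []) + [key]
    return updated


original = {}
updated = add_key_to_cohort("namespace/adults_v1/export.zip", "input", original)
```

The caller then rebinds its own variable to the returned dict, which is what makes the update visible at the call site.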
This is probably more relevant when using mutable default arguments, but I prefer the way you've rewritten it to make the function self-contained.
https://docs.python-guide.org/writing/gotchas/#:~:text=Python's%20default%20arguments%20are%20evaluated,to%20the%20function%20as%20well.
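For reference, the mutable-default-argument gotcha described at that link can be reproduced with a short sketch. The function names collect_shared and collect_fresh are hypothetical and are not part of this PR.

```python
def collect_shared(key, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is then reused by every call that omits `bucket`.
    bucket.append(key)
    return bucket


def collect_fresh(key, bucket=None):
    # Using None as a sentinel gives each call its own new list.
    if bucket is None:
        bucket = []
    bucket.append(key)
    return bucket


first = collect_shared("a")
second = collect_shared("b")
# `first` and `second` are the same list object, which now holds both keys.
```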
🔥 LGTM, I'm going to let @rxu17 do a review here, but I took a look at the tests and they look good.
My main concern is around data quality and making sure that this change doesn't introduce bugs in the production data. I wonder if we need GX truly running in production so we can be certain changes don't add unintended bugs.
Just did a first pass and had some comments/questions. Great work so far!
LGTM!
Raw Sync Lambda
This Lambda verifies that the input and raw S3 buckets are synchronized. It's triggered by a CloudWatch Events rule at midnight UTC each day.
This is accomplished by verifying that every non-zero-sized JSON file in each export in the input S3 bucket has a corresponding object in the raw S3 bucket. Because we only download the central directory, located near the end of a zip archive, verification can be done very quickly and without downloading most of the export. If a JSON file from an export does not have a corresponding object in the raw bucket, the export is submitted to the raw Lambda (via the dispatch SNS topic) for processing.
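The central-directory approach can be sketched as follows. This is an illustration under stated assumptions, not the Lambda's actual code: read_range is a hypothetical stand-in for a ranged S3 GET, the archive is held in memory so the example runs on its own, the file names are made up, and ZIP64 archives are not handled.

```python
import io
import struct
import zipfile

EOCD_FORMAT = "<4s4H2LH"            # end of central directory record (22 bytes)
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)
CD_HEADER_FORMAT = "<4s6H3L5H2L"    # central directory file header (46 bytes)
CD_HEADER_SIZE = struct.calcsize(CD_HEADER_FORMAT)


def read_range(blob, start, end):
    """Hypothetical stand-in for a ranged GET; returns bytes [start, end)."""
    return blob[start:end]


def list_nonempty_json(blob, tail_size=65536 + EOCD_SIZE):
    """List non-empty .json members by reading only the archive's tail."""
    total = len(blob)  # with S3 this would come from the object's metadata
    tail = read_range(blob, max(0, total - tail_size), total)
    eocd_pos = tail.rfind(b"PK\x05\x06")
    if eocd_pos < 0:
        raise ValueError("end of central directory record not found")
    eocd = struct.unpack(EOCD_FORMAT, tail[eocd_pos:eocd_pos + EOCD_SIZE])
    cd_size, cd_offset = eocd[5], eocd[6]
    # Second ranged read: just the central directory itself.
    central_directory = read_range(blob, cd_offset, cd_offset + cd_size)

    names = []
    pos = 0
    while pos + CD_HEADER_SIZE <= len(central_directory):
        header = struct.unpack(
            CD_HEADER_FORMAT, central_directory[pos:pos + CD_HEADER_SIZE]
        )
        if header[0] != b"PK\x01\x02":
            break
        uncompressed_size = header[9]
        name_len, extra_len, comment_len = header[10], header[11], header[12]
        name_start = pos + CD_HEADER_SIZE
        name = central_directory[name_start:name_start + name_len].decode("utf-8")
        if name.endswith(".json") and uncompressed_size > 0:
            names.append(name)
        # Skip the variable-length fields to reach the next header.
        pos = name_start + name_len + extra_len + comment_len
    return names


# Build a small archive in memory to exercise the sketch.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("HealthKitV2Samples.json", '{"a": 1}')
    archive.writestr("Empty.json", "")
    archive.writestr("Manifest.txt", "not json")
names = list_nonempty_json(buffer.getvalue())
```

Each name returned this way would then be checked against the raw bucket; how the corresponding raw object key is derived is not shown here.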