Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ETL-611] Raw sync lambda #141

Merged
merged 5 commits into from
Sep 30, 2024
Merged

[ETL-611] Raw sync lambda #141

merged 5 commits into from
Sep 30, 2024

Conversation

philerooski
Copy link
Contributor

@philerooski philerooski commented Sep 17, 2024

Raw Sync Lambda

This Lambda verifies that the input and raw S3 buckets are synchronized. It's triggered by a Cloudwatch Events rule at midnight UTC each day.

This is accomplished by verifying that all non-zero sized JSON in each export in the input S3 bucket have a corresponding object in the raw S3 bucket. Because we only download the central directory, located near the end of a zip archive, verification can be done extremely quickly and without needing to download most of the export. If a JSON file from an export is found to not have a corresponding object in the raw bucket, the export is submitted to the raw Lambda (via the dispatch SNS topic) for processing.

return
elif key_format == "input" and len(key_components) == 3:
cohort = key_components[1]
result[cohort].append(key)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this function not return anything? Are you taking advantage of the mutable dict that you pass in through memory? I feel like that adds vulnerability in the code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function modifies the result dict in-memory.

I feel like that adds vulnerability in the code

How so? I could make a copy of the dict, update the copy, and then return -- but what's the benefit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I did think of a small benefit. Returning a copy makes the variable update in list_s3_objects more explicit, and creating a shallow copy within the function can be done without any additional imports and with just one extra line of code.

Copy link
Member

@thomasyu888 thomasyu888 Sep 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably more relevant when making mutable default arguments, but I prefer the way you've re-written it to make the function self contained
https://docs.python-guide.org/writing/gotchas/#:~:text=Python's%20default%20arguments%20are%20evaluated,to%20the%20function%20as%20well.

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM, I'm going to let @rxu17 do a review here, but I took a look at the tests and they look good.

The most concern I have is around data quality and making sure that this change doesn't introduce bugs in the production data. I wonder if we need GX truly running in production so we can be certain changes don't add unintended bugs.

Copy link
Contributor

@rxu17 rxu17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just did a first pass and had some comments/questions. Great work so far!

src/lambda_function/raw_sync/README.md Show resolved Hide resolved
tests/test_lambda_raw_sync.py Outdated Show resolved Hide resolved
src/lambda_function/raw_sync/app.py Show resolved Hide resolved
src/lambda_function/raw_sync/app.py Show resolved Hide resolved
tests/test_lambda_raw_sync.py Show resolved Hide resolved
src/lambda_function/raw_sync/app.py Show resolved Hide resolved
Copy link

sonarcloud bot commented Sep 27, 2024

Copy link
Contributor

@rxu17 rxu17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@philerooski philerooski merged commit 2a766e2 into main Sep 30, 2024
18 checks passed
@philerooski philerooski deleted the etl-611 branch September 30, 2024 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants