[ETL-611] Raw sync lambda #141

philerooski · 2024-09-17T23:34:55Z

Raw Sync Lambda

This Lambda verifies that the input and raw S3 buckets are synchronized. It is triggered by a CloudWatch Events rule at midnight UTC each day.

This is accomplished by verifying that every non-empty JSON file in each export in the input S3 bucket has a corresponding object in the raw S3 bucket. Because we only download the central directory, which is located near the end of a zip archive, verification is fast and does not require downloading most of the export. If a JSON file from an export has no corresponding object in the raw bucket, the export is submitted to the raw Lambda (via the dispatch SNS topic) for processing.
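For readers unfamiliar with the zip layout, here is a minimal sketch of how the central directory can be used to list an archive's files and their sizes without reading any file contents. It is not the code in this PR: the function names and the in-memory demo archive are hypothetical. It also assumes a single-disk, non-Zip64 archive with no archive comment containing the end-of-central-directory signature. In the Lambda the two byte slices would presumably come from ranged reads against S3 rather than from an in-memory buffer.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"       # end of central directory record
EOCD_FORMAT = "<4s4H2IH"             # 22 bytes, little-endian
CD_HEADER_SIGNATURE = b"PK\x01\x02"  # central directory file header
CD_HEADER_FORMAT = "<4s6H3I5H2I"     # 46 bytes, little-endian
CD_HEADER_SIZE = struct.calcsize(CD_HEADER_FORMAT)


def locate_central_directory(tail: bytes) -> tuple[int, int]:
    """Return (offset, size) of the central directory.

    `tail` holds the last bytes of the archive; the offset is measured
    from the start of the archive, not from the start of `tail`.
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos == -1:
        raise ValueError("end of central directory record not found")
    record = tail[pos:pos + struct.calcsize(EOCD_FORMAT)]
    fields = struct.unpack(EOCD_FORMAT, record)
    cd_size, cd_offset = fields[5], fields[6]
    return cd_offset, cd_size


def list_entries(central_directory: bytes) -> dict[str, int]:
    """Map each file name in the central directory to its uncompressed size."""
    entries = {}
    pos = 0
    while pos + CD_HEADER_SIZE <= len(central_directory):
        header = struct.unpack(
            CD_HEADER_FORMAT, central_directory[pos:pos + CD_HEADER_SIZE]
        )
        if header[0] != CD_HEADER_SIGNATURE:
            break
        uncompressed_size = header[9]
        name_len, extra_len, comment_len = header[10], header[11], header[12]
        name_start = pos + CD_HEADER_SIZE
        name = central_directory[name_start:name_start + name_len].decode("utf-8")
        entries[name] = uncompressed_size
        # Skip the variable-length name, extra field, and comment.
        pos = name_start + name_len + extra_len + comment_len
    return entries


def non_empty_json(entries: dict[str, int]) -> list[str]:
    """Keep only JSON files with a non-zero uncompressed size."""
    return [n for n, size in entries.items() if n.endswith(".json") and size > 0]


def build_demo_archive() -> bytes:
    """Build a small in-memory archive standing in for an export (hypothetical)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("steps.json", b'{"n": 1}')
        archive.writestr("empty.json", b"")
        archive.writestr("notes.txt", b"hi")
    return buffer.getvalue()


archive_bytes = build_demo_archive()
offset, size = locate_central_directory(archive_bytes[-1024:])
entries = list_entries(archive_bytes[offset:offset + size])
print(non_empty_json(entries))
```

Only the archive's tail and the central directory slice are read, which is why the check stays cheap even for large exports.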

thomasyu888 · 2024-09-24T06:54:28Z

src/lambda_function/raw_sync/app.py

+                return
+        elif key_format == "input" and len(key_components) == 3:
+            cohort = key_components[1]
+            result[cohort].append(key)


Does this function not return anything? Are you taking advantage of the mutable dict that you pass in through memory? I feel like that adds vulnerability in the code

The function modifies the result dict in-memory.

> I feel like that adds vulnerability in the code

How so? I could make a copy of the dict, update the copy, and then return -- but what's the benefit?

Actually, I did think of a small benefit. Returning a copy makes the variable update in list_s3_objects more explicit, and creating a shallow copy within the function can be done without any additional imports and with just one extra line of code.

This is probably more relevant to mutable default arguments, but I prefer the way you've rewritten it to make the function self-contained:
https://docs.python-guide.org/writing/gotchas/#:~:text=Python's%20default%20arguments%20are%20evaluated,to%20the%20function%20as%20well.
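To illustrate the trade-off discussed in this thread, here is a minimal sketch of a function that returns an updated copy instead of mutating its argument. The function name `append_input_key` and the example keys are hypothetical and not taken from the PR; only the three-component key check mirrors the diff above. A plain shallow copy of the dict would still share its inner lists with the original, so this sketch copies each list as well.

```python
def append_input_key(result: dict[str, list[str]], key: str) -> dict[str, list[str]]:
    """Return a copy of `result` with `key` filed under its cohort.

    The caller's dict and its lists are left untouched, so the update
    only becomes visible through the return value.
    """
    # Copy the dict and every list in it; dict.copy() alone would
    # leave the lists shared between the original and the copy.
    updated = {cohort: list(keys) for cohort, keys in result.items()}
    key_components = key.split("/")
    if len(key_components) == 3:
        cohort = key_components[1]
        updated.setdefault(cohort, []).append(key)
    return updated


original = {"adults_v1": ["main/adults_v1/export_a.zip"]}
updated = append_input_key(original, "main/adults_v1/export_b.zip")
print(original)
print(updated)
```

The cost is one extra copy per call, which is the price of making the assignment at the call site explicit.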

thomasyu888

🔥 LGTM, I'm going to let @rxu17 do a review here, but I took a look at the tests and they look good.

My main concern is data quality: making sure that this change doesn't introduce bugs in the production data. I wonder if we need GX actually running in production so we can be certain that changes don't introduce unintended bugs.

rxu17

Just did a first pass and had some comments/questions. Great work so far!

src/lambda_function/raw_sync/README.md

tests/test_lambda_raw_sync.py

src/lambda_function/raw_sync/app.py

tests/test_lambda_raw_sync.py

src/lambda_function/raw_sync/app.py

sonarcloud · 2024-09-27T18:16:03Z

Quality Gate passed

Issues
72 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

rxu17

LGTM!

philerooski temporarily deployed to develop September 17, 2024 23:35 — with GitHub Actions Inactive

philerooski had a problem deploying to develop September 17, 2024 23:37 — with GitHub Actions Error

philerooski had a problem deploying to develop September 17, 2024 23:37 — with GitHub Actions Failure

philerooski force-pushed the etl-611 branch from 19b7a58 to 2caef66 Compare September 17, 2024 23:42

philerooski temporarily deployed to develop September 17, 2024 23:42 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 17, 2024 23:45 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 17, 2024 23:52 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 17, 2024 23:56 — with GitHub Actions Inactive

philerooski force-pushed the etl-611 branch from 2caef66 to b2acc74 Compare September 18, 2024 16:09

philerooski had a problem deploying to develop September 18, 2024 16:10 — with GitHub Actions Failure

philerooski temporarily deployed to develop September 18, 2024 16:10 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 18, 2024 16:12 — with GitHub Actions Inactive

philerooski force-pushed the etl-611 branch from b2acc74 to 9f5f1c1 Compare September 18, 2024 16:23

philerooski temporarily deployed to develop September 18, 2024 16:23 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 18, 2024 16:26 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 19, 2024 03:12 — with GitHub Actions Inactive

philerooski marked this pull request as ready for review September 19, 2024 03:19

philerooski requested a review from a team as a code owner September 19, 2024 03:19

philerooski temporarily deployed to develop September 19, 2024 03:27 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 19, 2024 03:29 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 19, 2024 03:30 — with GitHub Actions Inactive

thomasyu888 requested a review from rxu17 September 20, 2024 21:51

thomasyu888 reviewed Sep 24, 2024

src/lambda_function/raw_sync/app.py (resolved)

thomasyu888 reviewed Sep 24, 2024

src/lambda_function/raw_sync/app.py (outdated, resolved)

thomasyu888 approved these changes Sep 24, 2024

rxu17 reviewed Sep 26, 2024

raw sync lambda minor improvements

ab02435

philerooski temporarily deployed to develop September 27, 2024 18:16 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 27, 2024 18:19 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 27, 2024 18:33 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 27, 2024 18:36 — with GitHub Actions Inactive

philerooski temporarily deployed to develop September 27, 2024 18:37 — with GitHub Actions Inactive

rxu17 self-requested a review September 27, 2024 23:45

rxu17 approved these changes Sep 27, 2024

philerooski merged commit 2a766e2 into main Sep 30, 2024
18 checks passed

philerooski deleted the etl-611 branch September 30, 2024 16:11

philerooski mentioned this pull request Oct 2, 2024

clean up raw bucket before integration test #145

Merged

philerooski mentioned this pull request Oct 16, 2024

Enable raw sync events rule #150

Merged
