Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ETL-397] Create s3-event-config lambda and dependencies to add s3 notification configuration #50

Merged
merged 7 commits into from
May 16, 2023

Conversation

rxu17
Copy link
Contributor

@rxu17 rxu17 commented May 10, 2023

Purpose: This PR adds the following resources seen here: ETL-397.

This allows us to add the S3 notification configuration to our existing input data buckets (via the new lambda), and also allow our S3 to glue lambda to now be triggered by those S3 new object notifications and start our recover ETL processing. See the ticket for why we were not able to add those changes to our pre-existing lambda resources/stacks/templates and had to do it through a separate lambda function.

I put this PR up as a draft in case there's some refactoring/structural changes we would like to make. Still testing it on the pipeline but it works for one dataset I uploaded. Also I separated out the implementation of the GH action to trigger this lambda here in ETL-446

@rxu17 rxu17 temporarily deployed to develop May 11, 2023 01:21 — with GitHub Actions Inactive
@rxu17 rxu17 temporarily deployed to develop May 11, 2023 01:23 — with GitHub Actions Inactive
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great start, some initial comments.

src/lambda_function/s3_event_config/app.py Outdated Show resolved Hide resolved
src/lambda_function/s3_event_config/app.py Outdated Show resolved Hide resolved
src/lambda_function/s3_event_config/app.py Outdated Show resolved Hide resolved
logger.info("Put request completed....")


def delete_notification(bucket: str) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just want to make sure that I understand that this is a lambda that adds the lambda function that will trigger jobs for S3 event notifications.

If so, how will the delete notification be triggered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image A rough diagram of where this all falls in place. The delete notification is in place in case we'd like to delete the notification configuration/would like a more automated way of deleting the configuration. If we'd rather do that via AWS console, we can remove that code.

@rxu17 rxu17 temporarily deployed to develop May 11, 2023 18:29 — with GitHub Actions Inactive
@rxu17 rxu17 temporarily deployed to develop May 11, 2023 18:31 — with GitHub Actions Inactive
Copy link
Contributor

@philerooski philerooski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing 🤯

I'm curious, how does this play out in staging vs main namespace in the production account? When we merge a PR does the event trigger get configured in staging? This would cause any new data that comes in to be processed by staging as well, which I don't think is desirable.

BUT, maybe that's not necessarily the case. If we don't trigger staging's s3_event_config lambda in the Github CI/CD then the S3 event trigger never gets configured for staging, right?

src/lambda_function/s3_event_config/app.py Outdated Show resolved Hide resolved
@rxu17
Copy link
Contributor Author

rxu17 commented May 12, 2023

Amazing 🤯

I'm curious, how does this play out in staging vs main namespace in the production account? When we merge a PR does the event trigger get configured in staging? This would cause any new data that comes in to be processed by staging as well, which I don't think is desirable.

BUT, maybe that's not necessarily the case. If we don't trigger staging's s3_event_config lambda in the Github CI/CD then the S3 event trigger never gets configured for staging, right?

Yes, as long as we don't trigger staging's s3_event_config lambda in the Github CI/CD, the S3 event trigger should not get triggered for any data that gets put into the staging namespace folder of the input data bucket.

@rxu17
Copy link
Contributor Author

rxu17 commented May 13, 2023

From my research (see ticket for more info), there doesn't exist a way to allow the S3 to JSON lambda to process multiple S3 object notifications as a group besides implementing the SQS architecture. I've created another ticket ETL-450 to explore this.

@rxu17
Copy link
Contributor Author

rxu17 commented May 13, 2023

Also, I tested this on copying one dataset (using the recover pilot datasets) at a time to the bucket (and letting the glue workflow run all the way through), and I wasn't able to reproduce the random JSON to Parquet job errors I got previously. I feel like that's likely due to one or more of the following:

  • I didn't upload/copy more than one file to the bucket at a time and waited for the glue workflow to run completely
  • I only had S3 notification happen when a file gets uploaded to the specific namespace of the input data bucket
  • Glue was being weird

@rxu17 rxu17 marked this pull request as ready for review May 15, 2023 18:45
@rxu17 rxu17 requested a review from a team as a code owner May 15, 2023 18:45
@rxu17 rxu17 temporarily deployed to develop May 15, 2023 22:04 — with GitHub Actions Inactive
@rxu17 rxu17 temporarily deployed to develop May 15, 2023 22:05 — with GitHub Actions Inactive
@rxu17 rxu17 requested a review from philerooski May 15, 2023 22:09
@@ -16,3 +16,4 @@ synapseclient = "~=2.7"
pandas = "<1.5"
moto = "~=4.1"
datacompy = "~=0.8"
docker = "~=6.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's dependency of using mock_lambda from moto3 but for some reason installing moto3 doesn't automatically install docker as a dependency

Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM!

@rxu17 rxu17 merged commit 4dbff17 into main May 16, 2023
@rxu17 rxu17 deleted the etl-397 branch May 16, 2023 17:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants