This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Making GCS -> S3 pipeline #6

Open
charlesbluca opened this issue Sep 3, 2020 · 9 comments

Comments

@charlesbluca
Member

Since completing the initial copy of CMIP6 from GCS to S3, I have been stuck on how to keep the two datasets synchronized moving forward. After reaching out to our contact at Amazon, I was given the following strategy:

  • Turn on Pub/Sub notifications for gs://cmip6 to capture object creation events (see the sketch after this list)
  • Write a handler for the notifications that sends them to AWS as AWS Batch/Lambda jobs (depending on individual object size)
  • Write a Batch/Lambda container to receive the messages and copy files from GCS to S3
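
For reference, here's a minimal sketch of what the first step could look like using the google-cloud-storage client, assuming we had the necessary permissions on the bucket; the project ID and topic name below are placeholders:

```python
from google.cloud import storage  # pip install google-cloud-storage

# Placeholder project/topic names; the real values depend on who owns gs://cmip6.
client = storage.Client(project="some-gcp-project")
bucket = client.bucket("cmip6")

# Publish OBJECT_FINALIZE events (new or overwritten objects) to a Pub/Sub topic
# so a downstream handler can forward them to AWS.
notification = bucket.notification(
    topic_name="cmip6-updates",
    event_types=["OBJECT_FINALIZE"],
    payload_format="JSON_API_V1",
)
notification.create()
```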

The main obstacle I foresee here is that I am unsure whether we have the permissions needed to set up Pub/Sub notifications for the copy of CMIP6 on GCS; @rabernat, do we have a point of contact for Google's public dataset program? Beyond this, the other issue to consider is that all catalog/collection files in the root of the bucket will have to be altered after they are uploaded to/updated on S3, as they all use URLs pointing to the GCS bucket.

Once this is in place, the only remaining step would be to sync up the data altered in the time between the initial upload and the creation of this pipeline.

@rabernat
Contributor

rabernat commented Sep 3, 2020

This sounds like a simple and clever solution.

I've pinged our Google public dataset reps (Shane Glass and Michael Hamamoto Tribble) and asked them to weigh in on this issue.

@shanecglass

Hey Charles,

I love this solution! Would it be helpful if I set up a pub/sub topic for the bucket that you could subscribe to?

@charlesbluca
Member Author

Yes, that would be great! Do you know if it is possible to subscribe to Pub/Sub topics from AWS SNS, or would we need to make a handler through Cloud Functions to send messages?

@shanecglass

shanecglass commented Sep 11, 2020

That's a great question, I'm not sure.

You can subscribe to this Pub/Sub topic for updates!
projects/gcp-public-data-noaa-cmip6/topics/cmip6-updates
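
If it helps, here's a rough sketch of attaching a pull subscription to that topic with the google-cloud-pubsub client; the project and subscription names below are placeholders, and this assumes the topic's IAM allows us to attach a subscription:

```python
from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1  # pip install google-cloud-pubsub

# The topic Shane shared; the subscription lives in whatever project we control.
topic = "projects/gcp-public-data-noaa-cmip6/topics/cmip6-updates"
subscription = "projects/our-project/subscriptions/cmip6-updates-sub"

subscriber = pubsub_v1.SubscriberClient()
subscriber.create_subscription(name=subscription, topic=topic)

def callback(message):
    # Each message is a GCS object-change notification.
    print(message.attributes.get("objectId"), message.attributes.get("eventType"))
    message.ack()

with subscriber:
    future = subscriber.subscribe(subscription, callback=callback)
    try:
        future.result(timeout=60)  # listen briefly, then stop
    except TimeoutError:
        future.cancel()
```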

@charlesbluca
Member Author

Thanks Shane, I'll check that out!

Another potential solution proposed in the pangeo-forge meeting was a daily cron job, running through GHA, which uses rclone to synchronize the data incrementally. Here is an example used at CarbonPlan to sync between GCS and Azure.
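
As a rough Python sketch of what the sync step might boil down to (in practice it would likely just be an rclone invocation in the workflow file; the remote and destination bucket names below are assumptions):

```python
import subprocess

# Assumed rclone remote names; "gcs:cmip6" is the public GCS bucket and
# "s3:cmip6-mirror" is a placeholder for the S3 copy.
SRC = "gcs:cmip6"
DST = "s3:cmip6-mirror"

# --fast-list cuts down on listing requests for buckets with many objects;
# --checksum skips objects whose checksums already match on both sides.
subprocess.run(
    ["rclone", "sync", SRC, DST, "--fast-list", "--checksum", "--transfers", "32"],
    check=True,
)
```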

Some important factors to consider for this solution are:

  • Number of objects to keep in sync; the CarbonPlan Action isn't moving Zarr stores, so the rclone command may need to be altered to reflect the number of objects being copied
  • Rate of data being added to GCS; the frequency of the cron job may not keep up with the amount of data being added

For now, I'll play around with rclone to get a sense of the scale of the differences between S3 and GCS since the initial upload.

@charlesbluca
Member Author

After playing around with the GHA cron jobs, I think the original method is definitely more efficient and sustainable. With the cron jobs, I've been able to transfer around 6 TB of data daily through rclone sync, but no operation has been able to check the entirety of the two buckets to ensure that they are fully synchronized, so that doesn't seem like a good option once we are only trying to track and keep up with small changes.

Moving forward with the original plan (Pub/Sub -> AWS Batch/Lambda backup operations), I think we will need either Cloud Functions or Cloud Run, triggered by the Pub/Sub topic, in order to send job instructions to AWS. Playing around with the CMIP6 service account, it seems like additional IAM permissions would be needed to deploy a Cloud Function - @shanecglass, do you have any advice on this? Should we add permissions to the existing account or make another specifically for this purpose?
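
To make the Cloud Function piece concrete, here's a rough sketch of a Pub/Sub-triggered background function that forwards the notification to AWS; the Lambda name is a placeholder, and this assumes AWS credentials are made available to the function:

```python
import base64
import json

import boto3  # AWS credentials would need to be provided to the function

lambda_client = boto3.client("lambda", region_name="us-east-1")

def forward_notification(event, context):
    """Background Cloud Function triggered by the bucket's Pub/Sub topic."""
    # The Pub/Sub message data is the base64-encoded GCS object metadata.
    record = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    payload = {
        "bucket": record.get("bucket"),
        "name": record.get("name"),
        "eventType": event.get("attributes", {}).get("eventType"),
    }
    # "cmip6-copy" is a placeholder Lambda name; InvocationType="Event" makes
    # the call asynchronous so the function returns immediately.
    lambda_client.invoke(
        FunctionName="cmip6-copy",
        InvocationType="Event",
        Payload=json.dumps(payload).encode("utf-8"),
    )
```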

@shanecglass

Hey Charles, it would be helpful to better understand your architecture, but generally speaking, the best practice is to have single-purpose service accounts. So for each Cloud Run instance or each Cloud Function, you would want a separate service account to interact with these. However, if all of these steps are within the same Cloud Run instance, you'd be fine using the same service account.

@charlesbluca
Member Author

In general, the idea on the GCS side of things would be:

  • Pub/Sub topic posts update on new/changed/deleted file
  • A Cloud Function triggered by the Pub/Sub topic uses this update to submit a job to AWS Batch/Lambda via HTTP

Then, on the AWS side of things:

  • An AWS Batch job would be defined with parameters to either create, update, or delete files in the S3 CMIP6 bucket
  • Upon getting an HTTP request from the Cloud Function, the job would update the necessary files

From my understanding, this should only require one Cloud Function/Run instance, although I can't really give solid numbers as to how often it would be invoked, as the amount of data changed seems to vary irregularly.
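
As a sketch of what the AWS-side job could look like (written here as a Lambda handler, though the same logic applies to a Batch container; the destination bucket name is a placeholder):

```python
import boto3
from google.cloud import storage  # bundled into the Lambda/Batch image

s3 = boto3.client("s3")
gcs = storage.Client.create_anonymous_client()  # gs://cmip6 is publicly readable

DEST_BUCKET = "cmip6-mirror"  # placeholder S3 bucket name

def handler(event, context):
    """Create, update, or delete one object based on the forwarded notification."""
    bucket, key, action = event["bucket"], event["name"], event["eventType"]
    if action == "OBJECT_DELETE":
        s3.delete_object(Bucket=DEST_BUCKET, Key=key)
    else:
        # OBJECT_FINALIZE covers both newly created and overwritten objects.
        blob = gcs.bucket(bucket).blob(key)
        s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=blob.download_as_bytes())
```

For very large objects this in-memory copy wouldn't cut it, which is where splitting on object size and handing the bigger files to Batch (as in the original plan) would come in.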

@charlesbluca
Member Author

charlesbluca commented Nov 4, 2020

Some updates to this pipeline, based on chats with people from NOAA who have a similar pipeline working. On the Google Cloud side of things:

  • Pub/Sub topic for the bucket with a push subscription to an AWS API Gateway
  • AWS credentials stored using Secret Manager

On the AWS side of things:

  • An API Gateway handler to send the bucket updates off to either:
    • An AWS Lambda function to directly create, update, or delete the files in question on S3
    • An AWS SQS queue, which would hold the Lambda jobs in a simple queue until they are executed

In either case, the biggest question is whether we can use Secret Manager to store the AWS credentials - the API Gateway would need to be public, and authentication would be important to ensure that it is not abused.
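
For reference, a rough sketch of the SQS route, assuming the push subscription appends a shared token to the endpoint URL (with that token kept in Secret Manager on the GCP side); the queue URL and environment variable names are placeholders:

```python
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["COPY_QUEUE_URL"]   # placeholder SQS queue for copy jobs
PUSH_TOKEN = os.environ["PUSH_TOKEN"]      # shared secret the push endpoint expects

def handler(event, context):
    """Lambda proxy handler behind the public API Gateway endpoint."""
    # Reject pushes that don't carry the shared token; a production setup could
    # instead verify the OIDC token Pub/Sub can attach to push deliveries.
    params = event.get("queryStringParameters") or {}
    if params.get("token") != PUSH_TOKEN:
        return {"statusCode": 403, "body": "forbidden"}

    envelope = json.loads(event["body"])  # Pub/Sub push envelope
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(envelope["message"]))
    return {"statusCode": 204, "body": ""}
```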
