Making GCS -> S3 pipeline #6
This sounds like a simple and clever solution. I've pinged our Google public dataset reps (Shane Glass and Michael Hamamoto Tribble) and asked them to weigh in on this issue.
Hey Charles, I love this solution! Would it be helpful if I set up a Pub/Sub topic for the bucket that you could subscribe to?
Yes, that would be great! Do you know if it is possible to subscribe to Pub/Sub topics on AWS SNS, or would we need to make a handler through Cloud Functions to send messages?
That's a great question, I'm not sure. You can subscribe to this Pub/Sub topic for updates!
Thanks Shane, I'll check that out! Another potential solution proposed in the pangeo-forge meeting was a daily cron job, running through GHA, which uses rclone to synchronize the data incrementally. Here is an example used at CarbonPlan to sync between GCS and Azure. Some important factors to consider for this solution are:
For now, I'll play around with rclone to get a sense of the scale of the differences between S3 and GCS since the initial upload.
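For concreteness, here is a minimal sketch of what that daily rclone sync step could look like, wrapped in Python so it could be driven from a GHA cron workflow (in practice it would more likely be a shell step); the rclone remote names and the destination bucket name are placeholders, not the actual configuration.

```python
# Minimal sketch of a daily rclone sync step, e.g. driven by a GHA cron workflow.
# The remote names ("gcs", "s3") and destination bucket ("cmip6-pds") are
# placeholders and would need to match the actual rclone configuration.
import subprocess

def sync_cmip6() -> None:
    subprocess.run(
        [
            "rclone", "sync",
            "gcs:cmip6",          # source: the CMIP6 bucket on GCS
            "s3:cmip6-pds",       # destination: hypothetical S3 bucket
            "--checksum",         # compare checksums instead of mod-times/sizes
            "--transfers", "32",  # parallel transfers
            "--fast-list",        # fewer listing calls on very large buckets
            "--log-level", "INFO",
        ],
        check=True,
    )

if __name__ == "__main__":
    sync_cmip6()
```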
After playing around with the GHA cron jobs, I think the original method is definitely more efficient and sustainable - with the cron jobs, I've been able to get around 6 TB of data transferred per day. Moving forward with the original plan (Pub/Sub -> AWS Batch/Lambda backup operations), I think we will need either Cloud Functions or Cloud Run, triggered by the Pub/Sub topic, in order to send job instructions to AWS. Playing around with the CMIP6 service account, it seems like additional IAM permissions would be needed to deploy a Cloud Function - @shanecglass, do you have any advice on this? Should we add permissions to the existing account or make another specifically for this purpose?
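To make the Cloud Function/Run idea concrete, here is a rough sketch of a Pub/Sub-triggered Cloud Function that forwards the changed object to an AWS Batch job. The job queue/definition names and region are placeholders, and it assumes AWS credentials are somehow made available to boto3, which is exactly the permissions question being discussed.

```python
# Sketch of a Pub/Sub-triggered Cloud Function (Python runtime) that forwards
# the name of a changed object to AWS Batch. Job queue/definition names and the
# region are placeholders; AWS credentials are assumed to be available to boto3
# (e.g. via environment variables), which is part of the open question here.
import boto3

batch = boto3.client("batch", region_name="us-west-2")  # region is an assumption

def handle_gcs_event(event, context):
    """Background Cloud Function entry point for GCS Pub/Sub notifications."""
    # GCS notifications carry object metadata in the Pub/Sub message attributes;
    # the full object resource is also available base64-encoded in event["data"].
    attrs = event.get("attributes", {})
    bucket = attrs.get("bucketId", "")
    name = attrs.get("objectId", "")

    batch.submit_job(
        jobName="cmip6-sync",            # placeholder
        jobQueue="cmip6-sync-queue",     # placeholder
        jobDefinition="cmip6-sync-job",  # placeholder
        containerOverrides={
            "environment": [
                {"name": "SRC_BUCKET", "value": bucket},
                {"name": "SRC_OBJECT", "value": name},
            ]
        },
    )
```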
Hey Charles, it would be helpful to better understand your architecture, but generally speaking, the best practice is to have single-purpose service accounts. So for each Cloud Run instance or each Cloud Function, you would want a separate service account to interact with these. However, if all of these steps are within the same Cloud Run instance, you'd be fine using the same service account.
In general, the idea on the GCS side of things would be:
Then, on the AWS side of things:
From my understanding, this should only require one Cloud Function/Run instance, although I can't really give solid numbers as to how often it would be invoked, as the amount of data changed seems to vary irregularly.
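To make the AWS side concrete, the copy worker could look roughly like the sketch below, written here as a Lambda handler that streams a single object from the public GCS bucket into S3; the destination bucket name and the event field names are assumptions for illustration, and whether this ends up as Lambda or a Batch job is still open.

```python
# Rough sketch of the AWS-side copy worker, written here as a Lambda handler.
# The destination bucket name and the event field names are assumptions.
import urllib.request

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "cmip6-pds"  # placeholder destination bucket

def handler(event, context):
    # Assume the object key is forwarded from the GCP side,
    # e.g. {"object": "CMIP6/.../file.nc"}.
    key = event["object"]
    src_url = f"https://storage.googleapis.com/cmip6/{key}"

    # Stream the public GCS object straight into S3 without writing to disk.
    with urllib.request.urlopen(src_url) as resp:
        s3.upload_fileobj(resp, DEST_BUCKET, key)

    return {"copied": key}
```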
Some updates to this pipeline, based on some chats with people from NOAA who have a similar pipeline working. On the Google Cloud side of things:
On the AWS side of things:
In either case, the biggest question is whether we can use the Secret Manager to store AWS credentials - the API Gateway would need to be public, and authentication would be important to ensure that it is not abused.
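For the credential question specifically, reading AWS credentials out of Secret Manager from within the Cloud Function/Run service could look roughly like the sketch below; the project ID and secret name are placeholders, and whether this is an acceptable pattern (versus an authenticated call to the API Gateway) is the open question.

```python
# Sketch of fetching AWS credentials from Google Secret Manager inside a
# Cloud Function/Run service and handing them to boto3. Project ID and secret
# name are placeholders; the secret is assumed to hold a small JSON blob.
import json

import boto3
from google.cloud import secretmanager

def aws_client_from_secret(service="batch",
                           project_id="my-gcp-project",        # placeholder
                           secret_id="aws-cmip6-sync-creds"):  # placeholder
    sm = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    payload = sm.access_secret_version(request={"name": name}).payload.data
    creds = json.loads(payload.decode("utf-8"))

    return boto3.client(
        service,
        aws_access_key_id=creds["access_key_id"],
        aws_secret_access_key=creds["secret_access_key"],
        region_name=creds.get("region", "us-west-2"),
    )
```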
Since completing the initial copy of CMIP6 from GCS to S3, I have been stuck on how to keep the two datasets synchronized moving forward. After getting in touch with our contact at Amazon, I was given the following strategy:
- Set up Pub/Sub notifications on gs://cmip6 to capture object creation events (a sketch of this step is included at the end of this post)

The main obstacle I foresee here is that I am unsure what permissions we have to set up Pub/Sub notifications for the copy of CMIP6 on GCS; @rabernat, do we have a point of contact for Google's public dataset program? Beyond this, the other issue to consider is that all catalog/collection files in the root of the bucket will have to be altered after they are uploaded to/updated on S3, as they all use URLs pointing to the GCS bucket.
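That catalog/collection rewrite could look roughly like the sketch below; the S3 bucket name and the URL patterns being swapped are assumptions for illustration only.

```python
# Rough sketch of mirroring a root-level catalog/collection file to S3 while
# swapping its GCS URLs for S3 ones. Bucket name and URL patterns are assumptions.
import boto3
from google.cloud import storage

GCS_PREFIX = "https://storage.googleapis.com/cmip6"  # assumed URL pattern in catalogs
S3_PREFIX = "https://cmip6-pds.s3.amazonaws.com"     # placeholder S3 bucket URL

gcs = storage.Client()
s3 = boto3.client("s3")

def mirror_catalog(blob_name, dest_bucket="cmip6-pds"):
    """Copy one catalog/collection JSON file to S3 with its URLs rewritten."""
    text = gcs.bucket("cmip6").blob(blob_name).download_as_text()
    rewritten = (
        text.replace(GCS_PREFIX, S3_PREFIX)
            .replace("gs://cmip6", f"s3://{dest_bucket}")
    )
    s3.put_object(
        Bucket=dest_bucket,
        Key=blob_name,
        Body=rewritten.encode("utf-8"),
        ContentType="application/json",
    )
```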
Once this is done, all that would remain is syncing the data that was altered in the time between the initial upload and the creation of this pipeline.
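For reference, here is a minimal sketch of the notification setup from the strategy above, assuming we are granted the needed permissions on the bucket; the topic name is a placeholder.

```python
# Minimal sketch of attaching a Pub/Sub notification configuration to gs://cmip6
# so that object creation events are published to a topic. The topic name is a
# placeholder, and this requires bucket permissions that we may not yet have.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("cmip6")

notification = bucket.notification(
    topic_name="cmip6-object-updates",  # placeholder topic
    payload_format="JSON_API_V1",
    event_types=["OBJECT_FINALIZE"],    # fires on object creation/overwrite
)
notification.create()
```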