Making GCS -> S3 pipeline #6
This sounds like a simple and clever solution. I've pinged our Google public dataset reps (Shane Glass and Michael Hamamoto Tribble) and asked them to weigh in on this issue.
Hey Charles, I love this solution! Would it be helpful if I set up a Pub/Sub topic for the bucket that you could subscribe to?
Yes, that would be great! Do you know if it is possible to subscribe to Pub/Sub topics on AWS SNS, or would we need to make a handler through Cloud Functions to send messages?
That's a great question, I'm not sure. You can subscribe to this Pub/Sub topic for updates!
Thanks Shane, I'll check that out! Another potential solution proposed in the pangeo-forge meeting was a daily cron job, running through GHA, which uses rclone to synchronize the data incrementally. Here is an example used at CarbonPlan to sync between GCS and Azure. Some important factors to consider for this solution are:
For now, I'll play around with rclone to get a sense of the scale of the differences between S3 and GCS since the initial upload.
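For concreteness, here is a minimal sketch of what that daily rclone sync step could look like, wrapped in Python so it could be driven from a GHA cron workflow (in practice it would more likely be a shell step); the rclone remote names and the destination bucket name are placeholders, not the actual configuration.

```python
# Minimal sketch of a daily rclone sync step, e.g. driven by a GHA cron workflow.
# The remote names ("gcs", "s3") and destination bucket ("cmip6-pds") are
# placeholders and would need to match the actual rclone configuration.
import subprocess

def sync_cmip6() -> None:
    subprocess.run(
        [
            "rclone", "sync",
            "gcs:cmip6",          # source: the CMIP6 bucket on GCS
            "s3:cmip6-pds",       # destination: hypothetical S3 bucket
            "--checksum",         # compare checksums instead of mod-times/sizes
            "--transfers", "32",  # parallel transfers
            "--fast-list",        # fewer listing calls on very large buckets
            "--log-level", "INFO",
        ],
        check=True,
    )

if __name__ == "__main__":
    sync_cmip6()
```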
After playing around with the GHA cron jobs, I think the original method is definitely more efficient and sustainable - with the cron jobs, I've been able to get around 6 TB of data transferred per day. Moving forward with the original plan (Pub/Sub -> AWS Batch/Lambda backup operations), I think we will need either Cloud Functions or Cloud Run, triggered by the Pub/Sub topic, in order to send job instructions to AWS. Playing around with the CMIP6 service account, it seems like additional IAM permissions would be needed to deploy a Cloud Function - @shanecglass, do you have any advice on this? Should we add permissions to the existing account or make another specifically for this purpose?
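To make the Cloud Function/Run idea concrete, here is a rough sketch of a Pub/Sub-triggered Cloud Function that forwards the changed object to an AWS Batch job. The job queue/definition names and region are placeholders, and it assumes AWS credentials are somehow made available to boto3, which is exactly the permissions question being discussed.

```python
# Sketch of a Pub/Sub-triggered Cloud Function (Python runtime) that forwards
# the name of a changed object to AWS Batch. Job queue/definition names and the
# region are placeholders; AWS credentials are assumed to be available to boto3
# (e.g. via environment variables), which is part of the open question here.
import boto3

batch = boto3.client("batch", region_name="us-west-2")  # region is an assumption

def handle_gcs_event(event, context):
    """Background Cloud Function entry point for GCS Pub/Sub notifications."""
    # GCS notifications carry object metadata in the Pub/Sub message attributes;
    # the full object resource is also available base64-encoded in event["data"].
    attrs = event.get("attributes", {})
    bucket = attrs.get("bucketId", "")
    name = attrs.get("objectId", "")

    batch.submit_job(
        jobName="cmip6-sync",            # placeholder
        jobQueue="cmip6-sync-queue",     # placeholder
        jobDefinition="cmip6-sync-job",  # placeholder
        containerOverrides={
            "environment": [
                {"name": "SRC_BUCKET", "value": bucket},
                {"name": "SRC_OBJECT", "value": name},
            ]
        },
    )
```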
Hey Charles, it would be helpful to better understand your architecture, but generally speaking, the best practice is to have single-purpose service accounts. So for each Cloud Run instance or each Cloud Function, you would want a separate service account to interact with these. However, if all of these steps are within the same Cloud Run instance, you'd be fine using the same service account.
In general, the idea on the GCS side of things would be:
Then, on the AWS side of things:
From my understanding, this should only require one Cloud Function/Run instance, although I can't really give solid numbers as to how often it would be invoked, as the amount of data changed seems to vary irregularly.
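To make the AWS side concrete, the copy worker could look roughly like the sketch below, written here as a Lambda handler that streams a single object from the public GCS bucket into S3; the destination bucket name and the event field names are assumptions for illustration, and whether this ends up as Lambda or a Batch job is still open.

```python
# Rough sketch of the AWS-side copy worker, written here as a Lambda handler.
# The destination bucket name and the event field names are assumptions.
import urllib.request

import boto3

s3 = boto3.client("s3")
DEST_BUCKET = "cmip6-pds"  # placeholder destination bucket

def handler(event, context):
    # Assume the object key is forwarded from the GCP side,
    # e.g. {"object": "CMIP6/.../file.nc"}.
    key = event["object"]
    src_url = f"https://storage.googleapis.com/cmip6/{key}"

    # Stream the public GCS object straight into S3 without writing to disk.
    with urllib.request.urlopen(src_url) as resp:
        s3.upload_fileobj(resp, DEST_BUCKET, key)

    return {"copied": key}
```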
Some updates to this pipeline, based on some chats with people from NOAA who have a similar pipeline working. On the Google Cloud side of things:
On the AWS side of things:
In either case, the biggest question is whether we can use the Secret Manager to store AWS credentials - the API Gateway would need to be public, and authentication would be important to ensure that it is not abused.
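For the credential question specifically, reading AWS credentials out of Secret Manager from within the Cloud Function/Run service could look roughly like the sketch below; the project ID and secret name are placeholders, and whether this is an acceptable pattern (versus an authenticated call to the API Gateway) is the open question.

```python
# Sketch of fetching AWS credentials from Google Secret Manager inside a
# Cloud Function/Run service and handing them to boto3. Project ID and secret
# name are placeholders; the secret is assumed to hold a small JSON blob.
import json

import boto3
from google.cloud import secretmanager

def aws_client_from_secret(service="batch",
                           project_id="my-gcp-project",        # placeholder
                           secret_id="aws-cmip6-sync-creds"):  # placeholder
    sm = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/latest"
    payload = sm.access_secret_version(request={"name": name}).payload.data
    creds = json.loads(payload.decode("utf-8"))

    return boto3.client(
        service,
        aws_access_key_id=creds["access_key_id"],
        aws_secret_access_key=creds["secret_access_key"],
        region_name=creds.get("region", "us-west-2"),
    )
```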
Since completing the initial copy of CMIP6 from GCS to S3, I have been stuck on how to keep the two datasets synchronized moving forward. After getting in touch with our contact at Amazon, I was given the following strategy:
- Set up Pub/Sub notifications on gs://cmip6 to capture object creation events (a sketch of this step is included at the end of this post)

The main obstacle I foresee here is that I am unsure what permissions we have to set up Pub/Sub notifications for the copy of CMIP6 on GCS; @rabernat, do we have a point of contact for Google's public dataset program? Beyond this, the other issue to consider is that all catalog/collection files in the root of the bucket will have to be altered after they are uploaded to/updated on S3, as they all use URLs pointing to the GCS bucket.
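That catalog/collection rewrite could look roughly like the sketch below; the S3 bucket name and the URL patterns being swapped are assumptions for illustration only.

```python
# Rough sketch of mirroring a root-level catalog/collection file to S3 while
# swapping its GCS URLs for S3 ones. Bucket name and URL patterns are assumptions.
import boto3
from google.cloud import storage

GCS_PREFIX = "https://storage.googleapis.com/cmip6"  # assumed URL pattern in catalogs
S3_PREFIX = "https://cmip6-pds.s3.amazonaws.com"     # placeholder S3 bucket URL

gcs = storage.Client()
s3 = boto3.client("s3")

def mirror_catalog(blob_name, dest_bucket="cmip6-pds"):
    """Copy one catalog/collection JSON file to S3 with its URLs rewritten."""
    text = gcs.bucket("cmip6").blob(blob_name).download_as_text()
    rewritten = (
        text.replace(GCS_PREFIX, S3_PREFIX)
            .replace("gs://cmip6", f"s3://{dest_bucket}")
    )
    s3.put_object(
        Bucket=dest_bucket,
        Key=blob_name,
        Body=rewritten.encode("utf-8"),
        ContentType="application/json",
    )
```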
Once this is done, all that would remain is syncing the data that was altered in the time between the initial upload and the creation of this pipeline.
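For reference, here is a minimal sketch of the notification setup from the strategy above, assuming we are granted the needed permissions on the bucket; the topic name is a placeholder.

```python
# Minimal sketch of attaching a Pub/Sub notification configuration to gs://cmip6
# so that object creation events are published to a topic. The topic name is a
# placeholder, and this requires bucket permissions that we may not yet have.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("cmip6")

notification = bucket.notification(
    topic_name="cmip6-object-updates",  # placeholder topic
    payload_format="JSON_API_V1",
    event_types=["OBJECT_FINALIZE"],    # fires on object creation/overwrite
)
notification.create()
```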