Migrate master database to BigQuery (instead of CSV) #2
We currently use .csv files as our master source of truth on what is in the CMIP6 cloud archive.

A more robust and cloud-native way to do this would be to use BigQuery, Google Cloud's database product, to store this information. Then we could run SQL queries on the database, rather than downloading a big CSV file every time.

@charlesbluca, could you play around with exporting the CSV into BigQuery?

@naomi-henderson, can you summarize the process we are currently using to keep this CSV file up to date?

Comments
I am proposing to use BigQuery as the master, but we could still produce a CSV file for convenience. A nightly cron job would do the trick.
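As a rough sketch of what that nightly export could look like with the BigQuery Python client (the table ID, bucket, and file name below are placeholders, not settled names):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; the real project/dataset/table and bucket may differ.
table_id = "pangeo-cmip6.cmip6_catalog.master"
destination_uri = "gs://cmip6/cmip6-zarr-consolidated-stores.csv"

# Export the whole table as a CSV object in GCS (CSV is the default format).
extract_job = client.extract_table(table_id, destination_uri)
extract_job.result()  # wait for the export to finish
```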
Sounds great! The way I am making the CSV file now is very inefficient and not scalable, so BigQuery sounds very promising. Here is a summary (sorry about the length) of how we are keeping the CSV file up to date. The original method is what we used to run directly in the cloud on the old Pangeo GC collection. Original method:
But then we wanted two catalogs: one with ALL of the zarr stores and another with just those without serious issues reported to the ES-DOC errata pages. Newer method:
Latest method: I am now including the […]. The current […].
It is very useful to get all this info documented in public. Length == detail! It's good!
Once we move to BigQuery, presumably we could handle this by just having an extra boolean column, e.g.:
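For illustration only, with a made-up column name (`has_errata`) and a placeholder table ID, the "clean" catalog could then just be a filtered query:

```python
from google.cloud import bigquery

client = bigquery.Client()

# `has_errata` is a hypothetical column name, used here only to illustrate the idea.
query = """
    SELECT *
    FROM `pangeo-cmip6.cmip6_catalog.master`   -- placeholder table name
    WHERE has_errata = FALSE
"""
clean_catalog = client.query(query).to_dataframe()
```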
Is there a 1:1 mapping between what we call "Dataset ID" (the 8-value tuple) and an ESGF version_id? One thing we might want to consider in using BigQuery is the concept of nested arrays: we could have a column for which each row contains a list. For example, for each row (i.e. each dataset), we could have a column called […]. We could also have timestamps; one way both ideas could be declared is sketched below.
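A small sketch of how repeated (array-valued) and timestamp columns are declared with the BigQuery Python client; the column names `netcdf_tracking_ids` and `date_added` are illustrations, not decisions:

```python
from google.cloud import bigquery

# Illustrative column names; the real schema is still under discussion.
nested_and_time_fields = [
    # mode="REPEATED" makes each row hold a list of values.
    bigquery.SchemaField("netcdf_tracking_ids", "STRING", mode="REPEATED"),
    # A per-row timestamp, e.g. when the zarr store was added to the catalog.
    bigquery.SchemaField("date_added", "TIMESTAMP"),
]
```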
No, in general each 8-tuple […]. I initially chose to save tracking_ids instead of version_id in the dataset metadata. There is a dataset tracking_id (different from the netcdf tracking_ids) and there is (supposed to be, but at least one modeling center did not do this correctly) a 1:1 mapping between the dataset tracking_ids and version_ids. Users do not seem to know how to find version_id from tracking_id (of either sort), so I am now adding them explicitly to the catalog. Nested arrays would be good for the netcdf tracking_ids as well. Fortunately, our current tracking_ids allow us to go back through all of the data and re-create the […]. Thanks so much for doing this - it is a tremendous relief to know there is a better solution in the works!
OK, I kinda understand! 🤯 This discussion eventually needs to lead to us defining a BigQuery schema for the table. This can be done via a JSON file, e.g.:
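Something along these lines, purely as an illustration (not a draft of the actual schema): the field names mirror the current CSV columns plus the extras discussed above, and the file name and modes are assumptions.

```python
import json

# Illustrative only: names follow the catalog columns discussed in this thread.
schema = [
    {"name": "activity_id",         "type": "STRING",    "mode": "REQUIRED"},
    {"name": "institution_id",      "type": "STRING",    "mode": "REQUIRED"},
    {"name": "source_id",           "type": "STRING",    "mode": "REQUIRED"},
    {"name": "experiment_id",       "type": "STRING",    "mode": "REQUIRED"},
    {"name": "member_id",           "type": "STRING",    "mode": "REQUIRED"},
    {"name": "table_id",            "type": "STRING",    "mode": "REQUIRED"},
    {"name": "variable_id",         "type": "STRING",    "mode": "REQUIRED"},
    {"name": "grid_label",          "type": "STRING",    "mode": "REQUIRED"},
    {"name": "zstore",              "type": "STRING",    "mode": "REQUIRED"},
    {"name": "version_id",          "type": "STRING",    "mode": "NULLABLE"},
    {"name": "netcdf_tracking_ids", "type": "STRING",    "mode": "REPEATED"},
    {"name": "has_errata",          "type": "BOOLEAN",   "mode": "NULLABLE"},
    {"name": "date_added",          "type": "TIMESTAMP", "mode": "NULLABLE"},
]

# Write it out in BigQuery's JSON schema-file format (a list of field objects).
with open("cmip6_schema.json", "w") as f:
    json.dump(schema, f, indent=2)
```

If this format looks right, a file like this could then be read back with the Python client's schema_from_json helper when we create the table.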
This will be done via a PR to this repo. I would love it if @charlesbluca could take a stab at a first draft for this schema. Then we can continue to discuss / refine via the PR. Charles, is this something you have time for?
Yes! I'm currently looking into the documentation myself -- I'll begin drafting out a schema definition.
Opened up a PR with my first draft.
Now that we've imported the table to BigQuery, I'll start looking into writing scripts to make additions to it - @naomi-henderson, do you have any scripts I could look at to give a sense of what functionality we need?
Well, as I explained above, I don't add datasets one-by-one, so I don't really have a script for you. I guess we need to give BigQuery the various pieces of information for each new dataset; a rough sketch of the sort of thing I mean is below.
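A rough sketch only, with a hypothetical store path, and assuming (which may not hold in practice) that the catalog columns can be read straight from the zarr store's global attributes:

```python
import gcsfs
import xarray as xr

# Hypothetical store path, for illustration only.
zstore = "gs://cmip6/CMIP/NCAR/CESM2/historical/r1i1p1f1/Amon/tas/gn/"

fs = gcsfs.GCSFileSystem(token="anon")
ds = xr.open_zarr(fs.get_mapper(zstore), consolidated=True)

# Assume the "Dataset ID" columns can be filled from the global attributes;
# in practice some (e.g. member_id, version_id) may need to be derived.
row = {key: ds.attrs.get(key) for key in [
    "activity_id", "institution_id", "source_id", "experiment_id",
    "member_id", "table_id", "variable_id", "grid_label",
]}
row["zstore"] = zstore

# The netcdf tracking_ids could go into a repeated (array) column.
row["netcdf_tracking_ids"] = str(ds.attrs.get("tracking_id", "")).split("\n")
```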
Here are my relevant notebooks, but I don't think they will be very helpful: […]
Opened up a PR #5 with some simple scripts that allow rows to be inserted into the BigQuery table.
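The actual scripts live in the PR; just to keep the thread self-contained, a minimal streaming insert with the Python client looks roughly like this, with a placeholder table ID and an abbreviated, made-up example row:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "pangeo-cmip6.cmip6_catalog.master"  # placeholder, not the real table ID

row = {
    "activity_id": "CMIP", "institution_id": "NCAR", "source_id": "CESM2",
    "experiment_id": "historical", "member_id": "r1i1p1f1",
    "table_id": "Amon", "variable_id": "tas", "grid_label": "gn",
    "zstore": "gs://cmip6/.../Amon/tas/gn/",  # illustrative path
}

# insert_rows_json returns a list of per-row errors; an empty list means success.
errors = client.insert_rows_json(table_id, [row])
if errors:
    raise RuntimeError(f"Insert failed: {errors}")
```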
@naomi-henderson, were you able to check out these scripts? They show relatively simple examples of inserting into the BigQuery table, but it wouldn't be too difficult to expand upon them if need be.