
Migrate master database to bigquery (instead of CSV) #2

rabernat opened this issue Jun 18, 2020 · 11 comments

@rabernat
Contributor

We currently use .csv files as our master source of truth on what is in the CMIP6 cloud archive:

A more robust and cloud-native way to do this would be to use BigQuery, Google Cloud's database product, to store this information. Then we could run SQL queries on the database, rather than downloading a big CSV file every time.
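For concreteness, once the table exists a query from Python could look roughly like this (a sketch only; the cmip6.cmip6_zarr_stores table name and the zstore / version_id column names are placeholders, not a settled schema):

from google.cloud import bigquery

client = bigquery.Client()

# e.g. find all piControl monthly surface air temperature stores
# without downloading the whole catalog first
sql = """
    SELECT zstore, version_id
    FROM `cmip6.cmip6_zarr_stores`
    WHERE experiment_id = 'piControl'
      AND table_id = 'Amon'
      AND variable_id = 'tas'
"""
for row in client.query(sql).result():
    print(row.zstore, row.version_id)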

@charlesbluca, could you play around with loading the CSV into BigQuery?

@naomi-henderson, can you summarize the process we are currently using to keep this CSV file up to date?

@rabernat
Contributor Author

I am proposing to use BigQuery as the master, but we could still produce a CSV file for convenience. A nightly cron job would do the trick.
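The nightly job could then be little more than a BigQuery extract back to the bucket (a sketch; the table ID is a placeholder and the destination path is only inferred from the current public CSV URL):

from google.cloud import bigquery

client = bigquery.Client()

# Dump the master table back out as the public CSV for anyone who still wants a flat file.
extract_job = client.extract_table(
    "cmip6.cmip6_zarr_stores",                        # placeholder table id
    "gs://cmip6/cmip6-zarr-consolidated-stores.csv",  # assumed GCS path behind the public URL
    job_config=bigquery.ExtractJobConfig(destination_format="CSV"),
)
extract_job.result()  # block until the export finishes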

@naomi-henderson
Contributor

Sounds great! The way I am making the CSV file now is very inefficient and not scalable, so BigQuery sounds very promising!

Here is a summary (sorry about the length) of how we are keeping the CSV file up to date. The original method is what we used to run directly in the cloud on the old pangeo GC collection.

Original method:

  1. Make a listing of all files in GC, parse the dataset ID tuple and the zarr store URL from the file names, and convert to a CSV file with the keyword names in the first row.

But then we wanted two catalogs: one with ALL of the zarr stores and another with just those without serious issues reported to the ES-DOC errata pages.

Newer method:

  1. List all files in GC (actually I now separately parse all files starting with 'activity_id/institution_id/' for all (activity_id, institution_id) tuples, and then update only those which have had new zarr stores added).
  2. Get a listing of all files at ES-DOC, with the severity and id of each issue.
  3a. The https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv noQC (no Quality Control) master catalog is made with the original method, adding the ES-DOC information for affected files.
  3b. The https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores.csv catalog is then constructed by including all entries from the noQC catalog without serious issues (see the pandas sketch after this list).
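To make step 3b concrete, the filtering is essentially the following (a pandas sketch; the issue_severity column name and the severity labels are illustrative, the real noQC columns may differ):

import pandas as pd

# Step 3b sketch: keep only noQC entries whose ES-DOC issue is not serious.
noqc = pd.read_csv("https://cmip6.storage.googleapis.com/cmip6-zarr-consolidated-stores-noQC.csv")
serious = noqc["issue_severity"].isin(["high", "critical"])  # illustrative column name and labels
noqc[~serious].to_csv("cmip6-zarr-consolidated-stores.csv", index=False)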

Latest Method:

I am now including the version_id of each zarr store in the noQC master catalog. The version_id can be found from the dataset tracking_id (a data handle). For this I use DKRZ's PID Landing Page Service. For example, here is what you get from the dataset tracking_id = 'hdl:21.14100/825d13d0-ecc1-33e6-aa56-1092b057d0e5'. The json response can be parsed for the version_id; for this example, version_id = '20190429'. For more details, there is a function, get_version, in my repository module identify.py.
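Roughly, the lookup works like this (a sketch against the generic handle.net REST API rather than the PID Landing Page Service itself; the VERSION_NUMBER field name is an assumption, and get_version in identify.py is the authoritative implementation):

import requests

def get_version_sketch(tracking_id):
    '''Resolve a dataset tracking_id (a handle) to its version_id.
    Sketch only: assumes the handle record exposes a VERSION_NUMBER entry.'''
    handle = tracking_id.replace('hdl:', '')
    resp = requests.get(f'https://hdl.handle.net/api/handles/{handle}')
    resp.raise_for_status()
    for entry in resp.json()['values']:
        if entry['type'] == 'VERSION_NUMBER':  # assumed field name
            return entry['data']['value']
    return None

# get_version_sketch('hdl:21.14100/825d13d0-ecc1-33e6-aa56-1092b057d0e5')
# should return '20190429' for the example above, if the assumptions hold.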

The current noQC catalog already has all version ids, so we only need to call get_version on new zarr stores to construct the new noQC master catalog.

@rabernat
Contributor Author

Here is a summary (sorry about the length) of how we are keeping the CSV file up to date.

It is very useful to get all this info documented in public. Length == detail! It's good!

But then we wanted two catalogs, one with ALL of the zarr stores and another with just those without serious issues

Once we move to BigQuery, presumably we could handle this by just having an extra boolean column, i.e. QC_flag, which is either True or False. Correct? That way we can just have a single database.
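Something like this, presumably (placeholder table and column names again, with qc_flag = True here meaning a serious ES-DOC issue was reported):

from google.cloud import bigquery

client = bigquery.Client()

# The QC'd "catalog" becomes a filter on the single table instead of a second CSV.
qc_rows = client.query("""
    SELECT *
    FROM `cmip6.cmip6_zarr_stores`
    WHERE NOT qc_flag
""").result()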

I am now including the version_id of each zarr store in the noQC master catalog.

Is there a 1:1 mapping between what we call "Dataset ID" (the 8-value tuple) and an ESGF version_id?

One thing we might want to consider in using BigQuery is the concept of nested arrays. We could have a column for which each row contains a list. For example, for each row (i.e. each dataset), we could have a column called source_URLs which stores a list of the http endpoints from which the dataset was constructed. Or any other information that would be useful for provenance tracking.

We could also have timestamps, i.e. creation_date, update_date, etc. Basically whatever would be useful for us to keep track of things.
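In google-cloud-bigquery terms those would be REPEATED fields and TIMESTAMP columns, e.g. (field names here are only illustrations, not a proposed schema):

from google.cloud import bigquery

example_fields = [
    # nested array: each row (dataset) holds a list of source endpoints
    bigquery.SchemaField("source_urls", "STRING", mode="REPEATED",
                         description="http endpoints the dataset was constructed from"),
    # provenance timestamps
    bigquery.SchemaField("creation_date", "TIMESTAMP"),
    bigquery.SchemaField("update_date", "TIMESTAMP"),
]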

@naomi-henderson
Contributor

naomi-henderson commented Jun 18, 2020

No, in general each 8-tuple Dataset ID can have multiple version_ids. In our GC collection, there is only one version_id available (I overwrite them when new versions are published). That is why I started adding version_id to the CSV file. I did not want to make it part of the filename/path because that multiplicity can cause problems for the user trying to find data just using globbing. Now that we find data using the catalog, it would not be such a problem.

I initially chose to save tracking_ids instead of version_id in the dataset metadata. There is a dataset tracking_id (different from the netcdf tracking_ids), and there is (supposed to be, though at least one modeling center did not do this correctly) a 1:1 mapping between the dataset tracking_ids and version_ids. Users do not seem to know how to find version_id from tracking_id (of either sort), so I am now adding them explicitly to the catalog.

Nested arrays would also be good for the netcdf tracking_ids. Fortunately our current tracking_ids allow us to go back through all of the data and re-create the source_URLs using the same DKRZ PID Landing Page Service.

Thanks so much for doing this - it is a tremendous relief to know there is a better solution in the works!

@rabernat
Contributor Author

Ok I kinda understand! 🤯

This discussion eventually needs to lead to us defining a BigQuery Schema for the table. This can be done via a json file, e.g.:

[
 {
   "description": "[DESCRIPTION]",
   "name": "[NAME]",
   "type": "[TYPE]",
   "mode": "[MODE]"
 },
 {
   "description": "[DESCRIPTION]",
   "name": "[NAME]",
   "type": "[TYPE]",
   "mode": "[MODE]"
 }
]

This will be done via a PR to this repo. I would love it if @charlesbluca could take a stab at a first draft of this schema. Then we can continue to discuss / refine via the PR. Charles, is this something you have time for?
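For reference, once a schema JSON like the above exists, loading the current CSV into a table is roughly (a sketch; the schema file name, table ID, and bucket path are all placeholders or assumptions):

from google.cloud import bigquery

client = bigquery.Client()

# Load the existing noQC catalog CSV into a table described by the schema JSON.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row of keyword names
    schema=client.schema_from_json("cmip6_schema.json"),
)
load_job = client.load_table_from_uri(
    "gs://cmip6/cmip6-zarr-consolidated-stores-noQC.csv",  # assumed GCS path behind the public URL
    "cmip6.cmip6_zarr_stores",                             # placeholder table id
    job_config=job_config,
)
load_job.result()  # wait for the load to complete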

@charlesbluca
Member

Yes! I'm currently looking into the documentation myself and will begin drafting a schema definition.

@charlesbluca
Member

Opened up a PR with my first draft.

@charlesbluca
Member

Now that we've imported the table into BigQuery, I'll start looking into scripts to make additions to it. @naomi-henderson, do you have any scripts I could look at to get a sense of what functionality we need?

@naomi-henderson
Contributor

Well, as I explained above, I don't add datasets one-by-one, so I don't really have a script for you. I guess we need to give BigQuery various pieces of information, starting with the gsurl (the GCS url) and the dataset_id (the tuple of keyword values).

You could help me improve on my basic functions, get_zid and get_zdict, which go from the gsurl to the dataset_id (the tuple of keywords) to a dictionary. Note that my functions assume there is a trailing "/"; it would be better if they worked with or without this last "/". In addition, we need to specify whether or not there is an ES-DOC issue_id, but if you can show me how to put the basics into the BigQuery table, I can probably add the data for the other columns.

e.g.:

gsurl = 'gs://cmip6/CMIP/NCAR/CESM2/piControl/r1i1p1f1/Amon/tas/gn/'
zid = ['CMIP', 'NCAR', 'CESM2', 'piControl', 'r1i1p1f1', 'Amon', 'tas', 'gn']
zdict = {'activity_id': 'CMIP', 'institution_id': 'NCAR', 'source_id': 'CESM2', 'experiment_id': 'piControl',\
 'member_id': 'r1i1p1f1', 'table_id': 'Amon', 'variable_id': 'tas', 'grid_label': 'gn'}

using:

def get_zid(gsurl):
    ''' given a GC zarr location, return the dataset_id'''
    assert gsurl[:10] == 'gs://cmip6'
    return gsurl[11:-1].split('/')

def get_zdict(gsurl):
    ''' given a GC zarr location, return a dictionary of keywords'''
    zid = get_zid(gsurl)
    keys = ['activity_id','institution_id','source_id','experiment_id','member_id','table_id','variable_id','grid_label']
    values = list(zid)
    return dict(zip(keys,values)) 
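For what it's worth, wiring get_zdict above into a BigQuery insert could look like this (a sketch only; the table ID and the zstore column name are placeholders, and rstrip('/') is one way to accept the gsurl with or without the trailing '/'):

from google.cloud import bigquery

client = bigquery.Client()

def insert_store(gsurl, table_id='cmip6.cmip6_zarr_stores'):
    '''Sketch: insert one zarr store into a (placeholder) BigQuery table,
    reusing get_zdict defined above. ES-DOC issue columns are left out here.'''
    row = get_zdict(gsurl.rstrip('/') + '/')  # normalize the trailing '/'
    row['zstore'] = gsurl                     # illustrative column name for the GCS url
    errors = client.insert_rows_json(table_id, [row])
    if errors:
        raise RuntimeError(errors)

insert_store('gs://cmip6/CMIP/NCAR/CESM2/piControl/r1i1p1f1/Amon/tas/gn/')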

Here are my relevant notebooks but I don't think they will be very helpful:

@charlesbluca
Member

Opened up a PR #5 with some simple scripts that allow rows to be inserted into the BigQuery table.

@charlesbluca
Member

@naomi-henderson, were you able to check out these scripts? They show relatively simple examples of inserting into the BigQuery table, but it wouldn't be too difficult to expand upon them if need be.
