Handling non-contiguous datasets #14
Comments
@aradhakrishnanGFDL, I just opened an issue on
@naomi-henderson, not sure if this is useful, but IPSL have written a nice tool to do time-axis checking:
@agstephens - very helpful! thanks
Ah, they are reading the netcdf files to get the calendar, but I am trying to use just the file names themselves to see if there is a file missing ... Opening the first netcdf file in each dataset would be more reliable, so this may be useful later on.
This is very helpful and detailed!
Do you have a sense of how best to communicate this to users of these datasets? I am worried that people will make assumptions about what the data should look like and pass blame when it doesn't match their expectations. For a more concrete example, I care most about ScenarioMIP right now, and not every center/model was run for every ssp. I've sometimes referenced https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/ScenarioMIP/index.html for a table of what exists and what doesn't, but it's a little bit tricky to read. I'm wondering if we should have something like this for the cloud holdings, where Grey = never exists, Blue = Zarr, Green = Netcdf, Purple = Both, Yellow = exists but not on cloud.
Yes, I agree that we do not have very effective ways to communicate to the users! In fact, I even keep forgetting about those tables kept at pcmdi! I like the idea of color coding the cloud holdings - need to keep that in mind! I think it would be much better to have efficient tools for querying all of the cloud holdings directly! Then we won't have to generate static tables, etc., and worry about keeping them current.
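To make the proposed color coding concrete, here is a minimal sketch of how one cell of such a table could be classified. The function name and the three boolean flags are hypothetical; nothing here reflects an existing catalog or tool.

```python
def holding_color(on_esgf, in_zarr, in_netcdf):
    """Return the color for one dataset cell (hypothetical helper).

    on_esgf   -- the dataset was published at all
    in_zarr   -- it is present in the cloud Zarr holdings
    in_netcdf -- it is present in the cloud netcdf holdings
    """
    if in_zarr and in_netcdf:
        return "purple"   # both cloud formats
    if in_zarr:
        return "blue"     # Zarr only
    if in_netcdf:
        return "green"    # netcdf only
    if on_esgf:
        return "yellow"   # exists, but not on cloud
    return "grey"         # never exists


print(holding_color(on_esgf=True, in_zarr=False, in_netcdf=False))  # yellow
```

The cloud checks come first so that a dataset found in a bucket is never reported as grey, even if the ESGF flag is out of date.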
@aradhakrishnanGFDL, I have put 3 lists of non-contiguous datasets (yearly, monthly and daily) into our S3 bucket:
There is also a python notebook for checking the differences between the current S3 zarr and S3 netcdf buckets:
Hi @naomi-henderson, great. Thank you. I will plug in the esgf-world csv from https://cmip6-nc.s3.us-east-2.amazonaws.com/esgf-world.csv.gz (will be refreshed again this week).
Hi @aradhakrishnanGFDL, good. I didn't bother to exclude the non-contiguous datasets since there were not so many. I just thought it might give a better idea of the issues. The esgf-world csv used in the notebook is fairly recent, March 15, I think, and I had crawled the 'esgf-world' bucket to create it. Is there also a 'cmip6-nc' bucket? Perhaps I used the wrong bucket?
Ok, sounds good @naomi-henderson. You did use the right bucket: esgf-world.
Quick update and info: here is the slightly modified comparison notebook using the latest esgf-world catalog. The catalog still does not account for the time discontinuity, but we are planning to incorporate the check to some extent in our UDA (internal to GFDL) and S3 sanity-checker script. The details are yet to be determined (e.g. whether to query the ESGF API or THREDDS to see if the file is missing there as well, to account for the cases you described).
Each CMIP6 dataset in the ESGF-CoG nodes consists of an identifier (e.g., CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr) and a version (e.g., 20190920), as seen, for example, here:
When we look at this dataset, we normally start by concatenating the netcdf files in time (here there are 17), using, for example, the xarray method 'open_mfdataset'.
The problem comes when the netcdf files are not contiguous and therefore the resulting xarray dataset has a time grid which is not complete. Some are relatively easy to spot. For example, if just one of five files is missing it might be obvious that there is a problem.
Example 1: S3 has 4 netcdf files, 5 are needed for continuity
At https://esgf-node.llnl.gov/search/cmip6/ there is another file available:
Never mind why one is missing; these things happen easily. But if we blindly concatenate the 4 files, we have a large gap in the time grid.
The real problem comes when there are many files and just one is missing.
Example 2: S3 has 85 netcdf files, 86 are needed for continuity
['vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20150101-20151231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20160101-20161231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20170101-20171231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20180101-20181231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20190101-20191231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20200101-20201231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20210101-20211231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20220101-20221231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20230101-20231231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20240101-20241231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20250101-20251231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20260101-20261231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20270101-20271231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20280101-20281231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20290101-20291231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20300101-20301231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20310101-20311231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20320101-20321231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20330101-20331231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20340101-20341231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20350101-20351231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20360101-20361231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20370101-20371231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20380101-20381231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20390101-20391231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20400101-20401231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20410101-20411231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20420101-20421231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20430101-20431231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20440101-20441231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20450101-20451231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20460101-20461231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20470101-20471231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20480101-20481231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20490101-20491231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20500101-20501231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20510101-20511231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20520101-20521231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20530101-20531231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20540101-20541231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20550101-20551231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20560101-20561231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20570101-20571231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20580101-20581231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20590101-20591231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20600101-20601231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20610101-20611231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20620101-20621231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20630101-20631231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20640101-20641231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20650101-20651231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20660101-20661231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20670101-20671231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20680101-20681231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20690101-20691231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20700101-20701231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20720101-20721231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20730101-20731231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20740101-20741231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20750101-20751231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20760101-20761231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20770101-20771231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20780101-20781231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20790101-20791231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20800101-20801231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20810101-20811231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20820101-20821231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20830101-20831231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20840101-20841231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20850101-20851231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20860101-20861231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20870101-20871231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20880101-20881231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20890101-20891231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20900101-20901231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20910101-20911231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20920101-20921231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20930101-20931231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20940101-20941231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20950101-20951231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20960101-20961231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20970101-20971231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20980101-20981231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20990101-20991231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_21000101-21001231.nc']
The year 2071 is missing from S3, although it is available through ESGF-CoG.
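A gap like this can be found from the file names alone. The following is only a rough sketch of that idea, not the code actually used for the checks in this issue. It assumes daily files whose names end in `_YYYYMMDD-YYYYMMDD.nc` and a standard calendar; a 360-day calendar (with dates such as February 30) would need its own handling. The function name `find_gaps` is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Assumed pattern: the name ends with _YYYYMMDD-YYYYMMDD.nc (daily data).
_RANGE = re.compile(r"_(\d{8})-(\d{8})\.nc$")


def find_gaps(filenames):
    """Return (last day before gap, first day after gap) pairs as YYYYMMDD strings."""
    spans = []
    for name in filenames:
        match = _RANGE.search(name)
        if match is None:
            raise ValueError(f"no date range in file name: {name}")
        start, end = (datetime.strptime(s, "%Y%m%d") for s in match.groups())
        spans.append((start, end))
    spans.sort()

    gaps = []
    # Each file should start exactly one day after the previous one ends.
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if next_start - prev_end > timedelta(days=1):
            gaps.append((prev_end.strftime("%Y%m%d"), next_start.strftime("%Y%m%d")))
    return gaps


# Rebuild the listing from Example 2: 2015 through 2100 with 2071 left out.
stem = "vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn"
files = [f"{stem}_{y}0101-{y}1231.nc" for y in range(2015, 2101) if y != 2071]
print(len(files), find_gaps(files))  # 85 [('20701231', '20720101')]
```

This only compares the date stamps in the names, so it cannot catch a file that is present but has missing time steps inside it.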
In these two examples, the netcdf files are missing from S3 but do exist on ESGF. There are many other examples where the missing files are not available anywhere because of some oversight. For others, the files were never meant to be uploaded. For instance, particular experiments are often reported (by some, not all, modeling centers) for just a subset of the run time. For example, some of the 'abrupt-4xCO2' datasets only report one chunk at the beginning of the experiment (adjustment phase) and one chunk at the end (equilibrium). So I have allowed discontinuities in the 'abrupt-4xCO2' datasets (legitimate or not). Some datasets seem to have one year of daily data for only a subset of the years, so there are many discontinuities.
So here are some questions for opening this issue:
A cursory check of the current contents of the 's3://esgf-world/CMIP6' collection of netcdf files shows the following for the 212,299 datasets (collections of netcdf files) currently in the bucket, where 'total' is the number of datasets at the given frequency and 'non-contiguous' is the number of these datasets which have a non-contiguous set of netcdf files. I didn't check the hourly and sub-hourly datasets, since my crude method of using the netcdf file names to infer missing days is not as reliable for sub-daily datasets.
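For anyone who wants to reproduce a tally of this kind, here is a hedged sketch. It is not the script used for the numbers in this issue. It assumes that files belonging to one dataset share everything in the name before the date range, that the length of the date stamp (YYYY, YYYYMM or YYYYMMDD) indicates yearly, monthly or daily data, and that calendars are standard. The function names and the `skip_experiments` parameter are hypothetical; the default reflects the decision to allow discontinuities in 'abrupt-4xCO2'.

```python
import re
from collections import defaultdict
from datetime import date

_NAME = re.compile(r"^(?P<stem>.+)_(?P<start>\d{4,8})-(?P<end>\d{4,8})\.nc$")
_FREQ = {4: "yearly", 6: "monthly", 8: "daily"}  # assumed stamp-length rule


def _ordinal(stamp):
    """Map a date stamp to an integer that grows by 1 per year, month or day."""
    year = int(stamp[:4])
    if len(stamp) == 4:
        return year
    month = int(stamp[4:6])
    if len(stamp) == 6:
        return year * 12 + month
    return date(year, month, int(stamp[6:8])).toordinal()


def tally(filenames, skip_experiments=("abrupt-4xCO2",)):
    """Count datasets and non-contiguous datasets per inferred frequency."""
    datasets = defaultdict(list)
    for name in filenames:
        m = _NAME.match(name)
        # Names with other stamp lengths (e.g. sub-daily) are ignored.
        if m and len(m["start"]) in _FREQ and len(m["start"]) == len(m["end"]):
            key = (m["stem"], _FREQ[len(m["start"])])
            datasets[key].append((_ordinal(m["start"]), _ordinal(m["end"])))

    counts = defaultdict(lambda: {"total": 0, "non-contiguous": 0})
    for (stem, freq), spans in datasets.items():
        spans.sort()
        broken = any(nxt[0] - prev[1] > 1 for prev, nxt in zip(spans, spans[1:]))
        allowed = any(f"_{exp}_" in stem for exp in skip_experiments)
        counts[freq]["total"] += 1
        if broken and not allowed:
            counts[freq]["non-contiguous"] += 1
    return dict(counts)
```

Called with the 85 file names from Example 2, `tally` would report one daily dataset, counted as non-contiguous.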