This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

Handling non-contiguous datasets #14

Open
naomi-henderson opened this issue Mar 21, 2021 · 11 comments

Comments

@naomi-henderson
Contributor

Each CMIP6 dataset in the ESGF-CoG nodes consists of an identifier (e.g., CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr) and a version (e.g., 20190920), as seen, for example, here:

  1. CMIP6.CMIP.NCC.NorESM2-LM.historical.r2i1p1f1.Omon.thetao.gr
    Data Node: noresg.nird.sigma2.no
    Version: 20190920
    Total Number of Files (for all variables): 17

When we look at this dataset, we normally start by concatenating the netcdf files in time (here there are 17), using, for example, the xarray function `open_mfdataset`.

The problem comes when the netcdf files are not contiguous, so the resulting xarray dataset has an incomplete time grid. Some gaps are relatively easy to spot: for example, if just one of five files is missing, the problem may be obvious.

Example 1: S3 has 4 netcdf files, 5 are needed for continuity

In the current `s3://esgf-world/CMIP6` bucket, there are 4 netcdf files starting with `https://aws-cloudnode.esgfed.org/thredds/fileServer/CMIP6/ScenarioMIP/NOAA-GFDL/GFDL-ESM4/ssp370/r1i1p1f1/Omon/thetao/gr/v20180701/`:
['thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_201501-203412.nc',
 'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_203501-205412.nc',
 'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_205501-207412.nc',
 'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_209501-210012.nc']

At https://esgf-node.llnl.gov/search/cmip6/ there is another file available:

['thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_207501-209412.nc']

Never mind why one is missing; these things easily happen. But if we blindly concatenate the 4 files, we have a large gap in the time grid.
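A filename-based check makes such a gap explicit. Here is a minimal sketch for monthly chunks named `_YYYYMM-YYYYMM.nc` (`monthly_gaps` is an illustrative helper, not part of any library):

```python
import re

def monthly_gaps(filenames):
    """Find gaps between consecutive monthly-chunk netCDF files.

    Assumes CMIP6-style names ending in '_YYYYMM-YYYYMM.nc'.
    Returns (end_of_chunk, start_of_next_chunk) pairs whose months
    are not consecutive.
    """
    ranges = sorted(
        tuple(int(x) for x in re.search(r'_(\d{6})-(\d{6})\.nc$', f).groups())
        for f in filenames
    )
    gaps = []
    for (_, end), (start, _) in zip(ranges, ranges[1:]):
        # month immediately following `end`, e.g. 203412 -> 203501
        nxt = end + 1 if end % 100 < 12 else (end // 100 + 1) * 100 + 1
        if start != nxt:
            gaps.append((end, start))
    return gaps

files = [
    'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_201501-203412.nc',
    'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_203501-205412.nc',
    'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_205501-207412.nc',
    'thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_209501-210012.nc',
]
print(monthly_gaps(files))  # -> [(207412, 209501)]: the 207501-209412 chunk is absent
```

Note that this only checks the file names, not the contents; a truncated file with a well-formed name would still pass.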

The real problem comes when there are many files and just one is missing.

Example 2: S3 has 85 netcdf files, 86 are needed for continuity

In the current `s3://esgf-world/CMIP6` bucket, there are 85 netcdf files starting with `https://aws-cloudnode.esgfed.org/thredds/fileServer/CMIP6/ScenarioMIP/MIROC/MIROC-ES2L/ssp370/r1i1p1f2/day/vas/gn/v20200318/`:

['vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20150101-20151231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20160101-20161231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20170101-20171231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20180101-20181231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20190101-20191231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20200101-20201231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20210101-20211231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20220101-20221231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20230101-20231231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20240101-20241231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20250101-20251231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20260101-20261231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20270101-20271231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20280101-20281231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20290101-20291231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20300101-20301231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20310101-20311231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20320101-20321231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20330101-20331231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20340101-20341231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20350101-20351231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20360101-20361231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20370101-20371231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20380101-20381231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20390101-20391231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20400101-20401231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20410101-20411231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20420101-20421231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20430101-20431231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20440101-20441231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20450101-20451231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20460101-20461231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20470101-20471231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20480101-20481231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20490101-20491231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20500101-20501231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20510101-20511231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20520101-20521231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20530101-20531231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20540101-20541231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20550101-20551231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20560101-20561231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20570101-20571231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20580101-20581231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20590101-20591231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20600101-20601231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20610101-20611231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20620101-20621231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20630101-20631231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20640101-20641231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20650101-20651231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20660101-20661231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20670101-20671231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20680101-20681231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20690101-20691231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20700101-20701231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20720101-20721231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20730101-20731231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20740101-20741231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20750101-20751231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20760101-20761231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20770101-20771231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20780101-20781231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20790101-20791231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20800101-20801231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20810101-20811231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20820101-20821231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20830101-20831231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20840101-20841231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20850101-20851231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20860101-20861231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20870101-20871231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20880101-20881231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20890101-20891231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20900101-20901231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20910101-20911231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20920101-20921231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20930101-20931231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20940101-20941231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20950101-20951231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20960101-20961231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20970101-20971231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20980101-20981231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20990101-20991231.nc',
'vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_21000101-21001231.nc']

The year 2071 is missing from S3, although it is available through ESGF-CoG.
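With one file per year, the same filename-based idea can report exactly which years are absent. A minimal sketch, assuming daily chunks named `_YYYYMMDD-YYYYMMDD.nc` that each span one calendar year (`missing_years` is an illustrative helper):

```python
import re

def missing_years(filenames):
    """List calendar years absent from a set of one-file-per-year chunks.

    Assumes names ending in '_YYYYMMDD-YYYYMMDD.nc' where each file
    covers exactly one year, as in the listing above.
    """
    years = sorted(int(re.search(r'_(\d{4})\d{4}-\d{8}\.nc$', f).group(1))
                   for f in filenames)
    present = set(years)
    return [y for y in range(years[0], years[-1] + 1) if y not in present]

# Rebuild the 85-file listing above (every year 2015-2100 except 2071).
files = ['vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_%d0101-%d1231.nc' % (y, y)
         for y in range(2015, 2101) if y != 2071]
print(missing_years(files))  # -> [2071]
```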

In these two examples, the netcdf files are missing from S3 but do exist elsewhere. In many other cases, the missing files are unavailable through some oversight. For others, the files were never meant to be uploaded. For instance, particular experiments are often reported (by some, though not all, modeling centers) for just a subset of the run time. For example, some of the 'abrupt-4xCO2' datasets report only one chunk at the beginning of the experiment (the adjustment phase) and one chunk at the end (equilibrium). So I have allowed discontinuities in the 'abrupt-4xCO2' datasets (legitimate or not). Some datasets seem to have one year of daily data for only a subset of the years, so there are many discontinuities.

So here are some questions for opening this issue:

  1. Should we somehow allow the 'legitimate' non-contiguous datasets? If so, should we divide them up into contiguous chunks and serve them separately?
  2. What should we do about datasets with missing netcdf files? Certainly we could try to complete the list, but if the files do not exist, what then?

A cursory check of the current contents of the `s3://esgf-world/CMIP6` collection shows the following for the 212,299 datasets (collections of netcdf files) currently in the bucket, where 'total' is the number of datasets at the given frequency and 'non-contiguous' is the number of those datasets with a non-contiguous set of netcdf files. I didn't check the hourly and sub-hourly datasets, since my crude method of using the netcdf file names to infer missing days is less reliable for sub-daily data.

frequency    total   non-contiguous   percent
yearly       23758        94            3%
monthly     179953      2490            1.4%
daily        23758      1947            8%
hourly        4471   not checked
sub-hourly     749   not checked
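The crude filename-only method mentioned above could, for instance, guess the chunk frequency from the length of the date stamps in the name. This is a sketch of one possible heuristic (not the actual script used to build the table), and as noted it says nothing about the model calendar:

```python
import re

def chunk_frequency(filename):
    """Guess the time-stamp granularity of a CMIP6 chunk from its name.

    Maps the digit count of the date-range suffix to a frequency:
    YYYY -> yearly, YYYYMM -> monthly, YYYYMMDD -> daily; anything
    longer is treated as sub-daily. Returns None if no range is found.
    """
    m = re.search(r'_(\d+)-(\d+)\.nc$', filename)
    if not m:
        return None
    return {4: 'yearly', 6: 'monthly', 8: 'daily'}.get(len(m.group(1)),
                                                       'sub-daily')

print(chunk_frequency('thetao_Omon_GFDL-ESM4_ssp370_r1i1p1f1_gr_201501-203412.nc'))
# -> monthly
print(chunk_frequency('vas_day_MIROC-ES2L_ssp370_r1i1p1f2_gn_20710101-20711231.nc'))
# -> daily
```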
@naomi-henderson
Contributor Author

naomi-henderson commented Mar 21, 2021

@aradhakrishnanGFDL , I just opened an issue on pangeo-forge/cmip6-pipeline to get our conversation going on the non-contiguous dataset issue. I could provide a listing of the datasets, if that would be useful.

@agstephens

@naomi-henderson, not sure if this is useful, but IPSL have written a nice tool to do time-axis checking:
http://prodiguer.github.io/nctime/index.html

@naomi-henderson
Contributor Author

@agstephens - very helpful! thanks

@naomi-henderson
Contributor Author

Ah, they are reading the netcdf files to get the calendar, but I am trying to use just the file names themselves to see whether a file is missing. Opening the first netcdf file in each dataset would be more reliable, so this may be useful later on.
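Reading the calendar matters because the expected number of records per year differs by CF calendar, which filename arithmetic alone cannot know. A small sketch of that bookkeeping for daily data (`expected_daily_records` is an illustrative helper; in practice the calendar string would be read from the first file's time variable):

```python
import calendar

def expected_daily_records(year, cal):
    """Expected number of daily timesteps in one year for a CF calendar.

    `cal` is a CF calendar name such as '360_day', 'noleap', or
    'standard'; unrecognized names fall through to the real-world
    (Gregorian) rule.
    """
    if cal == '360_day':
        return 360                      # twelve 30-day months
    if cal in ('noleap', '365_day'):
        return 365                      # February is always 28 days
    if cal in ('all_leap', '366_day'):
        return 366                      # February is always 29 days
    # 'standard' / 'gregorian' / 'proleptic_gregorian'
    return 366 if calendar.isleap(year) else 365

print(expected_daily_records(2016, 'standard'))  # -> 366
print(expected_daily_records(2016, 'noleap'))    # -> 365
```

So a dataset that looks one day short under a 'standard' calendar may be perfectly complete under 'noleap'.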

@zflamig

zflamig commented Mar 22, 2021

This is very helpful and detailed!

For others, the files were never meant to be uploaded.

Do you have a sense of how best to communicate this to users of these datasets? I am worried that people will make assumptions about what the data should look like and pass blame when it doesn't match their expectations.

For a more concrete example, I care most about ScenarioMIP right now, and not every center/model was run for every ssp. I've sometimes referenced https://pcmdi.llnl.gov/CMIP6/ArchiveStatistics/esgf_data_holdings/ScenarioMIP/index.html for a table of what exists and what doesn't, but it's a little tricky to read. I'm wondering if we should have something like this for the cloud holdings, where Grey = never existed, Blue = Zarr, Green = NetCDF, Purple = both, Yellow = exists but not on the cloud.

@naomi-henderson
Contributor Author

Yes, I agree that we do not have very effective ways to communicate to the users! In fact, I even keep forgetting about those tables kept at pcmdi! I like the idea of color coding the cloud holdings - need to keep that in mind!

I think it would be much better to have efficient tools for querying all of the cloud holdings directly! Then we won't have to generate static tables, etc, and worry about keeping them current.

@naomi-henderson
Contributor Author

@aradhakrishnanGFDL, I have put 3 lists of non-contiguous datasets (yearly, monthly and daily) into our S3 bucket:

There is also a python notebook for checking the differences between the current S3 zarr and S3 netcdf buckets:

For example: [figure: NetCDFvZarr stats]

@aradhakrishnanGFDL

Hi @naomi-henderson, great, thank you. I will plug in the esgf-world csv from https://cmip6-nc.s3.us-east-2.amazonaws.com/esgf-world.csv.gz (it will be refreshed again this week).
Just to clarify: the three non-contiguous lists you provided are the ones I will need to exclude from the esgf-world csv before another round of comparison on my end? Thanks,

@naomi-henderson
Contributor Author

Hi @aradhakrishnanGFDL, good. I didn't bother to exclude the non-contiguous datasets, since there were not so many. I just thought the lists might give a better idea of the issues.

The esgf-world csv used in the notebook is fairly recent, March 15, I think, and I had crawled the 'esgf-world' bucket to create it. Is there also a 'cmip6-nc' bucket? Perhaps I used the wrong bucket?

@aradhakrishnanGFDL

Ok, sounds good @naomi-henderson. You did use the right bucket: esgf-world.
cmip6-nc is just the bucket with the intake catalogs and such. It could use a better name! I just updated the CSV https://cmip6-nc.s3.us-east-2.amazonaws.com/esgf-world.csv.gz as well; I'm not sure the results would change drastically. I used a quick script at https://github.com/aradhakrishnanGFDL/CatalogBuilder/blob/master/gen_intake_s3.py to generate the catalog.

@aradhakrishnanGFDL

Quick update and info: here is the slightly modified comparison notebook using the latest esgf-world catalog. The catalog still does not account for the time discontinuity, but we are planning to incorporate the check, to some extent, into our UDA (internal to GFDL) and S3 sanity-checker scripts, though the details are yet to be determined (e.g., query the ESGF API or THREDDS to see whether a file is missing there as well, to account for the cases you described).
