
Is there a way to cmorize out-of-memory datasets? #166

Open
aulemahal opened this issue Aug 18, 2023 · 2 comments

@aulemahal

Hi!

I'm tasked with making our CORDEX (Ouranos MRCC5) data publishable. Hourly files at NAM-11 are, of course, quite large. Opening a single year of tas, I get an array of shape (8759, 628, 655), which would take 13.4 GB of RAM (float32). Of course, xarray and dask can help me here, and I could in theory process this chunk by chunk. However, it seems at first glance that the cmorize tools in py-cordex will load the data, making dask useless.
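For context, here is a minimal sketch of how I open such data lazily (the file pattern and chunk size are illustrative, not my actual paths):

import xarray as xr

# open one year of hourly tas lazily; dask chunks keep it out of memory
ds = xr.open_mfdataset(
    "tas_NAM-11_*_1hr_2000*.nc",  # hypothetical file pattern
    chunks={"time": 744},          # roughly one month of hourly steps per chunk
)
print(ds["tas"].shape)  # (8759, 628, 655) -> ~13.4 GB as float32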

I think I see that the in-memory requirement comes from cmor itself, but I am asking here as this is where the xarray-compatible implementation is. Sorry if this isn't the best channel.

What do others do in that situation? Is enough RAM a hard requirement to use cmor?

Similarly, the one-year-per-file rule comes from the CORDEX file spec (I have access to the Feb 2023 draft). My data is stored in monthly netCDFs. Could the standardization process be done on the full dataset (all simulated years), with the multiple files written afterwards? The one-year subsetting could even be automatic, based on the specs.

Finally, it seems to me that all this would be much easier if there were a function that takes an xarray dataset and returns standardized, cmorized xarray dataset(s), which I could then save with xr.save_mfdataset afterwards. Does that exist?
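To illustrate what I mean, a hypothetical sketch (cmorize and filename_from_attrs here are the functions I wish existed, not a real API; only xr.save_mfdataset is real):

import xarray as xr

dss = cmorize(ds)                               # hypothetical: one lazy dataset per output file
paths = [filename_from_attrs(d) for d in dss]   # hypothetical helper building CORDEX file names
xr.save_mfdataset(dss, paths)                   # xarray writes them all in one pass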

Thanks and sorry for the long issue that's not a real issue.

@larsbuntemeyer
Contributor

larsbuntemeyer commented Aug 19, 2023

Hi @aulemahal,
Thanks for opening this issue, I appreciate it! I am facing exactly the same issues with publishing our regional (GERICS-REMO) datasets. The cordex.cmor module actually contains code that was originally written for REMO, but then I thought it would be easy enough to make it more generally applicable to regional climate model datasets. In the past, we faced a lot of issues with inconsistent metadata in CORDEX (see, e.g., #17), and although grid definitions are quite straightforward for CORDEX, they have not really been comparable out of the box in the past (see also this comment).

I think I see that the in-memory requirement comes from cmor itself

Yes, that's right. The cordex.cmor module is simply wrapped around the Python interface of the cmor library itself and combines it with the metadata and grid capabilities of py-cordex. But the actual rewrite is done in the C-library backend of cmor, which has to load all data for a cmorized file into memory at processing time. So the solution I have implemented, and that you mentioned, is to group the input dataset by file frequency (e.g., one year per file for hourly data, as you mentioned) and cmorize it group by group. In other words: py-cordex is not able to cmorize lazily.
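Schematically, the memory requirement enters at the point where data is handed to cmor's Python interface, which expects a realized numpy array. A sketch, omitting the cmor.setup / dataset_json / load_table / axis / variable boilerplate that would define var_id:

import cmor

# ... tables, axes and the variable (var_id) set up beforehand ...
data = ds["tas"].values   # this forces the whole chunk into memory
cmor.write(var_id, data)  # cmor's C backend works on the full array
cmor.close(var_id)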

The one year subsetting could even be automatic, based on the specs.

Absolutely, that is definitely on my roadmap since it is quite straightforward to implement. E.g., in my driving scripts I use something like this:

file_chunksize = {
    "1hr": "A",    # one year per file
    "3hr": "A",
    "6hr": "A",
    "day": "5A",   # five years per file
    "mon": "10A",  # one decade per file
}

def get_chunks(ds, freq, **kwargs):
    # split the dataset into the per-file time groups for this frequency;
    # extra kwargs are forwarded to resample (not to zip, which takes none)
    _, chunks = zip(*ds.resample(time=file_chunksize[freq], **kwargs))
    return chunks

where I cmorize the output chunks from get_chunks one after the other; a sketch of that loop follows below. This is quite feasible even for hourly output; however, you'll still need some HPC resources for this, I guess. Please also note that pandas' annual/decadal notation (A) is not quite what the archive draft specifies, e.g., CORDEX usually wants decades to begin in 1951, not 1950, for whatever reason. On the other hand, this is more or less just a recommendation, I guess.
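A minimal sketch of the driver loop I mean (cmorize_group is a placeholder for whatever wraps the cordex.cmor calls):

for chunk in get_chunks(ds, freq="1hr"):
    cmorize_group(chunk.load())  # load just this group; one output file each

And if the 1951 anchoring matters, a hedged workaround is to group by a shifted decade index instead of resampling:

# decades anchored at 1951 (1951-1960, 1961-1970, ...)
decade = ((ds.time.dt.year - 1951) // 10).rename("decade")
chunks = [grp for _, grp in ds.groupby(decade)]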

Finally, it seems to me that all this would be much easier if there were a function that takes an xarray dataset and returns standardized, cmorized xarray dataset(s), which I could then save with xr.save_mfdataset afterwards. Does that exist?

My very words! I have struggled a lot with cmorization in the past (see, e.g., also this discussion), and I actually started a project (xcmor) where I have implemented some of my insights. The goal is exactly what you mentioned: to cmorize lazily, have some dynamic implementation of those rules that cmor enforces, and give the user more flexibility in the actual output format and storage options. Eventually, this should also replace the cordex.cmor module. Please note that xcmor is really in an early phase, but it is totally open to contributions!

@aulemahal
Author

Ah! Happy to see that I'm not the only one struggling to unify those shiny high-level tools (xarray) with the ancient, solid-as-titanium ones (cdo, nco, cmor). ;)

I'll try out xcmor and I'll be happy to contribute whatever it lacks (and that I have time to implement cleanly)!
