
Is there a way to cmorize out-of-memory datasets? #166

Open
aulemahal opened this issue Aug 18, 2023 · 2 comments

@aulemahal

Hi!

I'm tasked with making our CORDEX (Ouranos MRCC5) data publishable. Hourly files at NAM-11 are, of course, quite large. Opening a single year of tas, I get an array of shape (8759, 628, 655), which would take 13.4 GB of RAM (float32). Of course, xarray and dask can help me here, and I could in theory process this chunk by chunk. However, it seems at first glance that the cmorize tools in py-cordex will load the data, making dask useless.
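For context, here is a minimal sketch of how I open such data lazily (the file pattern and chunk size are illustrative, not my actual paths):

import xarray as xr

# open one year of hourly tas lazily; dask chunks keep it out of memory
ds = xr.open_mfdataset(
    "tas_NAM-11_*_1hr_2000*.nc",  # hypothetical file pattern
    chunks={"time": 744},          # roughly one month of hourly steps per chunk
)
print(ds["tas"].shape)  # (8759, 628, 655) -> ~13.4 GB as float32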

I think I see that the in-memory requirement comes from cmor itself, but I am asking here as this is where the xarray-compatible implementation is. Sorry if this isn't the best channel.

What do others do in that situation? Is enough RAM a hard requirement to use cmor?

Similarly, the one-year-per-file rule comes from the CORDEX file spec (I have access to the Feb 2023 draft). My data is stored in monthly netCDFs. Could the standardization process be done on the full dataset (all simulated years), with the multiple files written afterwards? The one-year subsetting could even be automatic, based on the specs.

Finally, it seems to me that all this would be much easier if there were a function that takes an xarray dataset and returns standardized, cmorized xarray dataset(s), which I could then save with xr.save_mfdataset afterwards. Does that exist?
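To illustrate what I mean, a hypothetical sketch (cmorize and filename_from_attrs here are the functions I wish existed, not a real API; only xr.save_mfdataset is real):

import xarray as xr

dss = cmorize(ds)                               # hypothetical: one lazy dataset per output file
paths = [filename_from_attrs(d) for d in dss]   # hypothetical helper building CORDEX file names
xr.save_mfdataset(dss, paths)                   # xarray writes them all in one pass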

Thanks and sorry for the long issue that's not a real issue.

@larsbuntemeyer
Contributor

larsbuntemeyer commented Aug 19, 2023

Hi @aulemahal,
Thanks for opening this issue, I appreciate it! I am facing exactly the same issues with publishing our regional (GERICS-REMO) datasets. The cordex.cmor module actually contains code that was originally written for REMO, but then I thought it would be easy enough to make it more generally applicable to regional climate model datasets. In the past, we faced a lot of issues with inconsistent metadata in CORDEX (see, e.g., #17), and although grid definitions are quite straightforward for CORDEX, they have not really been comparable out of the box in the past (see also this comment).

I think I see that the in-memory requirement comes from cmor itself

Yes, that's right. The cordex.cmor module is simply wrapped around the Python interface of the cmor library itself and combines it with the metadata and grid capabilities of py-cordex. But the actual rewrite is done in the C-library backend of cmor, which has to load all data for a cmorized file into memory at processing time. So the solution I have implemented, and that you mentioned, is to group the input dataset by file frequency (e.g., one year per file for hourly data, as you mentioned) and cmorize it group by group. In other words: py-cordex is not able to cmorize lazily.
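Schematically, the memory requirement enters at the point where data is handed to cmor's Python interface, which expects a realized numpy array. A sketch, omitting the cmor.setup / dataset_json / load_table / axis / variable boilerplate that would define var_id:

import cmor

# ... tables, axes and the variable (var_id) set up beforehand ...
data = ds["tas"].values   # this forces the whole chunk into memory
cmor.write(var_id, data)  # cmor's C backend works on the full array
cmor.close(var_id)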

The one year subsetting could even be automatic, based on the specs.

Absolutely, that is definitely on my roadmap since it is quite straightforward to implement. E.g., in my driving scripts I use something like this:

file_chunksize = {
    "1hr": "A",    # one year per file
    "3hr": "A",
    "6hr": "A",
    "day": "5A",   # five years per file
    "mon": "10A",  # one decade per file
}

def get_chunks(ds, freq, **kwargs):
    # split the dataset into the per-file time groups for this frequency;
    # extra kwargs are forwarded to resample (not to zip, which takes none)
    _, chunks = zip(*ds.resample(time=file_chunksize[freq], **kwargs))
    return chunks

where I cmorize the output chunks from get_chunks one after the other; a sketch of that loop follows below. This is quite feasible even for hourly output; however, you'll still need some HPC resources for this, I guess. Please also note that pandas' annual/decadal notation (A) is not quite what the archive draft specifies, e.g., CORDEX usually wants decades to begin in 1951, not 1950, for whatever reason. On the other hand, this is more or less just a recommendation, I guess.
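A minimal sketch of the driver loop I mean (cmorize_group is a placeholder for whatever wraps the cordex.cmor calls):

for chunk in get_chunks(ds, freq="1hr"):
    cmorize_group(chunk.load())  # load just this group; one output file each

And if the 1951 anchoring matters, a hedged workaround is to group by a shifted decade index instead of resampling:

# decades anchored at 1951 (1951-1960, 1961-1970, ...)
decade = ((ds.time.dt.year - 1951) // 10).rename("decade")
chunks = [grp for _, grp in ds.groupby(decade)]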

Finally, it seems to me that all this would be much easier if there were a function that takes an xarray dataset and returns standardized, cmorized xarray dataset(s), which I could then save with xr.save_mfdataset afterwards. Does that exist?

My very words! I have struggled a lot with cmorization in the past (see, e.g., also this discussion), and I actually started a project (xcmor) where I have implemented some of my insights. The goal is exactly what you mentioned: to cmorize lazily, have some dynamic implementation of those rules that cmor enforces, and give the user more flexibility in the actual output format and storage options. Eventually, this should also replace the cordex.cmor module. Please note that xcmor is really in an early phase, but it is totally open to contributions!

@aulemahal
Author

Ah! Happy to see that I'm not the only one struggling to unify those shiny high-level tools (xarray) with the ancient, solid-as-titanium ones (cdo, nco, cmor). ;)

I'll try out xcmor and I'll be happy to contribute whatever it lacks (and that I have time to implement cleanly)!
