
Add module for computing river flood footprints from GloFAS river discharge data #64

Merged: 132 commits into develop from feature/glofas-river-flood, Mar 5, 2024

Conversation

@peanutfun (Member) commented Jan 5, 2023

The module includes a data pipeline which automatically downloads GloFAS river discharge data and transforms it into flood footprints, which in turn can be transferred to Hazard or RiverFlood objects.
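For orientation, a minimal usage sketch of the intended workflow. The class name RiverFloodComputation is taken from the discussion below; the import path and all argument names are assumptions and may differ from the merged API:

```python
# Minimal sketch, assuming hypothetical call signatures.
from climada_petals.hazard.rf_glofas import RiverFloodComputation  # assumed import path

rf_comp = RiverFloodComputation()

# Download GloFAS discharge, translate it into return periods via GEV fits,
# and interpolate onto flood hazard maps to obtain a flood footprint.
footprint = rf_comp.compute(
    countries=["CHE"],   # hypothetical parameter
    date="2023-01-05",   # hypothetical parameter
)
# The result can then be turned into Hazard/RiverFlood objects.
```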

The module introduces a few new dependencies. They have to be installed into the Jenkins test environment for the builds to succeed:

  • dantro for the data pipeline
  • cdsapi for downloading data from the Copernicus Data Store (see the sketch after this list)
  • ruamel.yaml for reading YAML files. This one is already a dependency of dantro.
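
For reference, the raw download that climada_petals.util.cds_glofas_downloader wraps looks roughly like this with cdsapi. The dataset name and request keys below are assumptions based on the public CDS catalogue, not copied from this PR:

```python
import cdsapi

client = cdsapi.Client()  # reads the API key from ~/.cdsapirc

# Assumed dataset name and request keys for GloFAS historical discharge on the CDS.
client.retrieve(
    "cems-glofas-historical",
    {
        "variable": "river_discharge_in_the_last_24_hours",
        "hyear": "2022",
        "hmonth": "08",
        "hday": ["01", "02", "03"],
        "format": "grib",
    },
    "glofas_discharge.grib",  # target file
)
```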

Major changes in CLIMADA Petals:

  • Add the climada_petals.hazard.rf_glofas subpackage
  • Add climada_petals.util.cds_glofas_downloader utility functions
  • Add module documentation

To do:

  • Tutorial
  • FLOPROS database integration
  • Some more tests
  • Fix linter issues

peanutfun and others added 22 commits October 18, 2022 15:22
Use cdsapi to download GloFAS data from Copernicus Data Store. So far,
the actual download is not tested.

* Add util functions for downloading data.
* Add unit tests for request handling.
* Update operations to fix issues found when writing the tests.
* Add unit test case for dantro operations.
* Tweak CDS GloFAS downloader.
* Add option to set countries instead of lat/lon limits when downloading
  GloFAS data.
* Return pandas Series of Hazards with multi index.
* Use discharge dataset for lat/lon slicing of all other datasets.
* Add unit tests.
Downloads will be skipped if the target file exists with the same
request dict.

* Place the request as YAML file next to the target file for request
  comparison.
* Add option to control using the "cached" results or always downloading
  the data.
* Update unit tests.
* Explicitly list ruamel.yaml as requirement (already required by
  dantro).
NOTE: The commented-out code would be an alternative way to define the select
dimension based on values instead of indices.

* Add operation
* Add test case for operation
* Update unit tests accordingly.
* Add core dimension checks to flood depth unit tests.
* Add operations and config for computing the GEV fits and merging flood
  maps, which are both used for computing a flood footprint.
* Update affected operations and configs.
* Remove GloFASRiverFlood class in favor of two functions.
* Update tests
* Move respective files into their own subdirectory.
* Adapt configuration files to latest dantro version.
* Add 'transform_ops.py' containing only dantro transformations.
* Expose user functions via dedicated __init__.py
* Add option to run tasks in parallel
Used for reading GeoTIFF with xarray.
@peanutfun marked this pull request as draft January 5, 2023 16:07
@peanutfun linked an issue Jan 9, 2023 that may be closed by this pull request
@peanutfun (Member, Author) commented:

@emanuel-schmid Could you have a look at why the new dependencies are not found in the checks?

@peanutfun (Member, Author) commented Dec 8, 2023

@tovogt Thanks again for the thorough review! To comment on your overall thoughts:

> It's really unfortunate that data needs to be written uncompressed first due to performance issues.

Yes, but this is what I came up with after months of using the module. I don't think I can do much better without further help. In my experience, using zlib is indeed horribly slow with dask, and it also does not properly support multi-process writing to a single file. You suggest calling compute first and then saving the data. However, this requires the entire data to fit into memory in the first place. Depending on the country and time frame you want to compute flood footprints for, this is not feasible given the usual memory space of a personal computer.

However, my module actually gives users the option to optimize this themselves. By setting store_intermediates to False and executing each step of RiverFloodComputation.compute themselves, users are free to call compute and store data as they see fit. I tried to find default settings that work no matter what. The main restriction now is drive space: you might end up writing hundreds of GB, but it is actually quite performant (with a modern SSD).
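A sketch of that pattern, assuming hypothetical names for the individual pipeline steps (only store_intermediates and compute are taken from this thread; the import path and step methods are illustrative):

```python
from climada_petals.hazard.rf_glofas import RiverFloodComputation  # assumed import path

# Skip writing intermediate results to disk; the user handles storage.
rf_comp = RiverFloodComputation(store_intermediates=False)

# Hypothetical step methods standing in for the stages of compute().
discharge = rf_comp.download_forecast(countries=["CHE"], date="2023-01-05")
return_period = rf_comp.return_period(discharge)

# Decide explicitly when to materialize the dask-backed data and where to put it.
return_period = return_period.compute()
return_period.to_netcdf("return_period.nc")
```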

> In the tutorial notebook, there are some occurrences of py:func which are not translated properly by Sphinx.

The MyST parser should support these types of references; see https://myst-parser.readthedocs.io/en/latest/syntax/cross-referencing.html#reference-roles

Now that I look at the doc/conf.py, we might actually still use the old nbsphinx parser for reading the notebooks. I'll try to fix that.

@tovogt (Collaborator) commented Dec 11, 2023

> Yes, but this is what I came up with after months of using the module. I don't think I can do much better without further help. In my experience, using zlib is indeed horribly slow with dask, and it also does not properly support multi-process writing to a single file. You suggest calling compute first and then saving the data. However, this requires the entire data to fit into memory in the first place. Depending on the country and time frame you want to compute flood footprints for, this is not feasible given the usual memory space of a personal computer.

I think it's desirable to require each individual NetCDF file to contain at most as much data as could potentially fit into (a reasonable amount of) memory. Ideally, I would even propose to have at most 4 GB of (uncompressed) data per NetCDF file. With NetCDFs, it is very easy to split up data into several files and then load the dataset as a multi-file dataset, e.g. using xr.open_mfdataset. If you adhere to that, you can very well call compute for each chunk of data that's supposed to end up in an individual file. It's much faster, the data is very easy to handle, and there are practically no disadvantages. Monolithic, long-running processes that produce monolithic chunks of data are extremely inconvenient in almost every environment. For all projects I work with, I try to split everything up into a high number of slim processes with short run times that each produce comparably small chunks of data, and that is much more convenient under almost all circumstances I can think of.
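A sketch of that pattern with plain xarray (the input file, the yearly split, and the file layout are illustrative):

```python
import xarray as xr

# Open the (large) dask-backed dataset lazily.
ds = xr.open_dataset("flood_footprints_all.nc", chunks={"time": 100})  # illustrative input

# Write one reasonably small NetCDF file per year, computing each piece
# right before it is written so only one year needs to fit into memory.
for year, ds_year in ds.groupby("time.year"):
    ds_year.compute().to_netcdf(f"flood_footprint_{year}.nc")

# Later, treat the pieces as a single lazy multi-file dataset again.
combined = xr.open_mfdataset("flood_footprint_*.nc", combine="by_coords")
```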

> However, my module actually gives users the option to optimize this themselves. By setting store_intermediates to False and executing each step of RiverFloodComputation.compute themselves, users are free to call compute and store data as they see fit. I tried to find default settings that work no matter what. The main restriction now is drive space: you might end up writing hundreds of GB, but it is actually quite performant (with a modern SSD).

As I said, this is not a merge-blocker from my side, and I'm happy to go with this solution.

Installing xesmf would require reloading the environment,
which does not happen online.
@tovogt (Collaborator) commented Dec 12, 2023

You are using xesmf to regrid a raster with bilinear interpolation. Why don't you use rasterio.warp.reproject for that? It would avoid introducing a new dependency, it's really very powerful, and it's already used in several other places in CLIMADA.
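
For comparison, bilinear regridding with rasterio would look roughly like this (the source and target grids below are illustrative):

```python
import numpy as np
from rasterio.enums import Resampling
from rasterio.transform import from_origin
from rasterio.warp import reproject

# Illustrative regular lat/lon grids: 1 degree source, 0.5 degree target.
src = np.random.rand(180, 360).astype("float32")
src_transform = from_origin(-180.0, 90.0, 1.0, 1.0)
dst = np.zeros((360, 720), dtype="float32")
dst_transform = from_origin(-180.0, 90.0, 0.5, 0.5)

# Bilinear resampling from the source grid onto the target grid.
reproject(
    source=src,
    destination=dst,
    src_transform=src_transform,
    src_crs="EPSG:4326",
    dst_transform=dst_transform,
    dst_crs="EPSG:4326",
    resampling=Resampling.bilinear,
)
```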

@peanutfun (Member, Author) commented:

> Why don't you use rasterio.warp.reproject for that?

Simply because I am not familiar with it. I first used the xarray-internal interpolation, which is horribly slow and does not take geospatial information into account. So I switched to xesmf because it was recommended to me and is simple to use with xarray data structures. But it is also difficult to install and poses an issue on Euler. So I would be very happy about a suggestion for how to drop it and switch to another implementation, provided that implementation is not much slower.
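
For context, the xesmf pattern in question is roughly the following (grids and variable names are illustrative); the nearest_s2d extrapolation mentioned below fills target cells that lie outside the source grid:

```python
import numpy as np
import xarray as xr
import xesmf as xe

# Illustrative coarse source grid and finer target grid.
ds_in = xr.Dataset(
    {"flood_depth": (("lat", "lon"), np.random.rand(18, 36))},
    coords={"lat": np.arange(-85.0, 90.0, 10.0), "lon": np.arange(-175.0, 180.0, 10.0)},
)
ds_out = xr.Dataset(
    coords={"lat": np.arange(-89.5, 90.0, 1.0), "lon": np.arange(-179.5, 180.0, 1.0)}
)

# Bilinear regridding; nearest_s2d extrapolates to cells not covered by the source grid.
regridder = xe.Regridder(ds_in, ds_out, method="bilinear", extrap_method="nearest_s2d")
ds_regridded = regridder(ds_in)
```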

@tovogt (Collaborator) commented Dec 13, 2023

Okay, after a closer look, I think you won't be able to have anything similar to nearest_s2d extrapolation in rasterio or similarly basic packages.

peanutfun and others added 4 commits January 16, 2024 12:21
This avoids overwriting data downloaded for the same day (forecast)
or year (reanalysis/historical).
@tovogt (Collaborator) left a comment:

Thanks @ThomasRoosli for the final cleanup. This is ready to be merged from my side.

@ThomasRoosli merged commit c118b6c into develop Mar 5, 2024
4 checks passed
@emanuel-schmid deleted the feature/glofas-river-flood branch March 6, 2024 08:25

Successfully merging this pull request may close these issues: Add module for computing river flood hazards from GloFAS discharge data
5 participants