TCTracks: improve hdf5 I/O #735

tovogt · 2023-06-06T14:47:00Z

Changes proposed in this PR:

This PR changes how TCTracks.from_hdf5 and TCTracks.write_hdf5 work internally. The API stays the same, but the file structure changes. As implemented here, the change is not backwards compatible: files that were stored with the old implementation cannot be read with the new one.
As mentioned in the PR that originally introduced this feature (HDF5 file IO for TCTracks #349), the earlier file format was easy to understand, but it required a lot of disk space and was very different from other file formats that are commonly used to store TC tracks (like IBTrACS or the format used by CHAZ or by Kerry Emanuel). In the meantime, I also found that the I/O is very slow for large numbers of tracks.
I originally implemented this feature to have a compact file format for storing large numbers of TC tracks, but I found that it's simply too slow for real applications. I also thought that this might be a format that we would use in ISIMIP3a to provide TC tracks, but since it's so slow and has quite an unusual structure, we decided to use a different format instead. The format proposed in this PR is very close to what we will use in ISIMIP3a. Basically, we will just rename most of the variables.

I think that nobody currently uses the HDF5 I/O feature of TCTracks objects, and that it's safe to drop backwards compatibility, but I might be wrong. I think I also talked to @chahan about this a few months ago, and he was quite optimistic that nobody actually uses this feature and that we can easily change the file format at some point.

@ThomasRoosli You modified the TCForecast class to be compatible with this feature back in the day (CLIMADA-project/climada_petals#33). The proposed new format won't require any changes to the wrapper you wrote for TCForecast. However, did you end up using this HDF5 I/O feature in practice?

@ChrisFairless @bguillod and everyone who uses CLIMADA's TC feature: Do you know anyone who actually uses the HDF5 I/O feature of the TCTracks class?

If you insist that backwards compatibility is important, I would propose the following: We let write_hdf5 always write to the new format; this should be safe. But in from_hdf5, we implement a check that determines whether the user is trying to read from the legacy file format and falls back to the old implementation if necessary. I would prefer to drop backwards compatibility because keeping it takes quite a lot of lines of code, and I'm quite convinced that nobody uses the old format. But I might be wrong...
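For illustration, the fallback described above could be sketched roughly as follows. This is only a sketch and not code from this PR: `read_with_legacy_fallback`, `is_legacy`, `read_new` and `read_legacy` are hypothetical names, and how a legacy file would actually be detected is left open.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_legacy_fallback(
    file_name: str,
    is_legacy: Callable[[str], bool],
    read_new: Callable[[str], T],
    read_legacy: Callable[[str], T],
) -> T:
    """Use the legacy reader only if the file is detected as old-style.

    All three callables are hypothetical placeholders: the detection check,
    the reader for the new format, and the old reader kept as a fallback.
    """
    if is_legacy(file_name):
        return read_legacy(file_name)
    return read_new(file_name)


# Stand-in callables, only to show the control flow.
result = read_with_legacy_fallback(
    "tracks_old.h5",
    is_legacy=lambda name: name.endswith("_old.h5"),
    read_new=lambda name: "read with new implementation",
    read_legacy=lambda name: "read with old implementation",
)
```

The writer would need no such branch, since it would always produce the new format.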

PR Author Checklist

PR Reviewer Checklist

peanutfun · 2023-06-13T08:28:22Z

Great addition, thank you @tovogt ! 🎉

Regarding backwards compatibility: The purpose of TCTracks is reading IBTrACS or other track data. Its write_ and from_ methods are only there for cycling data. The cases where dropping backwards compatibility for those might become an issue are when you did some extensive resampling on a lot of tracks, or when you "lost" the original files.

So I am fine with dropping it. However, I would like to have a useful error message when somebody tries to load an old-style file with the new reader. It should contain instructions on how to resolve the issue (load the original data, then store it with the new writer) and what to do if this does not work (raise an issue). In case somebody complains, we can still come up with a script that "translates" the old into the new format.

Finally, I am a bit curious about the testing. Why do the tests still pass? Do you think it's justified to not adapt any test?

tovogt · 2023-06-13T11:34:36Z

Thanks for your feedback!

I added a helpful error message in case users try to load a file in the legacy file format. To test this functionality, I also added a 90 KB test data file to the repository. I hope that's okay for a binary blob.
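A minimal sketch of what such a check might look like, assuming the decision whether a file is in the legacy format has already been made elsewhere. The function name, the `looks_legacy` flag and the message wording are illustrative assumptions, not the code merged in this PR.

```python
def raise_if_legacy_format(file_name: str, looks_legacy: bool) -> None:
    """Raise a ValueError with instructions if the file is in the old format.

    `looks_legacy` is a hypothetical flag; how the legacy format is detected
    is not shown here.
    """
    if not looks_legacy:
        return
    raise ValueError(
        f"The file '{file_name}' appears to use the legacy TCTracks HDF5 "
        "format, which the current reader does not support. "
        "To resolve this, load the original track data again and store it "
        "with the current write_hdf5. "
        "If that does not work, please raise an issue."
    )


# A file in the new format passes silently.
raise_if_legacy_format("tracks_new.h5", looks_legacy=False)
```

The message covers the two points requested in the review: how to resolve the problem and what to do if that fails.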

> Finally, I am a bit curious about the testing. Why do the tests still pass? Do you think it's justified to not adapt any test?

The tests only tested the cycling of write_hdf5 and from_hdf5. As you mentioned yourself, "Its write_ and from_ methods are only there for cycling data", so that's what we tested, and that part remains unchanged.
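To illustrate why a cycling test keeps passing when the on-disk format changes, here is a small sketch using only the standard library. JSON and pickle are stand-ins for the old and new file formats, and all function names are hypothetical; none of this is CLIMADA code.

```python
import json
import pickle
import tempfile
from pathlib import Path


def assert_round_trip(write, read, data, suffix):
    """Write data to a temporary file, read it back, and compare."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / f"tracks{suffix}"
        write(data, path)
        assert read(path) == data


# Stand-in for the "old" format: JSON text.
def write_json(data, path):
    path.write_text(json.dumps(data))


def read_json(path):
    return json.loads(path.read_text())


# Stand-in for the "new" format: pickled bytes.
def write_pickle(data, path):
    path.write_bytes(pickle.dumps(data))


def read_pickle(path):
    return pickle.loads(path.read_bytes())


tracks = [{"name": "A", "lat": [10.0, 10.5], "lon": [120.0, 119.5]}]

# The same test passes for both writer/reader pairs, because it only
# compares the data before and after the cycle, never the file contents.
assert_round_trip(write_json, read_json, tracks, ".json")
assert_round_trip(write_pickle, read_pickle, tracks, ".pkl")
```

Such a test says nothing about whether a file written by one pair can be read by the other, which is why a separate test with a stored legacy file is needed for the error message.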

ChrisFairless · 2023-06-13T14:45:13Z

@bguillod and I don't use the existing functionality (as far as we can remember), so I'm fine if this is a breaking change.

And very excited for faster I/O. Thanks!!

peanutfun

All changes look good to me, but there are a few remaining linter issues. As soon as they are fixed, this can be merged!

climada/hazard/tc_tracks.py

tovogt · 2023-06-27T08:57:30Z

> All changes look good to me, but there are a few remaining linter issues. As soon as they are fixed, this can be merged!

Thanks for double-checking. I pushed another commit that should fix all remaining points.

TCTracks: improve hdf5 io

9b8c6ca

Thomas Vogt added 2 commits June 13, 2023 13:23

TCTracks: remove unsed imports

98113c1

TCTracks.from_hdf5: fail gracefully with legacy file format

2ef0c27

Thomas Vogt and others added 2 commits June 26, 2023 13:21

Merge branch 'develop' into feature/tc_tracks_write_hdf5

4447b1c

Merge branch 'develop' into feature/tc_tracks_write_hdf5

e8d8d34

peanutfun requested changes Jun 27, 2023

climada/hazard/tc_tracks.py (outdated, resolved)

climada/hazard/tc_tracks.py (outdated, resolved)

climada/hazard/tc_tracks.py (outdated, resolved)

climada/hazard/tc_tracks.py (resolved)

Thomas Vogt added 2 commits June 27, 2023 10:34

Merge branch 'develop' into feature/tc_tracks_write_hdf5

5c34db5

tc_tracks: _raise_if_legacy_or_unknown_hdf5_format

e0b22b4

Update CHANGELOG.md

9c779a0

peanutfun approved these changes Jun 27, 2023

peanutfun merged commit 6fedaf2 into develop Jun 27, 2023

emanuel-schmid deleted the feature/tc_tracks_write_hdf5 branch June 27, 2023 10:58

This was referenced Jun 29, 2023

Adapt tc_forecast to climada_python improved hdf5 IO CLIMADA-project/climada_petals#83

Merged

Flood of cast warnings after improved hdf5 I/O CLIMADA-project/climada_petals#84

Open
