Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCTracks: improve hdf5 I/O #735

Merged
merged 8 commits into from
Jun 27, 2023
Merged

TCTracks: improve hdf5 I/O #735

merged 8 commits into from
Jun 27, 2023

Conversation

tovogt
Copy link
Collaborator

@tovogt tovogt commented Jun 6, 2023

Changes proposed in this PR:

  • This changes how TCTracks.from_hdf5 and TCTracks.write_hdf5 work internally. While it does not change the API, it changes the file structure and is not backwards compatible the way it is implemented here in the sense that files that have been stored with the old implementation cannot be read with the new implementation.
  • As mentioned in the PR that originally introduced this feature (HDF5 file IO for TCTracks #349), the earlier file format was nice to understand, but it required a lot of disk space and was very different from other file formats that are commonly used to store TC tracks (like IBTrACS or the format used by CHAZ or by Kerry Emanuel). I also found in the meantime that the I/O is really, really slow for large numbers of tracks.
  • I originally implemented this feature to have a compact file format for storing large numbers of TC tracks, but I found that it's simply too slow for real applications. I also thought that this might be a format that we will use in ISIMIP3a to provide TC tracks, but since it's so slow and has a quite unusual structure, we decided to use a different format instead. The format proposed in this PR is very close to what we will use in ISIMIP3a. We will just rename most of the variables, basically.

I think that nobody currently uses the HDF5 I/O feature of TCTracks objects, and think that it's safe to drop backwards compatibility, but I might be wrong. I think, I also talked to @chahan about this a few months ago and he was quite optimistic that nobody actually uses this feature and we can easily change the file format at some point.

@ThomasRoosli You modified the TCForecast class to be compatible with this feature back in the day (CLIMADA-project/climada_petals#33). The proposed new format won't require any changes to the wrapper you wrote for TCForecast. However, did you end up using this HDF5 I/O feature in practice?

@ChrisFairless @bguillod and everyone that uses CLIMADA's TC feature: Do you know anyone who actually uses the HDF5 IO features of the TCTracks class?

If you insist that backwards compatibility is important, I would have this proposition: We let write_hdf5 write to the new format no matter what - this should be safe. But in from_hdf5, we implement a check that determines whether the user is trying to read from the legacy file format and falls back to the old implementation if necessary. I would prefer to drop backwards compatibility because it's quite a lot of lines of code, and I'm quite convinced that nobody uses the old format. But I might be wrong...

PR Author Checklist

PR Reviewer Checklist

@peanutfun
Copy link
Member

Great addition, thank you @tovogt ! 🎉

Regarding backwards compatibility: The purpose of TCTracks is reading IBTrACS or other track data. Its write_ and from_ methods are only there for cycling data. The instances where dropping backwards compatibility for those might become an issue is when you did some excessive resampling on a lot of tracks, or if you "lost" the original files.

So I am fine with dropping it. However, I would like to have a useful error message when somebody tries to load an old-style file with the new reader. It should contain instructions on how to resolve the issue (load the original data, then store it with the new writer) and what to do if this does not work (raise an issue). In case somebody complains, we can still come up with a script that "translates" the old into the new format.

Finally, I am a bit curious about the testing. Why do the tests still pass? Do you think it's justified to not adapt any test?

@tovogt
Copy link
Collaborator Author

tovogt commented Jun 13, 2023

Thanks for your feedback!

I added a helpful error message in case users try to load a file in the legacy file format. To test this functionality, I also added a 90 KB test data file to the repository. I hope that's okay for a binary blob.

Finally, I am a bit curious about the testing. Why do the tests still pass? Do you think it's justified to not adapt any test?

The tests only tested the cycling of write_hdf5 and from_hdf5. As you mentioned yourself: Its write_ and from_ methods are only there for cycling data so that's what we tested and that part remains unchanged.

@ChrisFairless
Copy link
Collaborator

@bguillod and I don't use the existing functionality (as far as we can remember), so I'm fine if this is a breaking change.

And very excited for faster I/O. Thanks!!

Copy link
Member

@peanutfun peanutfun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes look good to me, but there are a few remaining linter issues. As soon as they are fixed, this can be merged!

climada/hazard/tc_tracks.py Outdated Show resolved Hide resolved
climada/hazard/tc_tracks.py Outdated Show resolved Hide resolved
climada/hazard/tc_tracks.py Outdated Show resolved Hide resolved
climada/hazard/tc_tracks.py Show resolved Hide resolved
@tovogt
Copy link
Collaborator Author

tovogt commented Jun 27, 2023

All changes look good to me, but there are a few remaining linter issues. As soon as they are fixed, this can be merged!

Thanks for double-checking. I pushed another commit that should fix all remaining points.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants