-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Save grain trace data to HDF5 format #790
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #790 +/- ##
==========================================
+ Coverage 84.36% 84.73% +0.36%
==========================================
Files 21 21
Lines 3134 3196 +62
==========================================
+ Hits 2644 2708 +64
+ Misses 490 488 -2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some thoughts from an hour looking over and thinking about it, not all are fully formed I'm afraid.
Generally I think if we take some time to try and leverage Numpy/Scipy functions to leverage vectorisation it will pay off in the long run.
A useful reference I came across on this topic is From Numpy to Python by Nicholas Rougier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing these suggestions so quickly.
Suggested a useful practice for labeling parameters in tests that I discovered and have adopted and think it would be useful going forward to use these.
…or backwards compat
@@ -438,14 +441,17 @@ def test_load_scan_topostats(load_scan_topostats: LoadScans) -> None: | |||
image, px_to_nm_scaling = load_scan_topostats.load_topostats() | |||
grain_masks = load_scan_topostats.grain_masks | |||
above_grain_mask = grain_masks["above"] | |||
grain_trace_data = load_scan_topostats.grain_trace_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These lines updated from re-creating the demo .topostats
file file.topostats
since we are changing a part of how they are generated, to keep it consistent.
topostats/io.py
Outdated
} | ||
|
||
|
||
def _hdf5_add_known_datatype( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is not unit tested. It is covered comprehensively though by test_dict_to_hdf5_and_hdf5_to_dict
which tests turning dicts into hdf5 files and hdf5 files back to dicts. (Codecov helpfully pointed out that the block of code in _hdf5_add_known_datatypes()
that handles lists was not covered so I fixed that, and also we know that codecov is able to detect coverage of that function.)
I started to write a test for this function but found that I was just duplicating the aforementioned test.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good for getting the trace data into the HDF5 files.
Various comments in-line, some pedantry about variable names I'm afraid.
I think once this is in place we can then write a function that takes a dictionary or loads a .topofile
, extracts the heights and writes them to JSON.
One thing we should probably do soon though is go through carefully and move what we can to topofileformats.
tests/test_io.py
Outdated
), | ||
], | ||
) | ||
def test_dict_to_hdf5_and_hdf5_to_dict(tmp_path: Path, input_dict: dict, group_path: str, expected: dict) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to favour keeping tests focused on a single function and would be inclined to split these. To avoid duplication of parameters we can parametrize fixtures although I think I might have encountered a problem with pytest-lazyfixtures
at some point in the last few months (EDIT : It was #787). I based the above article on this post which shows the other way of doing this.
I can understand why its written as so as it takes dict > hdf5 > dict
and checks the result is the same as the original (in which case we only really need to define a single dict
. But if the final assert()
fails there is no way of knowing which of the two functions has failed.
Writing dict > hdf5
and then loading with plain h5py
would allow a more direct test of the dict_to_hdf5
perhaps (I appreciate this may result in simply writing in the tests code that is in hdf5_to_dict()
and so its a bit of a 🐔 and 🥚 argument).
tests/test_io.py
Outdated
|
||
assert hdf5_file_keys == [ | ||
# Load the saved .topostats file using LoadScans | ||
loadscans = LoadScans([tmp_path / "topostats_file_test.topostats"], channel="") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it would it be useful to move this out to a fixture?
tests/test_io.py
Outdated
# Load the saved .topostats file using LoadScans | ||
loadscans = LoadScans([tmp_path / "topostats_file_test.topostats"], channel="") | ||
loadscans.get_data() | ||
read_topostats_file_data_dict = loadscans.img_dict["topostats_file_test"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
read_topostats_file_data_dict = loadscans.img_dict["topostats_file_test"] | |
topostats_data = loadscans.img_dict["topostats_file_test"] |
Its the result of having read_
and putting the type information in a variable name is a Python anti-pattern.
We're on top of it now but at some point I really want to go through and ensure typehints are correct and introduce mypy
to the pre-commit checks to help with this (see #516 and #721).
topostats/tracing/dnatracing.py
Outdated
# Flip every labelled region to be 1 instead of its label | ||
cropped_masks = [np.where(grain == 0, 0, 1) for grain in cropped_masks] | ||
return (cropped_images, cropped_masks) | ||
cropped_masks_dict = {index: np.where(grain == 0, 0, 1) for index, grain in cropped_masks_dict.items()} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above, types in variable names are a Python anti-pattern, same is true for cropped_image_dict
.
topostats/io.py
Outdated
@@ -32,6 +33,48 @@ | |||
# pylint: disable=too-many-lines | |||
|
|||
|
|||
def dict_almost_equal(dict1, dict2, abs_tol=1e-9): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be inclined to place this in another module as its not really concerned with input/output.
It looks like its only used in the tests so I wonder if its worth defining this in the test file? 🤔
I'll have a search and see if there are recommendations. I'd be surprised if there isn't already a solution to comparing dictionaries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are solutions to comparing dictionaries but not recursively, with numpy arrays and allowing a tolerance as far as I can tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I found similar classes/functions mentioned in various places.
I think as this is solely for testing it perhaps shouldn't be in topostats.io
but defined in the test files where its compared. Not sure if that is good practice or if it could be incorporated in some other manner.
Co-authored-by: Neil Shephard <[email protected]>
…opoStats into SylviaWhittle/trace_stats_HDF5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Epic work @SylviaWhittle very comprehensive tests.
I had one thought, noted in line, that we could perhaps address in the future but this looks good to go.
I would suggest after this merged a release (v2.2.1
) could then be made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding in some many tests here, really useful 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very comprehensive tests here, thank you 👍
image_spline_trace = tracing_results["image_spline_trace"] | ||
tracing_stats[direction]["threshold"] = direction | ||
|
||
grain_trace_data[direction] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR but I wonder if we could simplify this in the future and have dnatracing.trace_image()
return the dictionary that we want/need rather than cobbling it together here. 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Epic work @SylviaWhittle very comprehensive tests.
I had one thought, noted in line, that we could perhaps address in the future but this looks good to go.
I would suggest after this merged a release (v2.2.1
) could then be made.
Closes #802 Arrays and data are now saved in HDF5 files (#790) and so `.npy` arrays are somewhat redundant. This PR leaves the `io.save_array()` function in place should interactive use require saving of arrays but removes its use from `topostats.processing.run_filters()` so that the `.npy` files are no longer saved to disk. Users wishing to access processed data should load it from the HDF5 formatted `.topostats` files that are saved during processing.
Closes #488
This PR replaces #695. Instead of saving the trace data to JSON, we have decided to save it to the
.topostats
file in HDF5 format.Experimentalists have requested that the grain trace data be accessible and readable in plain text format if possible, but we will add functions to make it easier to work with later on, adding a JSON dump function if absolutely necessary.