Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save grain trace data to HDF5 format #790

Merged
merged 40 commits into from
Feb 12, 2024

Conversation

SylviaWhittle
Copy link
Collaborator

@SylviaWhittle SylviaWhittle commented Jan 31, 2024

Closes #488

This PR replaces #695. Instead of saving the trace data to JSON, we have decided to save it to the .topostats file in HDF5 format.

Experimentalists have requested that the grain trace data be accessible and readable in plain text format if possible, but we will add functions to make it easier to work with later on, adding a JSON dump function if absolutely necessary.

@SylviaWhittle SylviaWhittle self-assigned this Jan 31, 2024
Copy link

codecov bot commented Jan 31, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8bfa700) 84.36% compared to head (4ae1b70) 84.73%.
Report is 21 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #790      +/-   ##
==========================================
+ Coverage   84.36%   84.73%   +0.36%     
==========================================
  Files          21       21              
  Lines        3134     3196      +62     
==========================================
+ Hits         2644     2708      +64     
+ Misses        490      488       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@ns-rse ns-rse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts from an hour looking over and thinking about it, not all are fully formed I'm afraid.

Generally I think if we take some time to try and leverage Numpy/Scipy functions to leverage vectorisation it will pay off in the long run.

A useful reference I came across on this topic is From Numpy to Python by Nicholas Rougier.

topostats/tracing/dnatracing.py Outdated Show resolved Hide resolved
topostats/tracing/dnatracing.py Outdated Show resolved Hide resolved
topostats/tracing/dnatracing.py Outdated Show resolved Hide resolved
topostats/tracing/dnatracing.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ns-rse ns-rse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing these suggestions so quickly.

Suggested a useful practice for labeling parameters in tests that I discovered and have adopted and think it would be useful going forward to use these.

tests/tracing/test_dnatracing_single_grain.py Outdated Show resolved Hide resolved
tests/tracing/test_dnatracing_single_grain.py Outdated Show resolved Hide resolved
tests/tracing/test_dnatracing_single_grain.py Outdated Show resolved Hide resolved
tests/tracing/test_dnatracing_single_grain.py Outdated Show resolved Hide resolved
tests/tracing/test_dnatracing_single_grain.py Outdated Show resolved Hide resolved
tests/tracing/test_dnatracing_single_grain.py Outdated Show resolved Hide resolved
tests/tracing/test_dnatracing_single_grain.py Outdated Show resolved Hide resolved
tests/tracing/test_dnatracing_single_grain.py Show resolved Hide resolved
topostats/tracing/dnatracing.py Outdated Show resolved Hide resolved
topostats/tracing/dnatracing.py Show resolved Hide resolved
@@ -438,14 +441,17 @@ def test_load_scan_topostats(load_scan_topostats: LoadScans) -> None:
image, px_to_nm_scaling = load_scan_topostats.load_topostats()
grain_masks = load_scan_topostats.grain_masks
above_grain_mask = grain_masks["above"]
grain_trace_data = load_scan_topostats.grain_trace_data
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These lines updated from re-creating the demo .topostats file file.topostats since we are changing a part of how they are generated, to keep it consistent.

topostats/io.py Outdated
}


def _hdf5_add_known_datatype(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is not unit tested. It is covered comprehensively though by test_dict_to_hdf5_and_hdf5_to_dict which tests turning dicts into hdf5 files and hdf5 files back to dicts. (Codecov helpfully pointed out that the block of code in _hdf5_add_known_datatypes() that handles lists was not covered so I fixed that, and also we know that codecov is able to detect coverage of that function.)

I started to write a test for this function but found that I was just duplicating the aforementioned test.

What do you think?

@SylviaWhittle SylviaWhittle marked this pull request as ready for review February 6, 2024 01:57
Copy link
Collaborator

@ns-rse ns-rse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good for getting the trace data into the HDF5 files.

Various comments in-line, some pedantry about variable names I'm afraid.

I think once this is in place we can then write a function that takes a dictionary or loads a .topofile, extracts the heights and writes them to JSON.

One thing we should probably do soon though is go through carefully and move what we can to topofileformats.

tests/test_io.py Show resolved Hide resolved
tests/test_io.py Outdated Show resolved Hide resolved
tests/test_io.py Outdated
),
],
)
def test_dict_to_hdf5_and_hdf5_to_dict(tmp_path: Path, input_dict: dict, group_path: str, expected: dict) -> None:
Copy link
Collaborator

@ns-rse ns-rse Feb 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to favour keeping tests focused on a single function and would be inclined to split these. To avoid duplication of parameters we can parametrize fixtures although I think I might have encountered a problem with pytest-lazyfixtures at some point in the last few months (EDIT : It was #787). I based the above article on this post which shows the other way of doing this.

I can understand why its written as so as it takes dict > hdf5 > dict and checks the result is the same as the original (in which case we only really need to define a single dict. But if the final assert() fails there is no way of knowing which of the two functions has failed.

Writing dict > hdf5 and then loading with plain h5py would allow a more direct test of the dict_to_hdf5 perhaps (I appreciate this may result in simply writing in the tests code that is in hdf5_to_dict() and so its a bit of a 🐔 and 🥚 argument).

tests/test_io.py Outdated

assert hdf5_file_keys == [
# Load the saved .topostats file using LoadScans
loadscans = LoadScans([tmp_path / "topostats_file_test.topostats"], channel="")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would it be useful to move this out to a fixture?

tests/test_io.py Outdated
# Load the saved .topostats file using LoadScans
loadscans = LoadScans([tmp_path / "topostats_file_test.topostats"], channel="")
loadscans.get_data()
read_topostats_file_data_dict = loadscans.img_dict["topostats_file_test"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
read_topostats_file_data_dict = loadscans.img_dict["topostats_file_test"]
topostats_data = loadscans.img_dict["topostats_file_test"]

Its the result of having read_ and putting the type information in a variable name is a Python anti-pattern.

We're on top of it now but at some point I really want to go through and ensure typehints are correct and introduce mypy to the pre-commit checks to help with this (see #516 and #721).

# Flip every labelled region to be 1 instead of its label
cropped_masks = [np.where(grain == 0, 0, 1) for grain in cropped_masks]
return (cropped_images, cropped_masks)
cropped_masks_dict = {index: np.where(grain == 0, 0, 1) for index, grain in cropped_masks_dict.items()}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, types in variable names are a Python anti-pattern, same is true for cropped_image_dict.

topostats/tracing/dnatracing.py Outdated Show resolved Hide resolved
topostats/io.py Outdated
@@ -32,6 +33,48 @@
# pylint: disable=too-many-lines


def dict_almost_equal(dict1, dict2, abs_tol=1e-9):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be inclined to place this in another module as its not really concerned with input/output.

It looks like its only used in the tests so I wonder if its worth defining this in the test file? 🤔

I'll have a search and see if there are recommendations. I'd be surprised if there isn't already a solution to comparing dictionaries.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are solutions to comparing dictionaries but not recursively, with numpy arrays and allowing a tolerance as far as I can tell.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I found similar classes/functions mentioned in various places.

I think as this is solely for testing it perhaps shouldn't be in topostats.io but defined in the test files where its compared. Not sure if that is good practice or if it could be incorporated in some other manner.

topostats/io.py Outdated Show resolved Hide resolved
topostats/io.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@ns-rse ns-rse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Epic work @SylviaWhittle very comprehensive tests.

I had one thought, noted in line, that we could perhaps address in the future but this looks good to go.

I would suggest after this merged a release (v2.2.1) could then be made.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding in some many tests here, really useful 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very comprehensive tests here, thank you 👍

image_spline_trace = tracing_results["image_spline_trace"]
tracing_stats[direction]["threshold"] = direction

grain_trace_data[direction] = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR but I wonder if we could simplify this in the future and have dnatracing.trace_image() return the dictionary that we want/need rather than cobbling it together here. 🤔

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree!

Copy link
Collaborator

@ns-rse ns-rse left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Epic work @SylviaWhittle very comprehensive tests.

I had one thought, noted in line, that we could perhaps address in the future but this looks good to go.

I would suggest after this merged a release (v2.2.1) could then be made.

@SylviaWhittle SylviaWhittle added this pull request to the merge queue Feb 12, 2024
Merged via the queue into main with commit 62e69c6 Feb 12, 2024
13 checks passed
@SylviaWhittle SylviaWhittle deleted the SylviaWhittle/trace_stats_HDF5 branch February 12, 2024 11:14
ns-rse added a commit that referenced this pull request Feb 20, 2024
Closes #802

Arrays and data are now saved in HDF5 files (#790) and so `.npy` arrays are somewhat redundant. This PR leaves the
`io.save_array()` function in place should interactive use require saving of arrays but removes its use from
`topostats.processing.run_filters()` so that the `.npy` files are no longer saved to disk.

Users wishing to access processed data should load it from the HDF5 formatted `.topostats` files that are saved during
processing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Molecule trace x, y, z coordinates
2 participants