Save grain trace data to HDF5 format #790

SylviaWhittle · 2024-01-31T12:03:37Z

Closes #488

This PR replaces #695. Instead of saving the trace data to JSON, we have decided to save it to the .topostats file in HDF5 format.

Experimentalists have requested that the grain trace data be accessible and readable in plain text format if possible. HDF5 is not plain text, so we will add functions later on to make the data easier to work with, including a JSON dump function if absolutely necessary.

codecov · 2024-01-31T12:25:26Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8bfa700) 84.36% compared to head (4ae1b70) 84.73%.
Report is 21 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #790      +/-   ##
==========================================
+ Coverage   84.36%   84.73%   +0.36%     
==========================================
  Files          21       21              
  Lines        3134     3196      +62     
==========================================
+ Hits         2644     2708      +64     
+ Misses        490      488       -2

topostats/io.py

ns-rse

Some thoughts from an hour looking over and thinking about it, not all are fully formed I'm afraid.

Generally I think that if we take some time to use NumPy/SciPy functions and take advantage of vectorisation it will pay off in the long run.

A useful reference I came across on this topic is From Python to Numpy by Nicolas Rougier.
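As a rough illustration of the kind of vectorisation meant here, the sketch below computes cumulative distances along an ordered trace twice: once with an explicit loop and once with NumPy array operations. The function names and the example trace are hypothetical and are not taken from dnatracing.py.

```python
import numpy as np


def cumulative_distances_loop(coords: np.ndarray) -> np.ndarray:
    """Cumulative distance along an ordered trace using an explicit Python loop."""
    distances = [0.0]
    for i in range(1, len(coords)):
        # Euclidean distance between consecutive coordinates
        step = np.sqrt(((coords[i] - coords[i - 1]) ** 2).sum())
        distances.append(distances[-1] + step)
    return np.array(distances)


def cumulative_distances_vectorised(coords: np.ndarray) -> np.ndarray:
    """Same result, but the per-step arithmetic is done by NumPy in one pass."""
    # np.diff gives the step vectors, np.linalg.norm their lengths
    steps = np.linalg.norm(np.diff(coords, axis=0), axis=1)
    # The first point is at distance zero from itself
    return np.concatenate(([0.0], np.cumsum(steps)))


# Hypothetical ordered trace: one vertical step, one diagonal step, one horizontal step
trace = np.array([[0, 0], [0, 1], [1, 2], [2, 2]])
```

Both functions should agree on any ordered array of coordinates; the vectorised version avoids the per-element Python overhead, which matters most for long traces.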

topostats/tracing/dnatracing.py

ns-rse

Thanks for addressing these suggestions so quickly.

I've suggested a practice for labelling parameters in tests that I recently discovered and have adopted. I think it would be useful to use it going forward.
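A minimal sketch of what labelled parameters can look like, assuming the practice referred to is wrapping each case in pytest.param() with an id (a later commit in this PR turns the test params into pytest.param() statements). The helper step_lengths and the test cases are hypothetical.

```python
import numpy as np
import pytest


def step_lengths(coords: np.ndarray) -> np.ndarray:
    """Hypothetical helper: distance between consecutive coordinates."""
    return np.linalg.norm(np.diff(coords, axis=0), axis=1)


@pytest.mark.parametrize(
    ("coords", "expected"),
    [
        pytest.param(np.array([[0, 0], [0, 1], [0, 2]]), np.array([1.0, 1.0]), id="straight line"),
        pytest.param(np.array([[0, 0], [1, 1]]), np.array([np.sqrt(2)]), id="single diagonal step"),
    ],
)
def test_step_lengths(coords: np.ndarray, expected: np.ndarray) -> None:
    """Each case is reported under its id rather than an auto-generated one."""
    np.testing.assert_allclose(step_lengths(coords), expected)
```

With ids in place, a failing case shows up in the pytest output under a readable name such as test_step_lengths[straight line], and a single case can be selected with the -k option.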

tests/tracing/test_dnatracing_single_grain.py

topostats/tracing/dnatracing.py

SylviaWhittle · 2024-02-06T00:22:15Z

tests/test_io.py

@@ -438,14 +441,17 @@ def test_load_scan_topostats(load_scan_topostats: LoadScans) -> None:
    image, px_to_nm_scaling = load_scan_topostats.load_topostats()
    grain_masks = load_scan_topostats.grain_masks
    above_grain_mask = grain_masks["above"]
+    grain_trace_data = load_scan_topostats.grain_trace_data


These lines were updated after re-creating the demo .topostats file (file.topostats). We are changing part of how these files are generated, so the demo file was regenerated to keep it consistent.

SylviaWhittle · 2024-02-06T01:55:56Z

topostats/io.py

        }


+def _hdf5_add_known_datatype(


This function is not unit tested directly. It is, however, covered comprehensively by test_dict_to_hdf5_and_hdf5_to_dict, which tests turning dicts into HDF5 files and HDF5 files back into dicts. (Codecov helpfully pointed out that the block of code in _hdf5_add_known_datatype() that handles lists was not covered, so I fixed that; it also shows that Codecov is able to detect coverage of that function.)

I started to write a test for this function but found that I was just duplicating the aforementioned test.

What do you think?

ns-rse

This looks good for getting the trace data into the HDF5 files.

Various comments in-line, some pedantry about variable names I'm afraid.

I think once this is in place we can then write a function that takes a dictionary or loads a .topostats file, extracts the heights and writes them to JSON.

One thing we should probably do soon though is go through carefully and move what we can to topofileformats.

tests/test_io.py

ns-rse · 2024-02-06T15:03:34Z

tests/test_io.py

+        ),
+    ],
+)
+def test_dict_to_hdf5_and_hdf5_to_dict(tmp_path: Path, input_dict: dict, group_path: str, expected: dict) -> None:


I tend to favour keeping tests focused on a single function and would be inclined to split these. To avoid duplication of parameters we can parametrize fixtures although I think I might have encountered a problem with pytest-lazyfixtures at some point in the last few months (EDIT : It was #787). I based the above article on this post which shows the other way of doing this.

I can understand why it's written this way, as it takes dict > hdf5 > dict and checks the result is the same as the original (in which case we only really need to define a single dict). But if the final assert() fails there is no way of knowing which of the two functions has failed.

Writing dict > hdf5 and then loading with plain h5py would perhaps allow a more direct test of dict_to_hdf5 (I appreciate this may result in simply rewriting in the tests the code that is in hdf5_to_dict(), and so it's a bit of a 🐔 and 🥚 argument).
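A sketch of what such a direct test could look like. The writer write_dict_to_hdf5 below is a hypothetical stand-in, not the dict_to_hdf5 from topostats/io.py (whose signature is not shown here); the point is only that the assertions read the file back with plain h5py rather than with a reader function from the package.

```python
from pathlib import Path

import h5py
import numpy as np


def write_dict_to_hdf5(group: h5py.Group, data: dict) -> None:
    """Hypothetical stand-in for the function under test: recursively write a dict."""
    for key, value in data.items():
        if isinstance(value, dict):
            # Nested dictionaries become HDF5 groups
            write_dict_to_hdf5(group.create_group(str(key)), value)
        else:
            # Lists are converted to arrays so that h5py can store them as datasets
            group.create_dataset(str(key), data=np.asarray(value))


def test_write_dict_to_hdf5(tmp_path: Path) -> None:
    """Check the written file with plain h5py, without calling any reader function."""
    data = {"a": 1, "nested": {"b": [1.0, 2.0], "c": np.array([[1, 2], [3, 4]])}}
    outfile = tmp_path / "sketch.hdf5"
    with h5py.File(outfile, "w") as f:
        write_dict_to_hdf5(f, data)
    with h5py.File(outfile, "r") as f:
        assert sorted(f.keys()) == ["a", "nested"]
        assert f["a"][()] == 1
        np.testing.assert_array_equal(f["nested/b"][:], [1.0, 2.0])
        np.testing.assert_array_equal(f["nested/c"][:], [[1, 2], [3, 4]])
```

If this test fails, the fault lies in the writer alone, which is the separation the round-trip test cannot give.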

ns-rse · 2024-02-06T15:04:51Z

tests/test_io.py

-
-    assert hdf5_file_keys == [
+    # Load the saved .topostats file using LoadScans
+    loadscans = LoadScans([tmp_path / "topostats_file_test.topostats"], channel="")


Do you think it would be useful to move this out to a fixture?

ns-rse · 2024-02-06T15:06:37Z

tests/test_io.py

+    # Load the saved .topostats file using LoadScans
+    loadscans = LoadScans([tmp_path / "topostats_file_test.topostats"], channel="")
+    loadscans.get_data()
+    read_topostats_file_data_dict = loadscans.img_dict["topostats_file_test"]


Suggested change

-    read_topostats_file_data_dict = loadscans.img_dict["topostats_file_test"]
+    topostats_data = loadscans.img_dict["topostats_file_test"]

The read_ prefix is redundant, since the variable is already the result of having read the file, and putting the type information in a variable name is a Python anti-pattern.

We're on top of it now but at some point I really want to go through and ensure typehints are correct and introduce mypy to the pre-commit checks to help with this (see #516 and #721).

ns-rse · 2024-02-06T15:11:27Z

topostats/tracing/dnatracing.py

    # Flip every labelled region to be 1 instead of its label
-    cropped_masks = [np.where(grain == 0, 0, 1) for grain in cropped_masks]
-    return (cropped_images, cropped_masks)
+    cropped_masks_dict = {index: np.where(grain == 0, 0, 1) for index, grain in cropped_masks_dict.items()}


As mentioned above, types in variable names are a Python anti-pattern, same is true for cropped_image_dict.

topostats/tracing/dnatracing.py

ns-rse · 2024-02-06T15:48:00Z

topostats/io.py

@@ -32,6 +33,48 @@
 # pylint: disable=too-many-lines


+def dict_almost_equal(dict1, dict2, abs_tol=1e-9):


I'd be inclined to place this in another module as it's not really concerned with input/output.

It looks like it's only used in the tests, so I wonder if it's worth defining this in the test file? 🤔

I'll have a search and see if there are recommendations. I'd be surprised if there isn't already a solution to comparing dictionaries.

There are solutions to comparing dictionaries but not recursively, with numpy arrays and allowing a tolerance as far as I can tell.

Yes, I found similar classes/functions mentioned in various places.

I think as this is solely for testing it perhaps shouldn't be in topostats.io but defined in the test files where it is used. Not sure if that is good practice or if it could be incorporated in some other manner.
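For reference, a minimal sketch of a recursive comparison of the kind discussed in this thread: nested dictionaries, NumPy arrays and an absolute tolerance. It is not the dict_almost_equal added in this PR; the name nested_dicts_close is hypothetical, and the sketch assumes any arrays hold numeric data.

```python
import numpy as np


def nested_dicts_close(dict1: dict, dict2: dict, abs_tol: float = 1e-9) -> bool:
    """Recursively compare two dictionaries, allowing a tolerance on numeric values."""
    numeric = (int, float, np.number, np.ndarray)
    if dict1.keys() != dict2.keys():
        return False
    for key, value1 in dict1.items():
        value2 = dict2[key]
        if isinstance(value1, dict) and isinstance(value2, dict):
            # Recurse into nested dictionaries
            if not nested_dicts_close(value1, value2, abs_tol=abs_tol):
                return False
        elif isinstance(value1, numeric) and isinstance(value2, numeric):
            # Shapes must match first, otherwise broadcasting could hide a mismatch
            if np.shape(value1) != np.shape(value2):
                return False
            if not np.allclose(value1, value2, rtol=0.0, atol=abs_tol):
                return False
        elif isinstance(value1, np.ndarray) or isinstance(value2, np.ndarray):
            # An array compared with a non-numeric value can never match
            return False
        elif value1 != value2:
            return False
    return True
```

Defined in a test module or a shared test helper, a function like this would keep topostats.io free of test-only code.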

topostats/io.py

ns-rse

Epic work @SylviaWhittle, very comprehensive tests.

I had one thought, noted in-line, that we could perhaps address in the future, but this looks good to go.

I would suggest that after this is merged a release (v2.2.1) could then be made.

ns-rse · 2024-02-12T11:02:16Z

tests/tracing/test_dnatracing_single_grain.py

Thanks for adding in so many tests here, really useful 👍

ns-rse · 2024-02-12T11:02:54Z

tests/test_io.py

Very comprehensive tests here, thank you 👍

ns-rse · 2024-02-12T11:07:31Z

topostats/processing.py

                image_spline_trace = tracing_results["image_spline_trace"]
                tracing_stats[direction]["threshold"] = direction

+                grain_trace_data[direction] = {


Not for this PR but I wonder if we could simplify this in the future and have dnatracing.trace_image() return the dictionary that we want/need rather than cobbling it together here. 🤔

Closes #802 Arrays and data are now saved in HDF5 files (#790) and so `.npy` arrays are somewhat redundant. This PR leaves the `io.save_array()` function in place should interactive use require saving of arrays but removes its use from `topostats.processing.run_filters()` so that the `.npy` files are no longer saved to disk. Users wishing to access processed data should load it from the HDF5 formatted `.topostats` files that are saved during processing.

SylviaWhittle added 4 commits January 30, 2024 17:51

Add recurside_dict_to_hdf5 to io.py in lieu of manual saving

83baa6e

Update processing to handle grain trace data and put it in topostats object

2101cb0

Add functions to get trace heights and cumulative distances

2d77e73

Fix tests broken by turning ordered traces to a dictionary

f8af70a

SylviaWhittle self-assigned this Jan 31, 2024

SylviaWhittle added 2 commits January 31, 2024 12:16

Remove unsupported use of pipe operator in python 3.9

2077127

Remove unsupported use of pipe operator in python 3.9

0ccfc17

SylviaWhittle requested a review from ns-rse January 31, 2024 12:25

ns-rse reviewed Jan 31, 2024

topostats/io.py (resolved)

ns-rse reviewed Jan 31, 2024

topostats/tracing/dnatracing.py (four outdated threads, resolved)

SylviaWhittle and others added 4 commits January 31, 2024 14:49

Improve description of get_ordered_trace_heights

1b6375f

Co-authored-by: Neil Shephard <[email protected]>

Use sqrt(2) instead of hardcoded value

6ef1a2e

Remove manual n_grain as a grain indexer, instead use enumeration index

4fd477f

Add tests for coord_dist

15c30bd

ns-rse requested changes Feb 2, 2024

SylviaWhittle added 13 commits February 2, 2024 23:45

Add function to convert hdf5 loaded data to dictionary

0684400

Add single grain dnatracing tests for height traces and cumulative distances

0be859f

Vectorise cumulative trace script

547a65d

Add logging statements for loading and support grain trace data

d364199

Revert to using image instead of flattened_image in topostats files for backwards compat

04f9295

Update topostats example file for loading and update existing tests

19b9b2c

Improve topostats file save and load test

93019c7

Add regression test for .topostats file

80ffe5c

fix error in order of lists being incorrect for assertion

023dd90

Add function to determine if two dictionaries are almost equal

eab9e50

Add logging statements for dict comparison

b9585dc

More logging for dict comparison

7ccded0

Add more logging and ignore img_path in regtest of .topostats file in processing both

ac20c74

SylviaWhittle commented Feb 6, 2024

SylviaWhittle added 4 commits February 6, 2024 01:23

Add more test cases for test_dict_almost_equal

7aff978

Fix lists not being saved as numpy arrays to hdf5 and add test case for it

708af75

Add docstrings to io.py

4c074cb

Turn test params into pytest.param() statements

9329dd0

SylviaWhittle commented Feb 6, 2024

SylviaWhittle marked this pull request as ready for review February 6, 2024 01:57

SylviaWhittle requested a review from ns-rse February 6, 2024 01:57

ns-rse requested changes Feb 6, 2024

SylviaWhittle and others added 12 commits February 7, 2024 12:32

Fix type hint

ee8c899

Co-authored-by: Neil Shephard <[email protected]>

Fix parameter passing to dict_to_hdf5

b242b13

Merge branch 'SylviaWhittle/trace_stats_HDF5' of github.com:AFM-SPM/TopoStats into SylviaWhittle/trace_stats_HDF5

92c9ddb

Add fixture for topostats test file rather than loading it in tests in io.py

00cdc13

Break out dict_to_hdf5 tests into unique tests

935a877

Break out hdf5_to_dict tests into unique tests

5d0f065

Move dict_almost_equal to test_io.py

a93a50d

Use a fixture for test_save_and_load_topostats_file

aaaecb4

Unsplit dict_to_hdf5

540a2e1

Re-import dict_almost_equal from test_io

643deb7

Rename read_topostats_file_data_dict to topostats_data

239cce3

Avoid type in variable name antipattern

4ae1b70

SylviaWhittle requested a review from ns-rse February 9, 2024 16:14

ns-rse approved these changes Feb 12, 2024

SylviaWhittle added this pull request to the merge queue Feb 12, 2024

Merged via the queue into main with commit 62e69c6 Feb 12, 2024
13 checks passed

SylviaWhittle deleted the SylviaWhittle/trace_stats_HDF5 branch February 12, 2024 11:14

ns-rse mentioned this pull request Feb 12, 2024

[feature] : Better Tracing & Skeletonisation Merger #800

Closed

30 tasks

ns-rse mentioned this pull request Feb 20, 2024

Remove saving of gaussian filtered arrays to .npy files #804

Merged
