Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests(pruning): Tests TopoStatsPrune and convPrune #848

Conversation

ns-rse
Copy link
Collaborator

@ns-rse ns-rse commented May 29, 2024

  • Adds a smaller (30x30) test skeleton for testing pruning, this makes it easier to work with, however, this does raise a question as the resulting skeleton, generated by skimage.morphology.skeletonize() does have a T-shaped junction which it has been suggested never arise (more on this below).

  • Tests for TopoStatsPrune and convPrune attempt to demonstrate the effect of varying parameters has on pruning to which end the following parameters are varied...

    • max_length
    • height_threshold
    • method_values
    • method_outliers

There are some issues identified so far...

convPrune

As noted in the reason for the first paramterised test failing which arises when max_height = None (since int can not be compared to None) I think there is a problem with the default height being used here.

Line 454 tries to set the default max_branch_length to be 15% of the total number of elements in the skeleton array. self.skeleton.size counts the number of elements in an array (and because we are working with 2D images this will always be the product of the number of rows and columns). But the length of this will always be 1 since the total number of elements in the skeleton will always be an integer. Thus if max_length == -1 then the max_branch_length will always be 0 since we would always only ever get the number of elements in the array. That said in this dummy example we don't even get that, instead we get an error raised

max_length = -1
simple_array = np.asarray([[0, 0], [0, 0], [0, 0]])
total_points = simple_array.size
max_branch_length = max_length if max_length != -1 else int(len(total_points) * 0.15)

TypeError: object of type 'int' has no len()

Its also unclear why the convPrune._prune_by_length() method takes the argument max_length when its an attribute of the class which could be used instead.

This doesn't happen in the topostatsPrune class because the length of the molecule is based on the co-ordinates of the skeleton rather than the size of the array.

Excessive Complexity/Code duplication

The splitting of functionality here into multiple classes means there is some code duplication. Both topostatsPrune and convPrune have methods to _prune_by_length() conditionally based on the value of self.max_length not being None but they do so in different manners. Both methods then go on to called heightPruning() in the same manner.

The two classes could be combined into one with an option of whether to use the original length based pruning or the convolutional approach and call the appropriate method (renaming the method from convPrune to prune_length_by_conv() and the one from topostatsPrune to prune_length_by_neighbours(). In turn this would mean that there is no need to have the pruneSkeleton() factory class (and having that as a class in itself seems like overkill when a series of functions would suffice and address the complaints from Pylint on too-few-public-methods too).

Handling multiple grains

The refactoring done previously in #600 removed the loops from every method/function in the tracing module so that it only handled a single grain. The looping over of multiple grains is handled by code in process_scan.py, this means the prune_all_skeletons() methods can be removed too, further simplifying the code base.

These will be simple to remove in due course once we have robust tests in place.

T-shaped junctions

Previous work in #835 originally tested whether pruning.rm_nibs() "Remove single pixel branches (nibs) not identified by nearest neighbour algorithsm as there may be >2 neighbours" and the current tests show that these are still left by the pruning methods being tested, even though the last step is to re-run the plain skeletonisation method from skimage (Zhang's).

The following is a reproducible example that shows such a nib remains after pruning

import numpy as np
import matplotlib.pyplot as plt
from skimage import draw, filters
from skimage.morphology import skeletonize

def _generate_heights(skeleton: npt.NDArray, scale: float = 100, sigma: float = 5.0, cval: float = 20.0) -> npt.NDArray:
    """Generate heights from skeletons by scaling image and applying Gaussian blurring.

    Uses scikit-image 'skimage.filters.gaussian()' to generate heights from skeletons.

    Parameters
    ----------
    skeleton : npt.NDArray
        Binary array of skeleton.
    scale : float
        Factor to scale heights by. Boolean arrays are 0/1 and so the factor will be the height of the skeleton ridge.
    sigma : float
        Standard deviation for Gaussian kernel passed to `skimage.filters.gaussian()'.
    cval : float
        Value to fill past edges of input, passed to `skimage.filters.gaussian()'.

    Returns
    -------
    npt.NDArray
        Array with heights of image based on skeleton which will be the backbone and target.
    """
    return filters.gaussian(skeleton * scale, sigma=sigma, cval=cval)

def _generate_random_skeleton(**extra_kwargs):
    """Generate random skeletons and heights using skimage.draw's random_shapes()."""
    kwargs = {
        "image_shape": (128, 128),
        "max_shapes": 20,
        "channel_axis": None,
        "shape": None,
        "allow_overlap": True,
    }
    # kwargs.update
    heights = {"scale": 100, "sigma": 5.0, "cval": 20.0}
    kwargs = {**kwargs, **extra_kwargs}
    random_image, _ = draw.random_shapes(**kwargs)
    mask = random_image != 255
    skeleton = skeletonize(mask)
    return {"original": mask, "img": _generate_heights(skeleton, **heights), "skeleton": skeleton}

def pruned_plot(gen_shape: dict) -> None:
    """Plot the original skeleton, its derived height and the pruned skeleton."""
    img_skeleton = gen_shape
    pruned = topostatsPrune(
        img_skeleton["img"],
        img_skeleton["skeleton"],
        max_length=-1,
        height_threshold=90,
        method_values="min",
        method_outlier="abs",
    )
    pruned_skeleton = pruned._prune_by_length(pruned.skeleton, pruned.max_length)
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
    ax1.imshow(img_skeleton["original"])
    ax1.set_title("Original mask")
    ax2.imshow(img_skeleton["skeleton"])
    ax2.set_title("Skeleton")
    ax3.imshow(img_skeleton["img"])
    ax3.set_title("Gaussian Blurring")
    ax4.imshow(pruned_skeleton)
    ax4.set_title("Pruned Skeleton")
    plt.show()

def pruning_skeleton() -> dict:
    """Smaller skeleton for testing parameters of prune_all_skeletons()."""
    return _generate_random_skeleton(rng=69432138, min_size=15, image_shape=(30, 30))

pruned_skeleton(pruning_skeleton())

The paramterised test with id="Length pruning enabled (10), removes two ends leaving nibs on both." includes the resulting array after pruning and there is a two-pronged fork of nibs when using convPrune().

I think we need to ensure that remove_nibs() can handle such T-junctions as well as this dummy example demonstrates they are not correctly removed/pruned.

+ Adds a smaller (30x30) test skeleton for testing pruning, this makes it easier to work with, however, this does raise
  a question as the resulting skeleton, generated by `skimage.morphology.skeletonize()` does have a T-shaped junction
  which it has been suggested never arise (more on this below).
+ Tests for `TopoStatsPrune` and `convPrune` attempt to demonstrate the effect of varying parameters has on pruning to
  which end the following parameters are varied...

  + `max_length`
  + `height_threshold`
  + `method_values`
  + `method_outliers`

There are some issues identified so far...

`convPrune`

As noted in the `reason` for the first paramterised test failing which arises when `max_height = None` (since `int` can
not be compared to `None`) I think there is a problem with the default height being used here.

Line 454 tries to set the default max_branch_length to be 15% of the total number of elements in the skeleton
array. `self.skeleton.size` and counts the number of elements in an array (and because we are working with 2D images
this will always be the product of the number of rows and columns). But the length of this will _always_ be `1` since
the total number of elements in the skeleton will always be an integer. Thus if `max_length == -1` then the
`max_branch_length` will always be 0!

That said in this dummy example we don't even get that, instead we get an error raised

```
max_length = -1
simple_array = np.asarray([[0, 0], [0, 0], [0, 0]])
total_points = simple_array.size
max_branch_length = max_length if max_length != -1 else int(len(total_points) * 0.15)\

TypeError: object of type 'int' has no len()
```

Its also unclear why the `convPrune._prune_by_length()` mtehod takes the argument `max_length` when its an attribute of
the class which could be used instead.

This doesn't happen in the `topostatsPrune` class because the length of the molecule is based on the co-ordinates of the
skeleton rather than the size of the array.

Excessive Complexity/Code duplication

The splitting of functionality here into multiple classes means there is some code duplication. Both `topostatsPrune`
and `convPrune` have methods to `_prune_by_length()` conditionally based on the value of `self.max_length` not being
`None` but they do so in different manners. Both methods then go on to called `heightPruning()` in the same manner.

The two classes could be combined into one with an option of whether to use the original length based pruning or the
convolutional approach and call the appropriate method (renaming the method from `convPrune`  to
`prune_length_by_conv()` and the one from `topostatsPrune` to `prune_length_by_neighbours()`. In turn this would mean
that there is no need to have the `pruneSkeleton()` factory class (and having that as a class in itself seems like
overkill when a series of functions would suffice and address the complaints from Pylint on `too-few-public-methods`
too).

Handling multiple grains

The refactoring done previously in #600 removed the loops from every method/function in the tracing module so that it
only handled a single grain. The looping over of multiple grains is handled by code in `process_scan.py`, this means the
`prune_all_skeletons()` methods can be removed too, further simplifying the code base.

T-shaped junctions

Previous work in #835 originally tested whether `pruning.rm_nibs()` "Remove single pixel branches (nibs) not identified
by nearest neighbour algorithsm as there may be >2 neighbours" and the current tests show that these are still left by
the pruning methods being tested, even though the last step is to re-run the plain skeletonisation method from
`skimage` (Zhang's).

The following is a reproducible example that shows such a nib remains after pruning

```python

import numpy as np
import matplotlib.pyplot as plt
from skimage import draw, filters
from skimage.morphology import skeletonize

def _generate_heights(skeleton: npt.NDArray, scale: float = 100, sigma: float = 5.0, cval: float = 20.0) -> npt.NDArray:
    """Generate heights from skeletons by scaling image and applying Gaussian blurring.

    Uses scikit-image 'skimage.filters.gaussian()' to generate heights from skeletons.

    Parameters
    ----------
    skeleton : npt.NDArray
        Binary array of skeleton.
    scale : float
        Factor to scale heights by. Boolean arrays are 0/1 and so the factor will be the height of the skeleton ridge.
    sigma : float
        Standard deviation for Gaussian kernel passed to `skimage.filters.gaussian()'.
    cval : float
        Value to fill past edges of input, passed to `skimage.filters.gaussian()'.

    Returns
    -------
    npt.NDArray
        Array with heights of image based on skeleton which will be the backbone and target.
    """
    return filters.gaussian(skeleton * scale, sigma=sigma, cval=cval)

def _generate_random_skeleton(**extra_kwargs):
    """Generate random skeletons and heights using skimage.draw's random_shapes()."""
    kwargs = {
        "image_shape": (128, 128),
        "max_shapes": 20,
        "channel_axis": None,
        "shape": None,
        "allow_overlap": True,
    }
    # kwargs.update
    heights = {"scale": 100, "sigma": 5.0, "cval": 20.0}
    kwargs = {**kwargs, **extra_kwargs}
    random_image, _ = draw.random_shapes(**kwargs)
    mask = random_image != 255
    skeleton = skeletonize(mask)
    return {"original": mask, "img": _generate_heights(skeleton, **heights), "skeleton": skeleton}

def pruned_plot(gen_shape: dict) -> None:
    """Plot the original skeleton, its derived height and the pruned skeleton."""
    img_skeleton = gen_shape
    pruned = topostatsPrune(
        img_skeleton["img"],
        img_skeleton["skeleton"],
        max_length=-1,
        height_threshold=90,
        method_values="min",
        method_outlier="abs",
    )
    pruned_skeleton = pruned._prune_by_length(pruned.skeleton, pruned.max_length)
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2)
    ax1.imshow(img_skeleton["original"])
    ax1.set_title("Original mask")
    ax2.imshow(img_skeleton["skeleton"])
    ax2.set_title("Skeleton")
    ax3.imshow(img_skeleton["img"])
    ax3.set_title("Gaussian Blurring")
    ax4.imshow(pruned_skeleton)
    ax4.set_title("Pruned Skeleton")
    plt.show()

def pruning_skeleton() -> dict:
    """Smaller skeleton for testing parameters of prune_all_skeletons()."""
    return _generate_random_skeleton(rng=69432138, min_size=15, image_shape=(30, 30))

pruned_skeleton(pruning_skeleton())
```

The paramterised test with `id="Length pruning enabled (10), removes two ends leaving nibs on both."` includes the
resulting array after pruning and there is a two-prunged fork of nibs when using `convPrune()`.

I think it would be useful if `remove_nibs()` were able to handle such T-junctions as well.
@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 3, 2024

From @MaxGamill-Sheffield

Essentially convPrune was a way to use convolutions to make the pruning more intuitive and smaller code-wise compared to topostats prune - a side-by-side refactor if you will but I can’t remember if I ever finished it, or if it’s useful / upto date any more

In light of this I will undertake the following...

  • Remove convPrune class, it likely isn't used and as noted above possibly has an error in the _prune_by_length() method where the length of the molecule is calculated because the size of the whole 2D binary skeleton array is taken (in contrast to topostatsPrune._prune_by_length() which uses len(coordinates) of the skeleton).
  • Remove pruneSkeleton class and simplify to a simple set of functions implementing a "factory pattern" with a single factory method which calls topostatsPrune(), this leaves it in a state where extension will be possible if alternative pruning methods are to be implemented.
  • Remove tests that have been written for convPrune (under tests/tracing/test_pruning.py::TestConvPrune).

+ Remove convPrune class.
+ Removes static methods, this makes intermediary steps/data available and avoids creating and passing around objects as
  they are instead class attributes.
+ Remove `remove_bridges()` method of `heightPrune` class as its identical to `height_prune()` method.
+ Splits the splitting and labelling of branches into a new static method (test yet to be written).
+ Remove associated/redundant unit-tests for removed classes and methods.
@ns-rse ns-rse force-pushed the ns-rse/818-tests-pruning-topostats-conv branch from c4fd2f5 to 119b0d0 Compare June 7, 2024 09:18
@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 7, 2024

@MaxGamill-Sheffield Its not just renaming of variables that is causing the current test suite to fail, after doing some digging and scratching my head a lot, working out that we now need to pass all of the new options to dnaTrace() class in and doing so from the default_config() I found that the pruning step undertaken by get_disordered_trace() doesn't return the same pruned objects as previously.

I've therefore marked a bunch of tests in topostats/tracing/test_tracing_single_grain.py to be skipped until we can work out what parameter values from the existing code base (where most are hard coded) need to be set in the current configuration. There are probably other tests that fail but I'm taking one step at a time.

Its compounded by the hassle of having to take a list of co-ordinates and revert them to a 2D array to plot (I've left a note in 119b0d0 with some hacky code of how to do this as much for my own reference when I return to this later next week).

If you have any time/capacity to start investigating this it would be really useful. I was under the mistaken impression that you might have been using the existing test suite as you were refactoring and ensured that these passed or were updated as required (that is one of the major benefits of tests) but that isn't the case so we'll have to work through this now.

Work on getting the test suite to work with the refactored code. Long way off working at the moment though.

+ Pass `default_config.yaml` into the `dnatrace_linear`/`dnatrace_circular` fixtures so we have all options in place.
+ Skip a bunch of tests that are contingent on the pruning having been done correctly, this will require tweaking of the
  `skeletonise_params` so that correctly pruned skeletons are returned.
+ `self.grain` > `self.mask` I found it confusing that the original image after smoothing was called
  `self.smoothed_grain` when it was heights and the mask was called `self.grain`. By having the mask named as such it
  avoids confusion and makes the attributes clearer.

A major hindrance to debugging this is when the skeleton is converted from an array to a list of co-ordinates that
define the location of points.

This makes it a bit of a pain to then plot and see what we have can be achieved with the following messy code...

```
original_skeleton_coords = np.asarray([[28, 47],
       [ 28,  48],
       [ 29,  46],
       [ 29,  48],
       [ 30,  47],
])
skeleton = np.zeros((original_skeleton_coords.max() + 2, original_skeleton_coords.max() + 2))
skeleton[original_skeleton_coords[:,0], original_skeleton_coords[:,1]] = 1
plt.imshow(skeleton)
plt.show()
```
@ns-rse ns-rse force-pushed the ns-rse/818-tests-pruning-topostats-conv branch from 119b0d0 to c1ebb61 Compare June 7, 2024 09:58
@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 12, 2024

I've been looking at why many of the current set of tests in tests/tracing/test_tracing_single_grain.py fail and its because the call to get_disordered_trace() does return the same shape skeleton.

Passing two sets of heights

We've already identified and corrected a problem with getSkeleton() which was mistakenly being passed the original image and instead of the mask the image after it had been passed through Gaussian filter (see 84d0896). The gaussian_filter() method is applied to the heights (self.image line 295) and this smoothed image self.smoothed_grain was being passed into getSkeleton() along with the original self.image (call starts on line 465).

I've found the use of grain img and mask to be somewhat inconsistent and a possible source of confusion and as well as correcting the above problem so that an img and mask are passed have sought to use mask wherever a binary array is expected and img wherever heights are expected and remove grain from the equation.

Pruning not as expected

The get_disordered_trace() still doesn't return the pruned skeletons we expect based on the existing tests in tests/tracing/test_tracing_single_grain.py::test_get_disordered_trace. This only asserts that the size of the arrays are as expected and that the start and end co-ordinates are the same. Tests fail on checking the length as longer skeletons are now returned by the get_disordered_trace() method.

This function in the current refactored code calls in order...

    def get_disordered_trace(self):
        """
        Derive the disordered trace coordinates from the binary mask and image via skeletonisation and pruning.
        """
        self.skeleton = getSkeleton(
            self.smoothed_grain,
            self.mask,
            method=self.skeletonisation_params["method"],
            height_bias=self.skeletonisation_params["height_bias"],
        ).get_skeleton()
        self.pruned_skeleton = prune_skeleton(self.smoothed_grain, self.skeleton, **self.pruning_params.copy())
        self.pruned_skeleton = self.remove_touching_edge(self.pruned_skeleton)
        self.disordered_trace = np.argwhere(self.pruned_skeleton == 1)

There are eight parameterised tests as there are four skeletonisation methods (topostats, zhang, lee and thin)
and two grains to be tested (circular and linear).

To investigate I've plotted the returned skeleton after pruning from the existing code on main and under this branch
(maxgamill-sheffield/800-better-tracing / ns-rse/818-tests-pruning-topostats-conv).

NB - In light of the artifacts in the images on this branch not being clearly binary (yellow) plots I've double checked that the skeletons returned on this branch are in fact binary arrays and they are.

Method Molecule Branch Image
topostats linear main
topostats linear maxgamill-sheffield/800-better-tracing
topostats circular main
topostats circular maxgamill-sheffield/800-better-tracing
zhang linear main
zhang linear maxgamill-sheffield/800-better-tracing
zhang circular main
zhang circular maxgamill-sheffield/800-better-tracing
lee linear main
lee linear maxgamill-sheffield/800-better-tracing
lee circular main
lee circular maxgamill-sheffield/800-better-tracing
thin linear main
thin linear maxgamill-sheffield/800-better-tracing
thin circular main
thin circular maxgamill-sheffield/800-better-tracing

The newly introduced pruning_params are passed in to the prune_skeleton() call and I've double checked that in my refactoring to remove the excessive complexity that the **kwargs are correctly passed through the factory method to the topostatsPrune class (and they are).

I suspect some of these parameters though are different from the hard-coded method that is used on main.

The branches on the skeletons from main when using zhang / lee / thin methods are to be expected since pruning was not applied (only the topostats method prunes as it is called from within the getSkeleton class. This is performed by pruneSkeleton which does so by setting max_branch_length to be 15% of the length_of_trace (hard coded). Refactoring makes this configurable but default value of -1 used in configuration triggers this default to be set.

Repeated Pruning

Looking at the getSkeleton() class on main the self.doSkeletonising() method is called.

This method (see here) repeatedly calls the pruneSkeleton() method until while self.pruning: which may be a source of differences under the topostats method.

In the current refactored code this repeated pruning doesn't appear to occur. Skeletonisation and pruning have been separated out and within get_disordered_trace() (post correction mentioned above so that self.mask is passed to getSkeleton) the steps are...

  1. getSkeleton(...).get_skeleton()
  2. A single call to prune_skeleton()
  3. remove_touching_edges()
  4. get skeleton co-ordinates via np.argwhere().

Perhaps the reason we're not seeing the same skeletons is because only a single round of pruning is being undertaken? 🤔

@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 18, 2024

Looked at this more this morning and have found that in the current refactored when get_disordered_trace() the repeated pruning of branches is still in place within the pruning.pruneSkeleton._prune_by_length() method (see line 251 where the while pruning: loop starts

However, within this loop the max_branch_length is recalculated on each iteration (see line 259). This means that as branches are pruned the max_branch_length itself is reduced.

In contrast the main branch in the topostats.tracing.getSkeleton.pruneSkeleton() method max_branch_length is calculated once before and outside of the (see line 780 and remains constant throughout the subsequent loops as its not recalculated.

I think this might be the reason some of the branches are not pruned when running the test suite on the refactored code (branch maxgamill-sheffield/800-better-tracing from which ns-rse/818-tests-pruning-topostats-conv was made).

I shall be investigating further today.

@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 18, 2024

Currently on maxgamill-sheffield/800-better-tracing and this branch ns-rse/818-tests-pruning-topostats-conv the max_branch_length is recalculated on each round of pruning.

In tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_test this means that on the first set of parameters the maximum allowed branch length (in terms of number of coordinates) reduces on each iteration and a total of three iterations are run...

INFO     topostats:pruning.py:282 [pruning] : Pruning by length.
INFO     topostats:pruning.py:341 [pruning] : Maximum branch length : 44
INFO     topostats:pruning.py:341 [pruning] : Maximum branch length : 29
INFO     topostats:pruning.py:341 [pruning] : Maximum branch length : 26

This isn't the behaviour in the current main branch (see above comment) and if we therefore calculate the max_branch_length once outside of the while pruning: loop we don't observe any reduction in the branch length but we still only see three iterations...

INFO     topostats:pruning.py:341 [pruning] : Maximum branch length : 45
INFO     topostats:pruning.py:341 [pruning] : Maximum branch length : 45
INFO     topostats:pruning.py:341 [pruning] : Maximum branch length : 45

Not sure why the length is one coordinate longer to start with 🫤

I can understand why the length might be reassessed after pruning so am unsure if this change is deliberate @MaxGamill-Sheffield

However, we still have not pruned all of the branches that the test expected to be pruned (perhaps as we only have three iterations still). The logic for controlling

@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 20, 2024

Looking deeper into this I've found that the initial skeletonisation using the TopoStats method is perhaps the underlying cause of problems we are seeing with pruning.

NB is a branch off of maxgamill-sheffield/800-better-tracing

Molecule Pruning Method main ns-rse/818-tests-pruning-topostats-conv
linear topostats main_topostats_linear refactored_topostats_linear
circular topostats main_topostats_circular refactored_topostats_circular
linear zhang main_zhang_linear refactored_zhang_linear
circular zhang main_zhang_circular refactored_zhang_circular
linear lee main_lee_linear refactored_lee_linear
circular lee main_lee_circular refactored_lee_circular
linear thin main_thin_linear refactored_thin_linear
circular thin main_thin_circular refactored_thin_circular

Okay this is strange, at the very least I would expect the Zhang/Lee/Thin methods to return the exact same initial skeletons, perhaps the masks that are being passed in are different.

Molecule Method main ns-rse/818-tests-pruning-topostats-conv
linear topostats main_topostats_linear_mask refactor_topostats_linear_mask
circular topostats main_topostats_circular_mask refactor_topostats_circular_mask
linear zhang main_zhang_linear_mask refactor_zhang_linear_mask
circular zhang main_zhang_circular_mask refactor_zhang_circular_mask
linear lee main_lee_linear_mask refactor_lee_linear_mask
circular lee main_lee_circular_mask refactor_lee_circular_mask
linear thin main_thin_linear_mask refactor_thin_linear_mask
circular thin main_thin_circular_mask refactor_thin_circular_mask

This shows that the masks that the refactored code are passing in to each of the skeletonisation methods are different and would go some way to explaining why we are seeing different disordered traces which are the result of Skeletonisation and Pruning^[1].

Now it is worth noting the discovery above that found that a Gaussian smoothed image of heights was being passed into getSkeleton() as the second positional argument which should be a mask (see line 465).

Currently this has been switched to self.mask but in light of the above and comparing to the main branch its clear that the Gaussian filtered image (of heights self.smoothed_grain) should be converted to a smoothed mask (self.smoothed_mask ?) for passing to getSkeleton.

On the main branch the step that does this can be found on line 272 where...

        smoothed_grain = ndimage.binary_dilation(self.grain, iterations=1).astype(self.grain.dtype)

...and smoothed_grain is passed as the second argument to getSkeleton() on line 282 (note the very_smoothed_grain that is generated does not appear to be used anywhere).

Thus we need to apply ndimage.binary_dilation(self.mask, iterations=1).astype(self.mask.dtype) to the mask that is passed into getSkeleton() in the refactored branch.

Somewhat confusingly there is the dnaTrace.smooth_grains() method which is new and not currently tested in the existing test suite (it is used in the trace_dna() method to get the smoothed image but it assigns a binary mask to the self.smoothed_image which is conflating images and masks again. The existing tests that are under investigation do not use this method, rather they use the dnaTrace.gaussian_filter() method...

    dnatrace.skeletonisation_params["method"] = skeletonisation_method
    dnatrace.gaussian_filter()
    dnatrace.get_disordered_trace()

This method is applied to the self.image i.e. heights, but again confusingly this smoothed image is saved to self.smoothed_grain which is ambiguous so I have renamed it to self.smoothed_image and will introduce a self.smoothed_mask which will hopefully remove the ambiguity around what a grain is (as it looks like its sometimes image heights and sometimes binary masks).

Another and perhaps simpler approach might be to have getSkeleton() take only a single image, that of heights, and have it always convert this to a binary mask prior to skeletonisation. This would be simpler and reduce the complexity of passing lots of things around that need to be kept in sync.

^[1] : On the main branch when the method is topostats skeletonisation and pruning are combined since pruning is called directly after skeletonisation. On the main branch the zhang / lee / thin methods do not call pruning methods. In contrast the refactored branches pruning has been separated out of the getSkeleton class and as such skeletonisation is under taken and pruning is then performed on the resulting skeleton, regardless of method (i.e. topostats / zhang / lee / thin skeletonisation are all followed by pruning).

@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 20, 2024

Having now tested the above and ensured that ndimage.binary_dilation() is applied to the mask and that this dilated mask is passed in the refactored code now passes the same mask into the skeletonisation methods.

Linear

main

main_topostats_linear_mask

ns-rse/818-tests-pruning-topostats-conv

refactored_linear_using_dilated_mask

Circular

main

main_zhang_circular_mask

ns-rse/818-tests-pruning-topostats-conv

refactored_circular_using_dilated_mask

This is a step closer, the tests fail as there are differences in the number of points in the disordered trace but for the zhang / lee / thin mtehods this is to be expected because pruning is now applied to the skeletons returned by each of these methods and as we can crudely see in the outuput below based on the failed tests the number of points has reduced for these methods. What we need to investigate now is why the topostats method also reduces the number of points.

====================================================== short test summary info ======================================================
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[linear molecule, skeletonise topostats] - assert 125 == 120
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[circular molecule, skeletonise topostats] - assert 143 == 150
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[linear molecule, skeletonise zhang] - assert 122 == 170
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[circular molecule, skeletonise zhang] - assert 149 == 184
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[circular molecule, skeletonise lee] - assert 151 == 177
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[linear molecule, skeletonise thin] - assert 118 == 187
FAILED tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace[circular molecule, skeletonise thin] - assert 175 == 190
============================================== 7 failed, 1 passed in 224.82s (0:03:44) ==============================================

@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 20, 2024

Looking in detail at the final skeletons that are produced by the refactored code now that the skeletonisation method is passed a binary dilated mask and comparing these to the skeletons returned on the main branch.

The diff column shows the differences between the two branches (apologies for these not being to scale, its a quick hack to look at differences)

Molecules Method main ns-rse/818-tests-pruning-topostats-conv diff
linear topostats main_topostats_linear_final refactor_topostats_linear_final diff-linear-topostats-final
circular topostats main_topostats_circular_final refactor_topostats_circular_final diff-circular-topostats-final
linear zhang main_zhang_linear_final refactor_zhang_linear_final diff-linear-zhang-final
circular zhang main_zhang_circular_final refactor_zhang_circular_final diff-circular-zhang-final
linear lee main_lee_linear_final refactor_lee_linear_final diff-linear-lee-final
circular lee main_lee_circular_final refactor_lee_circular_final diff-circular-lee-final
linear thin main_thin_linear_final refactor_thin_linear_final diff-linear-thin-final
circular thin main_thin_circular_final refactor_thin_circular_final diff-circular-thin-final

We can see that for the zhang / lee / thin methods there is as suggested above an improvement as the branches which were not pruned on main since pruning was a method of the getSkeleton() class are now removed on branch ns-rse/818-tests-pruning-toposats-conv as @MaxGamill-Sheffield has refactored pruning into its own class and it is applied after skeletonisation.

This is an improvement and I would be happy to update the tests in light of these changes as from the diff plots show the underlying skeletons have simply been pruned.

For the topostats method though there are subtle changes in the final skeleton. The linear one has a slightly larger loop in the bottom right hand corner and a kink on the long arm near the top left. Differences in the circular molecule are harder to detect.

Would be useful to hear your thoughts on what might underpin these changes @MaxGamill-Sheffield and whether what is currently returned is acceptable before I dig deeper. I do appreciate you undertook this work some time ago.

I'm going to go through my current branch and remove reference to grain and ensure that the terms image and mask are instead consistently used to remove any and all ambiguity about what is being passed around/used at different steps.

@MaxGamill-Sheffield
Copy link
Collaborator

Sorry for the late reply here are my comments:

  1. Yes the input masks to the skeletonisation algorithm are different - this is because we found that topological features such as holes would be removed by the iterative binary dilation. As a result, this line in main was replaced with the smooth_grains() func which takes a grain (binary), dilates or blurs it and returns a binary mask.
  2. The variable smoothed_grains is in fact a mask, not an image. Or at least it should be. None of the variables used in the smooth_grains() func are images.
  3. The reason for the non-smoothing of the grains might be the terribly low sigma value. Previously, the grains were not passed in 1-at-a-time so the gaussian_sigma value found using the hardcoded max(grain.shape) / 256 which is expected to be > 1 but on reflection this might not be the case causing the image not to be smoothed, and skeletonisation failing.
  4. At least in my head "grain" was always used as an internal name and interchangeably for masks - I agree though consistency is key here so just using masks would be better.

@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 24, 2024

Thanks for having a look @MaxGamill-Sheffield

The existing test suite which is where a bunch of tests all fail and I'm currently investigating break down the processing steps and test individually and sequentially (so we can identify at what point breakages occur when undertaking refactoring)...

dnatrace.gaussian_filter()
dnatrace.get_disordered_trace()
<optional additional steps if required>

...as that was how the workflow ran when the tests were written and because gaussian_filter() smooths an image and not a binary mask its why the tests weren't working.

It's fine to have introduced dnatrace.smooth_grains() method as an alternative but the tests need updating when such refactoring is undertaken so that they a) actually use the alternative method (i.e. smooth_grains() instead of gaussian_filter() ; b) account for the changes that these may have on the test results (its ok for test results to change when we know we have modified the underlying methods).

  1. The variable smoothed_grains is in fact a mask, not an image. Or at least it should be. None of the variables used in the smooth_grains() func are images.

That might be true in your refactoring of dnaTrace.trace_dna() (lines 170-190 of 84d08966) but in that same commit there was, as we discussed, a renaming of self.gauss > self.smoothed_grain within the gaussian_filter() function when a Gaussian blur was applied to self.image (i.e. the original height images and not a mask).

The unit-tests are failing at tests/tracing/test_dnatracing_single_grain.py::test_get_disordered_trace and this is because dnatrace.gaussian_filter() is still being used which blurred the original image and didn't dilate the mask as you say smooth_grain() now does. In turn the call to dnatrace.get_disordered_trace() receives image = self.image and mask = self.smoothed_grain where self.smoothed_grain is image heights and not a binary mask. Thus even though in dnaTrace.trace_dna() the call to .get_disordered_trace() is preceeded by smooth_grain() rather than gaussian_filter() the tests failed as they hadn't been adapted to account for this change.

  1. The reason for the non-smoothing of the grains might be the terribly low sigma value. Previously, the grains were not passed in 1-at-a-time so the gaussian_sigma value found using the hardcoded max(grain.shape) / 256 which is expected to be > 1 but on reflection this might not be the case causing the image not to be smoothed, and skeletonisation failing.

I've not got to checking the tests for this yet but looking at the above mentioned PR the gaussian_sigma value is initially hard coded during class instantiation as...

self.sigma = 0.7 / (self.pixel_to_nm_scaling * 1e9)

...(both on main and maxgamill-sheffield/800-better-tracing) and it looks, as far as I can tell, that this value would have been used on main. In maxgamill-sheffield/800-better-tracing though this value isn't passed as an argument to the call to self.smooth_mask() (as its coming from the **kwargs of the parameters defined in topostats/default_config.yaml) and it is therefore set to max(mask.shape) / 256 or the user supplied value (default null and so default is to use the maximum mask dimension.

We can likely therefore do away with the line...

self.sigma = 0.7 / (self.pixel_to_nm_scaling * 1e9)

...as it is only used in the gaussian_filter() method which by the sounds of it is redundant as it has been replaced by smooth_grains().

I'll work on tidying this up and removing redundant code and getting the tests back on track.

Work on getting the test suite to work with the refactored code. Long way off working at the moment though.

+ Pass `default_config.yaml` into the `dnatrace_linear`/`dnatrace_circular` fixtures so we have all options in place.
+ Skip a bunch of tests that are contingent on the pruning having been done correctly, this will require tweaking of the
  `skeletonise_params` so that correctly pruned skeletons are returned.
+ `self.grain` > `self.mask` I found it confusing that the original image after smoothing was called
  `self.smoothed_grain` when it was heights and the mask was called `self.grain`. By having the mask named as such it
  avoids confusion and makes the attributes clearer.

A major hindrance to debugging this is when the skeleton is converted from an array to a list of co-ordinates that
define the location of points.

This makes it a bit of a pain to then plot and see what we have can be achieved with the following messy code...

```
original_skeleton_coords = np.asarray([[28, 47],
       [ 28,  48],
       [ 29,  46],
       [ 29,  48],
       [ 30,  47],
])
skeleton = np.zeros((original_skeleton_coords.max() + 2, original_skeleton_coords.max() + 2))
skeleton[original_skeleton_coords[:,0], original_skeleton_coords[:,1]] = 1
plt.imshow(skeleton)
plt.show()
```
The `dnaTrace.gaussian_filter()` method is obsolete and has been replaced by `dnaTrace.smooth_mask()`. The obsolete
function has therefore been removed and the related test adapted and expanded to test `test_smooth_mask()` with
parameters that vary the number of `dilation_iterations` and the value for `gaussian_sigma`.

The new `dnaTrace.smooth_mask()` performs binary dilation of the mask and compares this to a gaussian blur of the
mask. Whichever results in the smallest difference compared to the original mask then has "holes readded" and the
`self.smoothed_mask` is updated.

It is important to note that this differs from previous methods where the Gaussian filter was applied to the original
images of heights. Why this switch was made is unclear.

Additional work

+ Refactoring to remove mention of `grains` as the term is used in multiple places, sometimes with the original height
  array and sometimes with the mask array. This ambiguity can lead to confusion instead the terms `image` and `mask` are
  used.
Converting use of `grain` to `mask` in `nodestats` to align with nomenclature in `dnatracing`

Its more consistent and descriptive of what the arrays are in terms of NumPy parlance (they are all binary masks) and
avoids confusion where at times in the variables labelled `grain` historically contained height values and more recently
it contains masked data that is binary.
@MaxGamill-Sheffield
Copy link
Collaborator

MaxGamill-Sheffield commented Jun 25, 2024

X-post from slack:
Hey @ns-rse , these functions (gaussian_filter() and smooth_grains()) aren’t the same thing, they work on different inputs and also solve different problems:
• Gaussian blurring of the image in main was also hugely beneficial when we had the “fitted trace” step which attempted to shift the unordered coordinates onto the molecule backbone, a step which is now done using the height-biased skeletonisation method.
• Smoothing of the masks in 800-better-tracing incorporates the grain smoothing that is done in main’s get_disordered_trace() to combat the production of spurious branches when skeletonising. We now do both a dilation and blur and see which one has affected the least amount of pixels and choose that as the smoothed mask. The “very_smoothed_grain” in main which blurs the smoothed grain is not actually used.

My guess for why the new smooth_mask() function no longer operates as expected and gives different results is because the hardcoded sigma value (! not deffine at init but in smooth_grains()!) is too small due to the new input of a grain with a shape smaller than the older input in the cats branch which was an image, and thus nothing is smoothed.

Converting use of `grain` to `mask` in `nodestats` to align with nomenclature in `dnatracing`

Its more consistent and descriptive of what the arrays are in terms of NumPy parlance (they are all binary masks) and
avoids confusion where at times in the variables labelled `grain` historically contained height values and more recently
it contains masked data that is binary.

Improves type hints for numpy arrays.
Updates the test to reflect the changes introduced by switching from applying `guassian_filter()` to the original image
heights to the `smooth_mask()` which uses either binary dilation or Gaussian filtering applied to the binary mask.

Tests updated

+ `test_get_disordered_trace()`
This method is no longer used so it has been removed along with tests.
This reverts commit 594a3f1d739b12ca0654e36ee41355d532d95430.
@ns-rse
Copy link
Collaborator Author

ns-rse commented Jun 26, 2024

Hi @MaxGamill-Sheffield

Commits reverted so gaussian_filter() method and get_ordered_trace() are reinstated (although tests fail).

@MaxGamill-Sheffield
Copy link
Collaborator

Happy for this to be merged, I just fixed a few little bits that prevented the workflow from working:

  1. dnaTrace input var grain -> mask
  2. dnatracing var smoothed_grains -> smoothed_mask and func smooth_grains() -> smooth_mask()
  3. Nodestats skeletonisation and pruning references to use the newer methods from the config.

I have not updated the dnatracing tests as I thought this would be more relevant in the dnatracing PR, not this focused Prune / skeletonisation PR

@ns-rse
Copy link
Collaborator Author

ns-rse commented Jul 2, 2024

Thanks, the git revert's I undertook to put the gaussian_filter() and get_ordered_trace() functions back in not only restored the functions but also some of the renaming I had done that you have caught and reverted back again.

I must work harder on keeping my commits atomic!

@MaxGamill-Sheffield MaxGamill-Sheffield merged commit 4d5bad0 into maxgamill-sheffield/800-better-tracing Jul 2, 2024
2 checks passed
@MaxGamill-Sheffield MaxGamill-Sheffield deleted the ns-rse/818-tests-pruning-topostats-conv branch July 2, 2024 09:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants