Order of magnitude faster HWCD implementation for trees #217

Open
wants to merge 5 commits into
base: master

NathanKolbow
Contributor

Two functions are added here: hardwiredClusterDistance_treelike and hardwiredClusters_treelike. Their runtimes grow linearly in the number of edges of the input tree, whereas the runtime of the current implementation appears to grow far faster than linearly.

This implementation is most useful when computing unrooted HWCD for large trees, because so many rooting combinations need to be checked.
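To illustrate why a tree-specific approach can avoid checking rootings, here is a minimal sketch. It is not the code in this PR: the toy tree type (nested tuples of strings) and the names collect_clusters!, rooted_clusters, unrooted_splits and rf_distance are all hypothetical. It gathers every cluster in one post-order pass and, for the unrooted case, stores each cluster as the side of its bipartition that excludes a fixed reference taxon, so two rootings of the same tree give the same set.

```julia
# Toy tree type for illustration only: a leaf is a String, an internal node is a
# Tuple of subtrees. This is NOT the PhyloNetworks HybridNetwork type.

# One post-order pass: returns the leaf set below `node` and records the cluster
# of every internal node except the root.
function collect_clusters!(clusters::Set{BitSet}, node, index::Dict{String,Int},
                           isroot::Bool=true)
    node isa String && return BitSet([index[node]])
    below = BitSet()
    for child in node
        union!(below, collect_clusters!(clusters, child, index, false))
    end
    isroot || push!(clusters, below)
    return below
end

function rooted_clusters(tree, index)
    clusters = Set{BitSet}()
    collect_clusters!(clusters, tree, index)
    return clusters
end

# Canonical form of each bipartition: keep the side without taxon 1,
# and drop trivial splits (a single leaf versus everything else).
function unrooted_splits(tree, index)
    n = length(index)
    everything = BitSet(1:n)
    splits = Set{BitSet}()
    for c in rooted_clusters(tree, index)
        side = 1 in c ? setdiff(everything, c) : c
        1 < length(side) < n - 1 && push!(splits, side)
    end
    return splits
end

function rf_distance(t1, t2, index; rooted::Bool=true)
    f = rooted ? rooted_clusters : unrooted_splits
    return length(symdiff(f(t1, index), f(t2, index)))
end

index = Dict(name => i for (i, name) in enumerate(["a", "b", "c", "d", "e"]))
t1 = (("a", "b"), ("c", ("d", "e")))
t2 = ("c", (("a", "b"), ("d", "e")))   # t1 re-rooted on the edge leading to c
t3 = (("a", "c"), ("b", ("d", "e")))   # a different topology

rf_distance(t1, t2, index; rooted=true)    # 2: the rooted trees differ
rf_distance(t1, t2, index; rooted=false)   # 0: same unrooted tree
rf_distance(t1, t3, index; rooted=false)   # 2
```

The set unions in this sketch make it simple rather than strictly linear; a linear-time version would need a more careful cluster encoding.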


Below are some figures showing that this implementation is marginally slower than the current implementation with fewer than ~75 taxa, but scales significantly better.

Runtime:

At n=1000, this new implementation is ~152x faster (27ms vs. 4.1s).

Memory Usage:

At n=1000, this new implementation uses ~1122x less memory (10.4 MiB vs. 11.4 GiB).
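For reference, the two ratios quoted above follow directly from the stated numbers, taking 1 GiB = 1024 MiB. The variable names are only for illustration.

```julia
# Ratios implied by the n = 1000 figures quoted above.
time_old, time_new = 4.1, 27e-3          # seconds
mem_old, mem_new = 11.4 * 1024, 10.4     # MiB (1 GiB = 1024 MiB)

speedup = time_old / time_new            # about 151.9
mem_ratio = mem_old / mem_new            # about 1122.5
```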


Below are benchmarks where tre is a 1,000-taxon tree that can be found here.

Old implementation:

julia> @benchmark hardwiredClusterDistance(tre, tre, true)
BenchmarkTools.Trial: 2 samples with 1 evaluation.
 Range (min … max):  3.585 s …  3.632 s  ┊ GC (min … max): 8.82% … 8.74%
 Time  (median):     3.609 s              ┊ GC (median):    8.78%
 Time  (mean ± σ):   3.609 s ± 32.832 ms  ┊ GC (mean ± σ):  8.78% ± 0.06%

  █                                                                   █
  █▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  3.59 s         Histogram: frequency by time        3.63 s <

 Memory estimate: 11.37 GiB, allocs estimate: 1502468.

New implementation:

julia> @benchmark hardwiredClusterDistance(tre, tre, true)
BenchmarkTools.Trial: 203 samples with 1 evaluation.
 Range (min … max):  23.405 ms … 31.016 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     23.956 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.690 ms ±  1.430 ms  ┊ GC (mean ± σ):  2.16% ± 3.65%

    █ ▂▂
  ▄█████▅▇▄▄▅▃▁▂▂▁▂▂▁▁▂▁▂▃▁▂▅▅▃▅▄▄▂▄▃▃▂▂▂▁▂▁▁▁▁▁▁▁▁▂▁▁▁▁▂▁▂▁▄ ▃
  23.4 ms         Histogram: frequency by time        28.8 ms <

 Memory estimate: 10.37 MiB, allocs estimate: 36755.

As written, these functions only work with trees, but they could be adapted to work with networks relatively easily.

Important note: hardwiredClusters_treelike and hardwiredClusterDistance_treelike were written for exclusively internal use, so (1) their docstrings are not very verbose, and (2) the return type of hardwiredClusters_treelike does not match the return type of hardwiredClusters. Both of these points could be changed with relatively little work if these functions are desired over the current implementations and are adapted to work on networks as well.

codecov bot commented Sep 24, 2024

Codecov Report

Attention: Patch coverage is 96.22642% with 2 lines in your changes missing coverage. Please review.

Files with missing lines    Patch %    Lines
src/compareNetworks.jl      96.22%     2 Missing ⚠️

Files with missing lines    Coverage Δ
src/compareNetworks.jl      98.54% <96.22%> (-0.78%) ⬇️

... and 1 file with indirect coverage changes

@cecileane
Member

This is fantastic! I'm very excited about the gain in speed. I suspect that a similar gain could also be achieved for non-tree networks. 3 things:

  1. For now we aren't making new changes to the package, as we are doing major work to shrink PhyloNetworks (on branch dev), and to remove many functionalities from it, to place them in new packages (all in development branches for now): trait evolution in PhyloTraits, PhyLiNC and SNaQ.jl. If you could help migrate the snaq! functions to SNaQ.jl, that would be awesome, because there's just a basic skeleton for now, not even a dev branch. @crsl4 is in charge of this part.
  2. It would be fantastic if similar ideas were used for non-tree networks, both for a speed up on all types of networks, but also for a more consistent code base.
  3. After the refactoring of PhyloNetworks & co. packages is done, would you be interested in helping to implement the edge-based μ-distance between semidirected networks described here?
    • The μ-representation of a network contains more information than the set of hardwired clusters, and so it is a distance on a larger class of networks than the hardwired-cluster distance.
    • On trees, it extends both the rooted RF and the unrooted RF distance (like the hardwired cluster distance).
    • And importantly, it can be computed fast, with no need to try all possible rootings of each semidirected network being compared, hence another speed gain.
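
As a rough illustration of why a μ-representation carries more information than hardwired clusters, the sketch below computes the classical node-based μ-vectors of a rooted network, in which entry i of μ(v) counts the directed paths from node v to leaf i. This is an assumption-laden toy: it is not the edge-based variant for semidirected networks mentioned above, and the function name mu_vectors and the child-list input format are hypothetical.

```julia
# children maps each internal node to its children; leaves have no entry.
function mu_vectors(children::Dict{String,Vector{String}}, leaves::Vector{String})
    index = Dict(name => i for (i, name) in enumerate(leaves))
    mu = Dict{String,Vector{Int}}()
    function visit(v)
        haskey(mu, v) && return mu[v]        # each node is computed once
        vec = zeros(Int, length(leaves))
        if haskey(index, v)
            vec[index[v]] = 1                # a leaf reaches itself by one path
        else
            for c in children[v]
                vec .+= visit(c)             # path counts add up over children
            end
        end
        return mu[v] = vec
    end
    foreach(visit, keys(children))
    return mu
end

# Toy network: h is a hybrid node with two parents, u and v.
leaves = ["a", "b", "c"]
children = Dict("r" => ["u", "v"], "u" => ["a", "h"],
                "v" => ["h", "c"], "h" => ["b"])
mu = mu_vectors(children, leaves)

mu["r"]                   # [1, 2, 1]: two paths from the root to b
leaves[mu["r"] .> 0]      # ["a", "b", "c"]: the hardwired cluster of r
```

On a tree every entry is 0 or 1, so the μ-vector of a node reduces to the indicator of its cluster; on a network, the path counts keep information that the cluster alone discards.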

@NathanKolbow
Contributor Author

  1. For now we aren't making new changes to the package, as we are doing major work to shrink PhyloNetworks (on branch dev), and to remove many functionalities from it, to place them in new packages (all in development branches for now): trait evolution in PhyloTraits, PhyLiNC and SNaQ.jl. If you could help migrate the snaq! functions to SNaQ.jl, that would be awesome, because there's just a basic skeleton for now, not even a dev branch. @crsl4 is in charge of this part.

Josh is working on moving things over to SNaQ.jl right now actually, and I'll be looking over the repo to make sure everything is there once he gets things moved.

  2. It would be fantastic if similar ideas were used for non-tree networks, both for a speed up on all types of networks, but also for a more consistent code base.

Yes! I think these could be adapted for networks very easily; I just personally needed a faster implementation specifically for trees.

  3. After the refactoring of PhyloNetworks & co. packages is done, would you be interested in helping to implement the edge-based μ-distance between semidirected networks described here?

    • The μ-representation of a network contains more information than the set of hardwired clusters, and so it is a distance on a larger class of networks than the hardwired-cluster distance.
    • On trees, it extends both the rooted RF and the unrooted RF distance (like the hardwired cluster distance).
    • And importantly, it can be computed fast, with no need to try all possible rootings of each semidirected network being compared, hence another speed gain.

This looks interesting! Does a project/codebase already exist for this? There does not seem to be any software implementation referenced in the paper.

@cecileane
Member

3. After the refactoring of PhyloNetworks & co. packages is done, would you be interested in helping to implement the edge-based μ-distance between semidirected networks described here?

This looks interesting! Does a project/codebase already exist for this? There does not seem to be any software implementation referenced in the paper.

No, but it could be very useful to many. Hence my question!
