can we partition gather'ed signatures and further parallelize? #1245

Closed
ctb opened this issue Nov 26, 2020 · 2 comments

@ctb
Contributor

ctb commented Nov 26, 2020

the greyhound experiment pre-screens database signatures for matches that have "interesting" containment overlaps with the query. This can be a major optimization not just for downstream containment reporting but ALSO for gather and search, because upper bounds on containment also constrain Jaccard and gather matches.
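
To make the bound concrete, here is a minimal sketch using plain Python sets of integer hashes. It does not use the sourmash or greyhound APIs; `containment`, `jaccard`, `prescreen`, and the toy data are all hypothetical names chosen for illustration. Because the union of two sets is at least as large as the query, Jaccard similarity can never exceed the containment of the query in the match, so a containment threshold also filters out low-Jaccard candidates.

```python
def containment(query, match):
    """Fraction of query hashes that are also present in match."""
    return len(query & match) / len(query) if query else 0.0


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prescreen(query, database, threshold):
    """Keep only database entries whose containment of the query meets threshold."""
    return {
        name: hashes
        for name, hashes in database.items()
        if containment(query, hashes) >= threshold
    }


# Toy data (hypothetical): hashes are just small integers here.
query = {1, 2, 3, 4, 5, 6, 7, 8}
database = {
    "a": {1, 2, 3, 4, 100, 101},
    "b": {5, 6, 200},
    "c": {300, 301, 302},
}

kept = prescreen(query, database, threshold=0.25)
```

With this toy data, `"c"` shares no hashes with the query and is dropped before any more expensive search or gather step would need to look at it.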

we could provide further options for optimization and parallelization by performing some kind of clustering, wherein we detect/discover/collect disjoint subsets of overlapping hashes and then run gather separately on each subset, potentially in parallel.
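
One way to sketch the clustering idea is as connected components: two matches land in the same group if they share at least one query hash, directly or through a chain of other matches. The example below is only an illustration with plain Python sets and a small union-find; `partition_matches` and the toy data are hypothetical and are not part of sourmash or greyhound.

```python
from collections import defaultdict


def partition_matches(query, matches):
    """Group match names so that different groups share no query hashes."""
    parent = {name: name for name in matches}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner = {}  # query hash -> first match name seen containing it
    for name, hashes in matches.items():
        for h in hashes & query:
            if h in owner:
                # Shared query hash: merge the two groups.
                parent[find(name)] = find(owner[h])
            else:
                owner[h] = name

    groups = defaultdict(list)
    for name in matches:
        groups[find(name)].append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy data (hypothetical).
query = {1, 2, 3, 4, 10, 11, 12, 50}
matches = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {10, 11},
    "d": {11, 12, 99},
}

groups = partition_matches(query, matches)
```

Since no query hash is shared across groups, a greedy gather over one group cannot change the result for another, so each group could in principle be handed to its own worker.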

A simpler version of this idea that would speed up gather (and is already implemented in greyhound, I suspect) would be to take the pre-screened matches and discard all hashes in the query that have no overlaps with any signatures with containment, as they will have no impact on any gather outputs.
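
A minimal sketch of that trimming step, again with plain Python sets rather than any sourmash or greyhound API (`trim_query` and the data are hypothetical names for illustration):

```python
def trim_query(query, prescreened):
    """Keep only the query hashes that appear in at least one pre-screened match."""
    covered = set()
    for hashes in prescreened.values():
        covered |= hashes
    return query & covered


# Toy data (hypothetical): hashes 7 and 8 overlap with nothing.
query = {1, 2, 3, 4, 5, 6, 7, 8}
prescreened = {
    "a": {1, 2, 3, 4, 100},
    "b": {5, 6, 200},
}

trimmed = trim_query(query, prescreened)
```

The discarded hashes can never be assigned to a match, so they only matter if the unassigned fraction of the original query is being reported; in that case the original query size would need to be kept alongside the trimmed set.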

@ctb
Contributor Author

ctb commented Apr 23, 2021

> A simpler version of this idea that would speed up gather (and is already implemented in greyhound, I suspect) would be to take the pre-screened matches and discard all hashes in the query that have no overlaps with any signatures with containment, as they will have no impact on any gather outputs.

This is now supported by `sourmash prefetch` in #1370.

@ctb
Contributor Author

ctb commented May 8, 2021

This is basically done in #1493 with the CounterGather functionality, just in a query-dependent way. I'm closing for now, since it's not clear we need to speed gather up any more at this point 😆
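
For readers unfamiliar with the counter-based approach, the sketch below shows the general idea as a toy, assuming it amounts to keeping a per-match count of overlapping query hashes and decrementing those counts as hashes are assigned to the current best match. It is NOT the code from #1493 and does not use the sourmash API; `counter_gather` and the data are hypothetical.

```python
from collections import Counter


def counter_gather(query, matches):
    """Greedy gather: repeatedly pick the match covering the most unassigned query hashes."""
    # The overlaps are computed against this specific query, hence "query-dependent".
    overlaps = {name: hashes & query for name, hashes in matches.items()}
    counts = Counter({name: len(ov) for name, ov in overlaps.items() if ov})

    results = []
    while counts:
        # Ties are broken alphabetically so the output is deterministic.
        best, n = max(sorted(counts.items()), key=lambda item: item[1])
        found = overlaps.pop(best)
        del counts[best]
        results.append((best, n))

        # Remove the assigned hashes from every remaining candidate.
        for name in list(counts):
            overlaps[name] -= found
            counts[name] = len(overlaps[name])
            if counts[name] == 0:
                del counts[name]
                del overlaps[name]
    return results


# Toy data (hypothetical).
query = {1, 2, 3, 4, 5, 6, 7, 8}
matches = {
    "a": {1, 2, 3, 4, 100},
    "b": {3, 4, 5, 6},
    "c": {4, 5},
    "d": {300},
}

results = counter_gather(query, matches)
```

Matches with no overlap never enter the counter, and matches whose overlap is fully consumed by earlier picks drop out on their own.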

@ctb ctb closed this as completed May 8, 2021