further refactor gather_databases() function in search.py #1517

Closed
ctb opened this issue May 12, 2021 · 3 comments

Comments

@ctb
Contributor

ctb commented May 12, 2021

per @luizirber, on running gather on a gigantic signature --

Right now it is spending all the time inside `sum_abunds = sum((orig_query_abunds[k] for k in orig_query_hashes))`, because it is pulling each hash of the original query individually (all, err, billions of them?)

Digging into this code, it recomputes this sum on every iteration, and the only reason is that the original query may have been downsampled. Most of the time it won't need to be calculated more than once.

We should be able to refactor this code easily in one or two ways:

  • first, cache this by scaled value, maybe? that would be easy to do.
  • second, refactor this code out to a method on MinHash and (going further) then oxidize it.
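For the first option, here is a rough sketch of what caching by scaled value might look like. This is not the actual search.py code: `AbundSumCache`, `n_computed`, and the `MAX_HASH // scaled` cutoff are hypothetical names and assumptions made up for illustration, and the query abundances are modeled as a plain dict.

```python
MAX_HASH = 2**64  # assumed size of the 64-bit hash space


class AbundSumCache:
    """Hypothetical helper (not part of sourmash): caches the total query
    abundance per scaled value so it is computed at most once per scaled."""

    def __init__(self, orig_query_abunds):
        # dict mapping hash value -> abundance, as in the original query
        self.orig_query_abunds = orig_query_abunds
        self._sums = {}
        self.n_computed = 0  # counts real computations, for illustration

    def sum_abunds(self, scaled):
        if scaled not in self._sums:
            # Assumption: downsampling to `scaled` keeps only hashes
            # below MAX_HASH / scaled.
            max_hash = MAX_HASH // scaled
            # Iterate over the items once instead of looking up each hash.
            self._sums[scaled] = sum(
                abund
                for hashval, abund in self.orig_query_abunds.items()
                if hashval < max_hash
            )
            self.n_computed += 1
        return self._sums[scaled]


cache = AbundSumCache({10: 2, 2**62 + 1: 3, 2**63 + 1: 5})
for scaled in (1, 1, 2, 2, 2, 4):  # stand-in for gather iterations
    total = cache.sum_abunds(scaled)
```

With this shape, the expensive pass over the query only happens when the scaled value actually changes, which should be rare.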

This is the trend we're heading towards in #1512 and previous, too - move stuff away from Python and into MinHash.

Naively I wonder if this is or could be solved by a similar function to the one needed for #1463

@ctb
Contributor Author

ctb commented May 12, 2021

aiee! weighted_missed (which is what this is used to calculate) isn't even used in the loop 😆. So this is a largely unnecessary calculation anyway!

There are others that are similar in there, tho, that could be improved. But they're all much, much smaller.

@mr-eyes
Member

mr-eyes commented Jun 11, 2021

What do I need here to get started?

@ctb
Contributor Author

ctb commented Sep 23, 2021

I think this was fixed in #1613, actually.

ctb closed this as completed Sep 23, 2021