Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvement: allow both num and scaled to be set #538

Open
luizirber opened this issue Aug 28, 2018 · 8 comments
Open

Improvement: allow both num and scaled to be set #538

luizirber opened this issue Aug 28, 2018 · 8 comments

Comments

@luizirber
Copy link
Member

In the current implementation n and scaled are exclusive: if you set one, the other must be unset.

If we have enough hashes, we can convert from a scaled minhash back to a regular mash sketch (by truncating the hashes list at n). In cases where we do have enough hashes (like in a metagenome), it doesn't make much difference, but for cases where setting a scaled value lead to just a bunch of hashes (think: viruses), we can't apply this conversion.

I think we can support the conversion anyway, if we change a bit how we save the data internally:

  • let the user set both n and scaled
  • when adding hashes, collect enough to reach n
  • after reaching n, apply the max_hash comparison (derived from scaled)
  • when calling get_mins() or doing any operation with scaled set, discard any hash over max_hash
@luizirber luizirber changed the title Improvement: allow both n and scaled to be set Improvement: allow both num and scaled to be set Aug 28, 2018
@ctb
Copy link
Contributor

ctb commented Aug 28, 2018 via email

@luizirber
Copy link
Member Author

Hmm, it wouldn't change anything with searching if get_mins is still doing the right thing (returning the appropriate scaled or num mins). But curious to see what are the problems I'm missing =]

@ctb
Copy link
Contributor

ctb commented Dec 27, 2018

After writing the docs in #436 I now understand this idea. Yes, I like!

@luizirber
Copy link
Member Author

This involves changes in the C++ code, so I don't think we should implement it before #424 is merged =]

@luizirber
Copy link
Member Author

This would also allow a mash screen-like index, for datasets that are too small for scaled (think: viruses)

@ctb
Copy link
Contributor

ctb commented Mar 4, 2020

relevant: marbl/Mash#133

@ctb
Copy link
Contributor

ctb commented Jul 3, 2020

selectors #1072 could help with framing how to do the automatic conversion.

@ctb
Copy link
Contributor

ctb commented Mar 31, 2021

after #1420, this should be much easier (maybe even straightforward?) to implement!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants