Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

write up downsampling details #407

Closed
ctb opened this issue Feb 19, 2018 · 4 comments
Closed

write up downsampling details #407

ctb opened this issue Feb 19, 2018 · 4 comments

Comments

@ctb
Copy link
Contributor

ctb commented Feb 19, 2018

there's an increasing amount of logic in sourmash around making sure that things get downsampled properly so that search/gather return accurate results.

this logic should be documented, and then enforced/made simple through the API.

FOR EXAMPLE,

  • SBTs cannot be downsampled to match signatures, but signatures can be downsampled before searching trees;
  • scaled and num can in certain circumstances be interconverted
  • sourmash gather (as currently implemented) potentially downsamples with each match.
@ctb
Copy link
Contributor Author

ctb commented Dec 30, 2018

Partially addressed in #436.

TODO:

  • sourmash gather potentially downsamples with each match
  • SBTs cannot be downsampled

@ctb
Copy link
Contributor Author

ctb commented Jan 13, 2019

Just to explain the above notes a bit more -- I guess a few points need to be made.

First, signatures can only be compared if they have the same num or the same scaled values. (That's intrinsic to the math.)

Downsampling to a common num or scaled value is always possible with signatures, but you lose resolution as you downsample because downsampling essentially always involves removing hashes.

SBTs use bloom filters for all internal nodes, and there is no way to remove hashes from a bloom filter. Therefore internal nodes cannot be downsampled without recalculating from scratch, which we don't support currently, and see no reason to support. This means that you can't search a signature with a scaled of 1e4 against of an SBT built on scaled signatures with scaled=1e3.

Signatures can be downsampled to match an SBT, however. So you can search a signature with a scaled of 1e2 against an SBT built on scaled signatures with scaled=1e3, because you can raise the scaled of the signature through downsampling.

Note, LCA databases can be downsampled.

@ctb
Copy link
Contributor Author

ctb commented May 3, 2020

see #928 for discussion of why having multi-scaled queries in gather would be hard to implement, due to the need to subtract low resolution signatures from high resolution signatures.

@ctb
Copy link
Contributor Author

ctb commented Apr 10, 2021

note that #1420 put this information into the db.select(...) docstrings for SBT and LCA databases. I think this can be closed now 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant