-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] expand signature selection and compatibility checking in database loading code #1420
Changes from all commits
7f52d7c
dde14fd
3f498a4
df19926
e8233ca
b44c3cf
7133ac1
f5f1c9c
d6f156f
785a9a4
23d7ac4
efc07cd
12399e7
2b7acb9
f663426
9aae1cb
2630be2
4f1a7fe
736ddf3
16719ce
6d8663e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -171,6 +171,7 @@ class SBT(Index): | |
We use two dicts to store the tree structure: One for the internal nodes, | ||
and another for the leaves (datasets). | ||
""" | ||
is_database = True | ||
|
||
def __init__(self, factory, *, d=2, storage=None, cache_size=None): | ||
self.factory = factory | ||
|
@@ -189,19 +190,60 @@ def signatures(self): | |
for k in self.leaves(): | ||
yield k.data | ||
|
||
def select(self, ksize=None, moltype=None): | ||
first_sig = next(iter(self.signatures())) | ||
def select(self, ksize=None, moltype=None, num=0, scaled=0, | ||
containment=False): | ||
"""Make sure this database matches the requested requirements. | ||
|
||
ok = True | ||
if ksize is not None and first_sig.minhash.ksize != ksize: | ||
ok = False | ||
if moltype is not None and first_sig.minhash.moltype != moltype: | ||
ok = False | ||
Will always raise ValueError if a requirement cannot be met. | ||
|
||
if ok: | ||
return self | ||
The only tricky bit here is around downsampling: if the scaled | ||
value being requested is higher than the signatures in the | ||
SBT, we can use the SBT for containment but not for | ||
similarity. This is because: | ||
|
||
raise ValueError("cannot select SBT on ksize {} / moltype {}".format(ksize, moltype)) | ||
* if we are doing containment searches, the intermediate nodes | ||
can still be used for calculating containment of signatures | ||
with higher scaled values. This is because only hashes that match | ||
in the higher range are used for containment scores. | ||
* however, for similarity, _all_ hashes are used, and we cannot | ||
implicitly downsample or necessarily estimate similarity if | ||
the scaled values differ. | ||
""" | ||
# pull out a signature from this collection - | ||
first_sig = next(iter(self.signatures())) | ||
db_mh = first_sig.minhash | ||
|
||
# check ksize. | ||
if ksize is not None and db_mh.ksize != ksize: | ||
raise ValueError(f"search ksize {ksize} is different from database ksize {db_mh.ksize}") | ||
|
||
# check moltype. | ||
if moltype is not None and db_mh.moltype != moltype: | ||
raise ValueError(f"search moltype {moltype} is different from database moltype {db_mh.moltype}") | ||
|
||
# containment requires 'scaled'. | ||
if containment: | ||
if not scaled: | ||
raise ValueError("'containment' requires 'scaled' in SBT.select'") | ||
if not db_mh.scaled: | ||
raise ValueError("cannot search this SBT for containment; signatures are not calculated with scaled") | ||
|
||
# 'num' and 'scaled' do not mix. | ||
if num: | ||
if not db_mh.num: | ||
raise ValueError(f"this database was created with 'scaled' MinHash sketches, not 'num'") | ||
if num != db_mh.num: | ||
raise ValueError(f"num mismatch for SBT: num={num}, {db_mh.num}") | ||
|
||
if scaled: | ||
if not db_mh.scaled: | ||
raise ValueError(f"this database was created with 'num' MinHash sketches, not 'scaled'") | ||
|
||
# we can downsample SBTs for containment operations. | ||
if scaled > db_mh.scaled and not containment: | ||
raise ValueError(f"search scaled value {scaled} is less than database scaled value of {db_mh.scaled}") | ||
Comment on lines
+216
to
+244
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This looks a lot like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. precisely - better error messages! ( it's not actually identical, I don't think, and I shied away from forcing it into the same code - if, after a lot of testing, it turns out to be identical, we can refactor.) |
||
|
||
return self | ||
|
||
def new_node_pos(self, node): | ||
if not self._nodes: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reasoning for not checking the value of
scaled
here?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oooh, great question! I have two answers :). One is pragmatic, and one is conceptual.
The pragmatic answer is, we didn't need it in order for all the tests to pass 😆
The conceptual answer is, this selects signatures that could be converted, but doesn't actually do the conversion. This is a tension that I haven't figured out how to resolve - should selectors return appropriately downsampled signatures, or should they just select signatures that could be downsampled?
I think the right thing to do is to punt this to a new issue for discussion, since I don't think we have any situations in the codebase that depend on answering it right now (see first point, "all tests pass").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I will take care of punting to the new issue, but would appreciate pushback and/or your thoughts!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the approach of selecting sigs that could be converted! ...but not all scaled sigs can be downsampled (desired scaled < sig scaled), right?
😆
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep. I think this is a great idea for a targeted test, I'll see what I can do!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I prefer returning without explicit downsampling, and maybe make another function that uses
select_signatures
to do the actual downsampling.(depending on the application that can be a lot of downsampling without really needing it...)