start thinking about a standard selector framework for signature search/compatibility #1072

Closed
ctb opened this issue Jul 2, 2020 · 6 comments
Labels
revisit_me An issue that needs attention and clarification

Comments

@ctb
Contributor

ctb commented Jul 2, 2020

Note: Updated in #1524, which contains the unresolved parts of this issue.

#936 added Index.select(ksize=ksize, moltype=moltype, ...) and with #1059 merged we have a standard API for loading piles of signatures, and I keep on coming back to doing cool things with selectors.

this issue replaces #599, which is just about md5sum selectors, with a place for more general discussion.

note that #934 fell apart into a mess of ugly code, and selector frameworks could provide some simplicity here.

this is all just brainstorming without any attempt to make the code work... but I like the idea of providing a few pieces of functionality.

the current situation

Index subclasses support a function select that currently takes ksize and moltype. it behaves differently on LinearIndex and on databases.

  • on LinearIndex, which can contain many different kinds of signatures, it applies the selector to all signatures and returns a new LinearIndex
  • on SBT and LCA databases (which are restricted to a single moltype/ksize), it returns the database itself if it meets the condition and raises an exception if it doesn't.

the underlying idea is to be able to say obj.select(<condition>) and have that condition hold for any future uses of obj.
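
To make the two behaviors above concrete, here is a minimal sketch using toy classes. ToySig, ToyLinearIndex, and ToyDatabase are hypothetical stand-ins written for illustration only; they are not the sourmash classes and the real implementations may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySig:
    """Hypothetical stand-in for a signature (not the sourmash class)."""
    name: str
    ksize: int
    moltype: str


class ToyLinearIndex:
    """Holds mixed signatures; select() filters and returns a new index."""

    def __init__(self, sigs):
        self.sigs = list(sigs)

    def select(self, ksize=None, moltype=None):
        # A condition left as None places no restriction.
        kept = [
            s for s in self.sigs
            if (ksize is None or s.ksize == ksize)
            and (moltype is None or s.moltype == moltype)
        ]
        return ToyLinearIndex(kept)


class ToyDatabase:
    """Fixed to one ksize/moltype; select() returns self or raises."""

    def __init__(self, ksize, moltype):
        self.ksize = ksize
        self.moltype = moltype

    def select(self, ksize=None, moltype=None):
        if ksize is not None and ksize != self.ksize:
            raise ValueError(f"ksize {ksize} != database ksize {self.ksize}")
        if moltype is not None and moltype != self.moltype:
            raise ValueError(
                f"moltype {moltype} != database moltype {self.moltype}"
            )
        return self
```

Either way, the caller can write obj = obj.select(ksize=31) and rely on the condition holding afterwards, without caring which kind of object it has.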

this would dovetail nicely with #198 in terms of supporting richer databases (e.g. multiple ksizes, moltypes, etc.)

some brainstormy thoughts for more selector foo

it'd be nice to have a selector object that could be used to apply partial restrictions to collections, e.g. just ksize selection. That's kind of how it works now in theory, but selectors are not very rich at the moment.


part of the idea is that selectors could be lazy, so that all the conditions could be resolved once, when you actually use the database or collection of signatures.


I really like the idea of method chaining although I'm not 100% sure exactly what that would look like. maybe db.select(moltype='DNA').select(ksize=57)?
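
One possible shape for lazy selectors plus method chaining, as a rough sketch only: each select() call records its conditions and returns a new object, and nothing is filtered until the collection is iterated. LazySelection and the dict-based records are hypothetical and not part of any existing API.

```python
class LazySelection:
    """Hypothetical lazy wrapper: records conditions, filters on iteration."""

    def __init__(self, records, conditions=None):
        self._records = records  # any re-iterable collection of dicts
        self._conditions = dict(conditions or {})

    def select(self, **conditions):
        # Reject a condition that contradicts one recorded earlier.
        for key, value in conditions.items():
            if self._conditions.get(key, value) != value:
                raise ValueError(f"conflicting values for {key!r}")
        # Return a new object so the original selection stays usable.
        merged = {**self._conditions, **conditions}
        return LazySelection(self._records, merged)

    def __iter__(self):
        # All conditions are resolved here, once per record, at use time.
        for rec in self._records:
            if all(rec.get(k) == v for k, v in self._conditions.items()):
                yield rec
```

With this shape, LazySelection(records).select(moltype='DNA').select(ksize=57) is equivalent to a single select(moltype='DNA', ksize=57), and conflicting chained conditions fail early rather than silently returning nothing.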


we might want to add scaled and num, and then allow selector functions to do the necessary downsampling (or raise objections).

similarly, we could provide MinHash flattening via selector.

(#611 relevant to both ideas)
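
For reference, a toy sketch of what downsampling and flattening mean for a scaled sketch, assuming 64-bit hash values and a cutoff of 2**64 // scaled. ToySketch, downsample, and flatten are hypothetical names for illustration; the exact cutoff arithmetic in the real code may differ.

```python
from dataclasses import dataclass

HASH_SPACE = 2**64  # assumption: hashes are 64-bit unsigned integers


@dataclass(frozen=True)
class ToySketch:
    """Hypothetical scaled sketch: hash value -> abundance."""
    scaled: int
    hashes: dict


def downsample(sketch, scaled):
    """Return a copy at a larger scaled value, or raise if impossible."""
    if scaled < sketch.scaled:
        # Hashes above the old cutoff were never kept, so a smaller
        # scaled value cannot be reconstructed.
        raise ValueError(
            f"cannot go from scaled={sketch.scaled} to scaled={scaled}"
        )
    max_hash = HASH_SPACE // scaled
    kept = {h: n for h, n in sketch.hashes.items() if h < max_hash}
    return ToySketch(scaled, kept)


def flatten(sketch):
    """Return a copy with abundances discarded (every count set to 1)."""
    return ToySketch(sketch.scaled, {h: 1 for h in sketch.hashes})
```

A selector that accepted scaled could call something like downsample() when the request is satisfiable and let the ValueError surface as the "objection" otherwise.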


md5sum is an obvious selector (#599)


would be nice to be able to use a signature or a database (so, a signature or an Index object?) as a selector, so that only compatible signatures/databases are selected. this would probably help resolve #809 / #934 more cleanly :)
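
A rough sketch of the "signature as selector" idea, assuming compatibility means matching ksize and moltype: the query is turned into ordinary select-style conditions, which are then applied to a collection. ToySig, conditions_from, and select_like are hypothetical names used only for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a signature (not the sourmash class).
ToySig = namedtuple("ToySig", "name ksize moltype")


def conditions_from(query):
    """Derive select()-style keyword conditions from a query signature."""
    return {"ksize": query.ksize, "moltype": query.moltype}


def select_like(sigs, query):
    """Keep only the signatures compatible with the query."""
    wanted = conditions_from(query)
    return [
        s for s in sigs
        if all(getattr(s, key) == value for key, value in wanted.items())
    ]
```

Since conditions_from() just produces keyword conditions, the same dict could be passed to any select(**conditions) method, so the signature-based selector would not need a separate code path.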


you could imagine applying taxonomic filtering via selectors, although that's kind of a different thing conceptually.

@ctb
Contributor Author

ctb commented Jul 3, 2020

random thought, having a robust API that raises appropriate exceptions for incompatible signatures could dovetail nicely with a more consistent and informative user experience.

@ctb
Contributor Author

ctb commented Mar 4, 2021

side note, it'd be awesome to be able to use accessions as selectors as well as md5sums, because then you could swizzle between ksize/moltypes fearlessly, too.

@ctb
Contributor Author

ctb commented Mar 31, 2021

Much progress made in #1420:

  • select now takes ksize, moltype, scaled, num, and containment
  • database loading now uses selectors in a nice generic way
  • partial restrictions are appropriately implemented

Items not tackled in #1420:

  • md5sum, name/accession, and taxonomic ID selectors
  • abundance selection and/or flattening
  • method chaining
  • selection via signature or Index object

(not all of these may be good ideas, either ;)

Once #1420 is merged we should close this issue and create a new issue with the remaining unimplemented ideas.

@ctb
Contributor Author

ctb commented Apr 2, 2021

Interesting conversation here about how selectors should work - tl;dr we should have them pick out signatures that can satisfy the conditions, but not actually modify them (by e.g. downsampling).

This dovetails with some stuff I've been thinking about in #1392 where we face a similar choice in searching - when we actually do the comparisons, we need to modify the signatures to match, but in terms of returning signatures, we probably want to return the original unmodified signature.
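
A small sketch of that split, under the assumption of 64-bit hashes and a cutoff of 2**64 // scaled: selection returns the original objects that could satisfy the condition, and downsampling happens only on temporary copies inside the comparison. ToySketch, select_scaled, and jaccard_at are hypothetical names, not existing functions.

```python
from dataclasses import dataclass

HASH_SPACE = 2**64  # assumption: hashes are 64-bit unsigned integers


@dataclass(frozen=True)
class ToySketch:
    """Hypothetical scaled sketch with an immutable set of hashes."""
    name: str
    scaled: int
    hashes: frozenset


def select_scaled(sketches, scaled):
    """Return the original sketches that could be downsampled to scaled."""
    return [s for s in sketches if s.scaled <= scaled]


def jaccard_at(a, b, scaled):
    """Compare two sketches at a common scaled value without changing them."""
    if scaled < a.scaled or scaled < b.scaled:
        raise ValueError("both sketches must be downsample-able to scaled")
    max_hash = HASH_SPACE // scaled
    # Temporary downsampled views; a and b themselves are untouched.
    ha = {h for h in a.hashes if h < max_hash}
    hb = {h for h in b.hashes if h < max_hash}
    union = ha | hb
    return len(ha & hb) / len(union) if union else 0.0
```

The caller gets back the same objects it put in, so whatever is reported or saved after a search is the unmodified signature.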

@ctb
Contributor Author

ctb commented Apr 7, 2021

note to self: #1427 suggests we need more select tests, and I think we should take that as an opportunity to write some Python API docs.

@ctb ctb added the revisit_me An issue that needs attention and clarification label May 8, 2021
@ctb
Contributor Author

ctb commented May 15, 2021

closing for #1524
