Would a "Directory" Index be useful? #810

Closed
ctb opened this issue Dec 24, 2019 · 11 comments · Fixed by #1406

Comments

@ctb
Contributor

ctb commented Dec 24, 2019

I've been thinking about the idea of a DirectoryIndex that would replace the explicit functionality of directory traversal by encapsulating it within a subclass of Index. It seems like a nice abstraction, and as we develop further filtering and signature selection mechanisms (e.g. chaining) on top of Index objects, it'd be nice to have them apply to directory traversal.

The most basic form of DirectoryIndex would be something that simply traversed the directory every time signatures() was called. We could also imagine adding one that loads the signatures once and/or caches them, but that might get too memory intensive.
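A minimal sketch of that most basic form, assuming nothing about sourmash's actual `Index` API: `DirectoryIndex`, `_read_lines`, and the `.sig` suffix filter are illustrative names and choices, and the loader is a stand-in that treats each non-empty line of a file as one "signature". The only point is that the directory is re-walked on every `signatures()` call, so nothing is cached.

```python
import os
from typing import Callable, Iterable, Iterator


def _read_lines(path: str) -> Iterator[str]:
    """Stand-in loader (hypothetical): one 'signature' per non-empty line."""
    with open(path) as fp:
        for line in fp:
            line = line.strip()
            if line:
                yield line


class DirectoryIndex:
    """Toy index that re-traverses `root` each time signatures() is called."""

    def __init__(
        self,
        root: str,
        loader: Callable[[str], Iterable[str]] = _read_lines,
        suffix: str = ".sig",
    ) -> None:
        self.root = root
        self.loader = loader
        self.suffix = suffix

    def _paths(self) -> Iterator[str]:
        for dirpath, dirnames, filenames in os.walk(self.root):
            dirnames.sort()  # make the traversal order deterministic
            for name in sorted(filenames):
                if name.endswith(self.suffix):
                    yield os.path.join(dirpath, name)

    def signatures(self) -> Iterator[str]:
        # No caching: files added after construction are picked up
        # by the next call, at the cost of re-reading everything.
        for path in self._paths():
            yield from self.loader(path)
```

A caching variant would store the result of the first traversal, trading memory for speed.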

Any thoughts?

@ctb
Contributor Author

ctb commented May 3, 2020

Related: with #648, @luizirber opened up the possibility of exploring zip files full of signatures...

@luizirber
Member

Would a file-of-files approach be preferable (or complementary)?

One example: I tried to index an SBT by using the leaf nodes from an existing SBT, but --traverse-directory expects signature files to end in .sig (which is not the case in the SBT storage), and listing all the files on the sourmash index command line makes it too long (there were 94k files). I ended up having to copy and rename the sigs...

Another benefit of the FoF approach is that you can generate the list dynamically from the command line:
$ sourmash index --fof <(ls -1 .sbt.genbank | grep -v internal)
would have worked for me (not sure what argument name to use, though).
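As a rough sketch of what reading such a file-of-files could look like (the names `iter_paths` and `load_pathlist` are made up here, and treating `-` as "read from stdin" is an assumption, not an existing sourmash convention): with bash process substitution, `<(...)` expands to a path that can be opened like an ordinary file, so no special handling is needed for that case.

```python
import sys
from typing import Iterator, List, TextIO


def iter_paths(fp: TextIO) -> Iterator[str]:
    """Yield one path per line, skipping blank lines."""
    for line in fp:
        path = line.strip()
        if path:
            yield path


def load_pathlist(source: str) -> List[str]:
    """Read a file-of-files from `source`.

    Assumption for this sketch: "-" means standard input. Any other
    value is opened as a file, which also covers the /dev/fd/N style
    path that bash passes for <(...).
    """
    if source == "-":
        return list(iter_paths(sys.stdin))
    with open(source) as fp:
        return list(iter_paths(fp))
```

Keeping the line parsing in `iter_paths` separate from opening the source makes the stdin case easy to test with an in-memory stream.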

@ctb
Contributor Author

ctb commented May 4, 2020

yes for file-of-files approach - this was requested in #662 too.

@ctb
Contributor Author

ctb commented Jul 1, 2020

hmm, while #1059 adds file-of-files, we don't test loading the file-of-files from stdin. Good idea.

@ctb
Contributor Author

ctb commented Jul 1, 2020

but what I really stopped by to say was, the changes in #1059 have made me realize that something that would be very useful is a better Index subclass for representing collections of signature files, e.g. the results from both file-of-files and directory traversal. In particular, right now when using the generic _load_database command in sourmash_args, there is no way to know what file a particular signature was loaded from.

This caused a bit of pain in command_summarize.py:load_singletons_and_count where I had to do a separate directory traversal to get that info.

@ctb
Contributor Author

ctb commented Feb 28, 2021

> but what I really stopped by to say was, the changes in #1059 have made me realize that something that would be very useful is a better Index subclass for representing collections of signature files, e.g. the results from both file-of-files and directory traversal. In particular, right now when using the generic _load_database command in sourmash_args, there is no way to know what file a particular signature was loaded from.
>
> This caused a bit of pain in command_summarize.py:load_singletons_and_count where I had to do a separate directory traversal to get that info.

I dug into this a little bit, as part of research for ZipFileLinearIndex refactoring.

  1. In command_summarize.py:load_singletons_and_count, we gather a list of filenames thru traversal and then individually use load_file_as_signatures on them.
  2. This is so that, for each signature, we can provide a tuple (filename, query_sig, hashvals)
  3. This tuple containing filename is then passed to output_results(...) and output_csv(...) so that output containing the filename from which a sig was loaded can be supported, like this:
% sourmash lca summarize --db tests/test-data/lca/delmont-1.lca.json --query tests/test-data/lca/TARA_ASE_MAG_00031.sig 
100.0%   200   Bacteria   tests/test-data/lca/TARA_ASE_MAG_00031.sig:5b438c6c TARA_ASE_MAG_00031
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
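The three steps above can be sketched roughly as follows. This is not the code in command_summarize.py: `QueryRecord` and `format_origin` are hypothetical names, and the assumption that the suffix after the colon is the first 8 characters of the signature's md5 is inferred from the sample output only.

```python
from typing import FrozenSet, NamedTuple


class QueryRecord(NamedTuple):
    """One loaded query: where it came from plus what is needed for output."""

    filename: str            # file the signature was loaded from
    name: str                # signature name
    md5: str                 # full md5 hex digest of the signature
    hashvals: FrozenSet[int]


def format_origin(rec: QueryRecord) -> str:
    """Render the 'filename:md5prefix name' column (assumed 8-char prefix)."""
    return f"{rec.filename}:{rec.md5[:8]} {rec.name}"
```

Carrying the filename alongside each signature is what forces the separate traversal: the filename is known only at load time, so it has to be recorded then.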

one thought is we could support an "originates from" field in signatures, as in "this signature was loaded from <here>"

And maybe that origin should be a URL that can be interpreted nicely for the user, but also provided in machine-readable format in our various output formats?
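One way to picture the URL idea, as a sketch only: a `file://` URL for the machine-readable form and a shorter rendering for display. The function names and the use of the URL fragment to carry an identifier such as an md5 prefix are assumptions made for illustration.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def origin_url(path: str, fragment: str = "") -> str:
    """Machine-readable origin: an absolute file:// URL, optional #fragment."""
    url = Path(path).resolve().as_uri()
    return f"{url}#{fragment}" if fragment else url


def display_origin(url: str) -> str:
    """User-facing rendering: decoded path, plus ':fragment' if present."""
    parts = urlsplit(url)
    text = unquote(parts.path)
    return f"{text}:{parts.fragment}" if parts.fragment else text
```

For example, `origin_url("query.sig", "5b438c6c")` gives a `file://` URL ending in `query.sig#5b438c6c`, and `display_origin` turns that back into a path ending in `query.sig:5b438c6c`.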

@ctb
Contributor Author

ctb commented Feb 28, 2021

(what this defaults to for internally-created signatures is 🤷 )

@ctb
Contributor Author

ctb commented Feb 28, 2021

note that the lazy loading Index issue asks for a way to find the location from which a signature was loaded. convergence! However, that issue needs the location to be outside the SourmashSignature object so that the signature itself doesn't need to be loaded (for lazy loading).
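A small sketch of why the location has to live outside the signature for lazy loading: the wrapper below (a hypothetical `LazySignature`, not a sourmash class) exposes `location` immediately and only calls the loader on first access.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class LazySignature:
    """Holds a location up front; loads the signature only on demand."""

    location: str
    loader: Callable[[str], Any]
    _value: Any = field(default=None, init=False, repr=False)
    _loaded: bool = field(default=False, init=False, repr=False)

    def get(self) -> Any:
        # The first call pays the loading cost; later calls reuse the result.
        if not self._loaded:
            self._value = self.loader(self.location)
            self._loaded = True
        return self._value
```

If the location were stored as an attribute of the signature object itself, reading it would require loading the signature, which defeats the purpose.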

@ctb
Contributor Author

ctb commented Mar 8, 2021

DirectoryIndex could be a special case of MultiIndex that provides an appropriate .load function.

@ctb
Contributor Author

ctb commented Mar 28, 2021

> DirectoryIndex could be a special case of MultiIndex that provides an appropriate .load function.

Fixed in #1406 with function MultiIndex.load_from_path(...).

@ctb
Contributor Author

ctb commented Mar 28, 2021

> but what I really stopped by to say was, the changes in #1059 have made me realize that something that would be very useful is a better Index subclass for representing collections of signature files, e.g. the results from both file-of-files and directory traversal. In particular, right now when using the generic _load_database command in sourmash_args, there is no way to know what file a particular signature was loaded from.
>
> This caused a bit of pain in command_summarize.py:load_singletons_and_count where I had to do a separate directory traversal to get that info.

Fixed in #1406 with the new MultiIndex.signatures_with_location(...).
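To illustrate the shape of that resolution without reproducing sourmash's code, here is a toy stand-in. `ToyMultiIndex` is not the real `MultiIndex`; only the method names `load_from_path` and `signatures_with_location` come from this thread, and the constructor arguments and the line-based loader are assumptions.

```python
import os
from typing import Callable, Iterator, List, Sequence, Tuple


def _load_lines(path: str) -> List[str]:
    """Stand-in loader (hypothetical): one 'signature' per non-empty line."""
    with open(path) as fp:
        return [line.strip() for line in fp if line.strip()]


class ToyMultiIndex:
    """Toy collection of per-file signature lists that remembers each source."""

    def __init__(self, sig_lists: Sequence[List[str]], locations: Sequence[str]):
        self.sig_lists = list(sig_lists)
        self.locations = list(locations)

    @classmethod
    def load_from_path(
        cls, pathname: str, loader: Callable[[str], List[str]] = _load_lines
    ) -> "ToyMultiIndex":
        """Walk `pathname` once, loading every file and recording its path."""
        sig_lists, locations = [], []
        for dirpath, dirnames, filenames in os.walk(pathname):
            dirnames.sort()  # deterministic traversal order
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                sig_lists.append(loader(path))
                locations.append(path)
        return cls(sig_lists, locations)

    def signatures(self) -> Iterator[str]:
        for sigs in self.sig_lists:
            yield from sigs

    def signatures_with_location(self) -> Iterator[Tuple[str, str]]:
        """Yield (signature, source path) pairs."""
        for sigs, loc in zip(self.sig_lists, self.locations):
            for sig in sigs:
                yield sig, loc
```

In this sketch the traversal happens once, at load time, so callers that need the source file no longer have to walk the directory a second time.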
