how to properly gather using an sbt as input? #1089

Closed
bluegenes opened this issue Jul 8, 2020 · 8 comments

Comments

@bluegenes
Contributor

Using the test data in tests/test-data/prot:

sourmash multigather --query protein.sbt.zip --db protein.sbt.zip --threshold-bp=0 --protein

== This is sourmash version 3.3.2.dev16+gd703b44. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=57 automatically.
When loading query from "protein.sbt.zip"
2 signatures matching ksize and molecule type;
need exactly one. Specify --ksize or --dna, --rna, or --protein.

Or, trying --query-from-file:

sourmash multigather --query-from-file protein.sbt.zip --db protein.sbt.zip --threshold-bp=0

Traceback (most recent call last):
  File "/home/ntpierce/miniconda3/envs/sourmash_dev/bin/sourmash", line 11, in <module>
    load_entry_point('sourmash', 'console_scripts', 'sourmash')()
  File "/home/ntpierce/sourmash/sourmash/__main__.py", line 14, in main
    return mainmethod(args)
  File "/home/ntpierce/sourmash/sourmash/cli/multigather.py", line 49, in main
    return sourmash.commands.multigather(args)
  File "/home/ntpierce/sourmash/sourmash/commands.py", line 718, in multigather
    more_files = sourmash_args.load_file_list_of_signatures(args.query_from_file)
  File "/home/ntpierce/sourmash/sourmash/sourmash_args.py", line 475, in load_file_list_of_signatures
    file_list = [ x.rstrip('\r\n') for x in fp ]
  File "/home/ntpierce/sourmash/sourmash/sourmash_args.py", line 475, in <listcomp>
    file_list = [ x.rstrip('\r\n') for x in fp ]
  File "/home/ntpierce/miniconda3/envs/sourmash_dev/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 12: invalid continuation byte
@ctb
Contributor

ctb commented Jul 8, 2020

interesting - the first one should have worked, now that you mention it :). good catch.

--query-from-file takes a text file listing the files to load signatures from. So in this case you could perhaps have put the path to protein.sbt.zip in a text file and passed that file instead. Not sure that would have worked either, though; gotta check!
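
To illustrate the list-file format described here, the following is a minimal sketch that mirrors the list comprehension visible in the traceback above. The function name `read_file_list` and the file name `queries.txt` are assumptions made for illustration; this is not sourmash's actual loader.

```python
import os
import tempfile


def read_file_list(path):
    """Return one path per line, stripping only the line endings.

    Mirrors the comprehension shown in the traceback; the function
    name is hypothetical and not part of sourmash.
    """
    with open(path, "rt", encoding="utf-8") as fp:
        return [line.rstrip("\r\n") for line in fp]


# Write a plain-text list file that names the SBT, one path per line.
tmpdir = tempfile.mkdtemp()
list_path = os.path.join(tmpdir, "queries.txt")  # hypothetical file name
with open(list_path, "wt", encoding="utf-8") as fp:
    fp.write("protein.sbt.zip\n")

paths = read_file_list(list_path)
print(paths)
```

Passing the zip file itself fails because its bytes are read as UTF-8 text, which is what the UnicodeDecodeError above reports. Whether an SBT path inside the list file loads correctly is a separate question, as noted.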

@bluegenes
Contributor Author

Thinking about this a bit more, and also sig selection (#1072)

Would it be useful to enable signature selection by name/identifier via an input file (e.g. pass an sbt via query and a file with a subset of desired signature names via --query-from-file or similar)?

Use case: store all dataset sigs within an sbt (without keeping the individual sig files). If you need to gather a subset of sigs against another database, you could select just the genomes/samples of interest, rather than gathering all sigs. Choosing a completely off-the-wall example... say we have all GTDB sigs calculated; can we pass in the full database sbts and select signatures from within?

I suppose this would rely on either 1) a standard naming scheme that includes a genome accession/identifier, or 2) a file mapping accessions/experimental identifiers to a sourmash signature identifier, e.g. the md5sum.
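
A rough sketch of option 2, assuming a two-column CSV mapping file. The column names `accession` and `md5`, the helper `select_md5s`, and all identifiers and md5 values below are made up for illustration; nothing here calls sourmash.

```python
import csv
import io

# Hypothetical mapping file: accession/experimental ID -> signature md5sum.
# The identifiers and md5 values are placeholders, not real data.
mapping_csv = io.StringIO(
    "accession,md5\n"
    "genomeA,md5-of-genomeA\n"
    "genomeB,md5-of-genomeB\n"
    "genomeC,md5-of-genomeC\n"
)


def select_md5s(mapping_rows, wanted):
    """Translate the wanted accessions into signature md5sums.

    Raises KeyError listing any accession missing from the mapping,
    so a typo does not silently shrink the query set.
    """
    lookup = {row["accession"]: row["md5"] for row in mapping_rows}
    missing = [acc for acc in wanted if acc not in lookup]
    if missing:
        raise KeyError("accessions not in mapping: " + ", ".join(missing))
    return [lookup[acc] for acc in wanted]


selected = select_md5s(csv.DictReader(mapping_csv), ["genomeA", "genomeC"])
print(selected)
```

The resulting md5sums could then be used to pick signatures out of the sbt, however that selection ends up being exposed.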

@ctb
Contributor

ctb commented Jul 9, 2020

you can already do this with --query-md5 for a single signature. You're talking about doing this for multigather, I think? Intriguing... One problem is that it is not currently that fast to load/select from 25k sigs in an SBT or an LCA, so this would not necessarily be that efficient.

@bluegenes
Contributor Author

bluegenes commented Jul 9, 2020

how does "not that fast" compare with 30mins-2hrs of snakemake DAG load times? And is that something that could be sped up, or likely to stay slow?

I might be exaggerating a little (I haven't explicitly timed all my runs), but you get the idea. I suppose I could get around this by keeping the signature files and using --query-from-file when it's working.

@ctb
Contributor

ctb commented Jul 9, 2020 via email

@bluegenes
Contributor Author

bluegenes commented Jul 9, 2020

sidenote: I now have an implementation for using the sbt via query or query-from-file working. Will PR after testing

@ctb
Contributor

ctb commented Jul 9, 2020

see #1090

@ctb
Contributor

ctb commented Jul 9, 2020

some other notes looking at this issue --

  • we should explicitly test that --query-from-file can contain paths to SBTs and LCAs.
  • we should improve the error message when unable to load --query-from-file.
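
One possible shape for the second note, as a sketch only: catch the decode failure while reading the list file and re-raise it with a message that names the file and the expected format. The function name `load_file_list`, the message wording, and the file name `not_a_list.zip` are assumptions, not the change that was actually made in sourmash.

```python
import os
import tempfile


def load_file_list(path):
    """Read a text file of paths, failing with a readable message.

    Hypothetical helper: it wraps the UnicodeDecodeError seen in the
    traceback in a ValueError that explains what was expected.
    """
    try:
        with open(path, "rt", encoding="utf-8") as fp:
            return [line.rstrip("\r\n") for line in fp]
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"cannot read '{path}' as a text file listing signature files; "
            "is it a binary file such as an SBT zip?"
        ) from exc


# Simulate passing a binary file directly, as in the original report.
tmpdir = tempfile.mkdtemp()
binary_path = os.path.join(tmpdir, "not_a_list.zip")  # hypothetical name
with open(binary_path, "wb") as fp:
    # 0xcc followed by 0xff is not valid UTF-8, so decoding fails.
    fp.write(b"PK\x03\x04\x14\x00\x00\x00\x08\x00\xcc\xff\xfe")

message = None
try:
    load_file_list(binary_path)
except ValueError as exc:
    message = str(exc)
print(message)
```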
