enable the addition (and removal?) of signatures from LCA databases. #849
Comments
Intriguing. SBTs can support remove too (might need to think how to keep them dense, tho), and for the LinearIndex it's trivial. And for future indices that don't support it, throw an error?
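To make the "support remove where it's trivial, throw an error otherwise" idea concrete, here is a minimal sketch. It is not sourmash's actual class hierarchy: `SketchIndex`, `ListIndex`, and `FrozenIndex` are hypothetical names, and signatures are reduced to a name plus a set of hashes.

```python
class SketchIndex:
    """Hypothetical base class for illustration; not part of sourmash."""

    def insert(self, name, hashes):
        raise NotImplementedError

    def remove(self, name):
        # Default behaviour: an index that cannot delete says so explicitly.
        raise NotImplementedError(
            f"{type(self).__name__} does not support remove"
        )


class ListIndex(SketchIndex):
    """Flat list of entries, where removal is a simple filter."""

    def __init__(self):
        self._entries = []

    def insert(self, name, hashes):
        self._entries.append((name, frozenset(hashes)))

    def remove(self, name):
        # Drop every entry with this name and report how many were dropped.
        before = len(self._entries)
        self._entries = [e for e in self._entries if e[0] != name]
        return before - len(self._entries)

    def names(self):
        return [name for name, _ in self._entries]


class FrozenIndex(SketchIndex):
    """Stand-in for a future index type that cannot delete entries."""

    def __init__(self, entries):
        self._entries = tuple(entries)
```

With this shape, a CLI could call `remove` on whatever index it loaded and turn the `NotImplementedError` into a readable message. Keeping a tree-shaped index dense after a removal is a separate problem that this sketch does not address.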
I thought about using Other potential operations in SBTs that would fit a database CLI:
|
could also provide a screen or mask option, which would be useful for benchmarking / hold-one-out studies (I kinda need this for charcoal). see also #985. |
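A rough sketch of what masking could look like for a hold-one-out run, assuming signatures are plain sets of hashes and similarity is Jaccard. The `search` function and its `masked` parameter are hypothetical and do not reflect sourmash's search API.

```python
def jaccard(a, b):
    """Jaccard similarity of two hash collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(entries, query_hashes, masked=frozenset(), threshold=0.0):
    """Score the query against every entry whose name is not masked.

    entries: dict mapping signature name -> set of hashes (toy stand-in).
    masked: names to skip, e.g. the genome being held out.
    """
    results = []
    for name, hashes in entries.items():
        if name in masked:
            continue  # masked entries stay in the database but are ignored
        score = jaccard(hashes, query_hashes)
        if score >= threshold:
            results.append((score, name))
    # Best match first.
    return sorted(results, reverse=True)
```

For a hold-one-out study, each genome is used as the query with its own name in `masked`, so the database never has to be rebuilt between runs.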
for testing/evaluation purposes where performance is not a big issue, using |
random thoughts from today. for testing/evaluation --
|
This would be very helpful for me right now. I have a really strange issue with an sbt.zip db that I built: although I am only passing unique signatures to With v3.4.1rc1, I set up my db as follows: After setting up my db and finding that
(I did the above, because while trying to build an lca db from the signatures in the sbt.zip, And if I
That is, in two cases (accessions MG522850.1 and KY084478.1) there are more than 1,000 replicated signatures. In total, 14,680 signatures are present at least twice and the remaining 707,167 are unique, for a total of 721,847 signatures (from the NCBI genomes GCA* db). So, it would be really handy to be able to remove signatures, especially en masse or by regex, after loading the sbt.zip db into memory once. Can you please suggest an efficient way to do this with the Python API? Thanks! |
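As a stopgap illustration only: the sketch below works on a plain list of signature names, not on a loaded SBT, and does not use the sourmash API. The helper names are hypothetical. It counts replicated names, keeps the first occurrence of each, and drops names matching a regular expression.

```python
import re
from collections import Counter


def duplicate_counts(names):
    """Return {name: count} for every name that appears more than once."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


def drop_duplicates(names):
    """Keep the first occurrence of each name, preserving input order."""
    seen = set()
    kept = []
    for name in names:
        if name not in seen:
            seen.add(name)
            kept.append(name)
    return kept


def remove_matching(names, pattern):
    """Drop every name that the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [name for name in names if not regex.search(name)]


# Toy input: the list and "AB000001.1" are made up for illustration.
names = ["MG522850.1", "KY084478.1", "MG522850.1", "AB000001.1"]
print(duplicate_counts(names))
print(drop_duplicates(names))
print(remove_matching(names, r"^MG\d+"))
```

The filtered list of names could then decide which signature files get passed to a fresh index build, which avoids needing a remove operation on the index itself.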
first of all - wow, that's a big database 👍 second of all - where did the duplicate accessions come from!? either third, a stopgap measure could be to do |
Looking at these names... are you using For debugging purposes, can you also check the |
Hi, so overnight I ran this:
where
So there are a few duplicated accession numbers in the fna.gz files, but not thousands of them. Next step is running |
Re:
Yes, I ran with --name-from-first. Here is my |
Well, one thing seems clear; I'm not using current best practices. A lot of what I've done is based on what I read in the docs here: and the walkthrough here: but it seems like I should be following: Does that seem accurate? |
This is not your fault... We didn't move the new way from sourmash_databases into the docs, and the PR isn't even merged there... But most of the changes in the PR are because |
Re:
This is an excellent idea. I started using that file myself to build my own mapping file rather than the 2018-ncbi... repository walkthrough, because all the data is right there in a single file, in separate columns: accession, taxid, etc.
Re:
I don't mind one way or the other if it's my fault; I make lots of mistakes and I'm happy to throw more onto the pile, haha. I just wanted you to have some idea of what "fresh eyes" see when they come to sourmash. It might help with docs, etc. I will start working from sourmash_databases and see what I can manage... Thanks for your help!! |
Hello, I have the results from doing the following (from above):
The signature counts from the individual signatures computed with
Thanks again for helping to solve this. Regarding using |
I actually decided to try running |
Thanks for digging down! I will check the DBs I built in https://github.com/dib-lab/sourmash_databases and see if they have duplicates too... If they do, there is a pretty large bug somewhere 😨 |
Hi, @luizirber, I'm happy to help where I can. So, I've gotten the results from running
Then I indexed those signatures (each named for the fna.gz filename from which the signature was computed) as follows:
And 9 hours after the indexing finishes, I run
And the results (according to the counts, anyhow) are identical to what I saw when I ran
To speed up my troubleshooting, I subset the 14,000+ individual signature files that were giving replicated signatures in the sbt db and then ran Any recommendations? Should I properly script up a test to subset the signatures at 375,000 signatures, etc., to see if there is a threshold number of signatures that causes this behavior? |
Yup, seems like we have a huge bug =(
Thanks for posting the commands, I'm also counting the number of signatures in the largest SBT I have at the moment. While that's running, some potential debugging ideas...
That might be the case, but I don't remember any conditions that would trigger after a specific number of sigs is inserted... But as we saw in this issue, my gut feeling is probably wrong =P UPDATE: yup, also seeing duplicated sigs in the DB I built. sigh. |
I started https://github.com/luizirber/2020-08-14-debug-sbt for tracking this. Now building a mock SBT with sig names from genbank-bacteria, with 652k genomes. |
Re:
and
Thanks for tackling this. It's probably not your favorite task :/
Re:
I am generating single-hash signatures for my whole dataset now. Will try running Will move future comments to https://github.com/luizirber/2020-08-14-debug-sbt |
More like "not the task I should be doing" 😬
It is probably easier to generate with the API (like I did in https://github.com/luizirber/2020-08-14-debug-sbt), but the result of that is... no duplicates 👀 So! There is something weird going on with the |
Thank you!!! 🤩
Okay, I'll use the Python API to build my SBT, following your example.
Yes, the parsing for |
please file updates over in #1171! |
Righto 😊 |
On Fri, Aug 14, 2020 at 02:38:29PM -0700, Nathan Brown wrote:
Righto :blush:
;) no worries, but it's always nice to have an issue to close with a PR!
|
#1477 could add support for "masking" arbitrary signatures from search and gather. |
see also #433. |
masking of signatures was added at CLI in #1871. Most of the other things we discuss in here have also been resolved elsewhere 🎉 |
for now, you have to rebuild sourmash lca databases from scratch if you want to change their content, but it should be straightforward to implement both an add/append and a remove.
then, separately, we would have to provide command line options for doing so.
ref #555 (comment).
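A minimal sketch of why add and remove should be straightforward, assuming an LCA-style database is essentially a mapping from hash values to the identifiers that contain them. `ToyLCADatabase` is hypothetical and does not mirror the internals of sourmash's LCA database; lineage bookkeeping is left out.

```python
from collections import defaultdict


class ToyLCADatabase:
    """Hypothetical stand-in: hash value -> set of identifiers."""

    def __init__(self):
        self.hash_to_idents = defaultdict(set)
        self.ident_to_hashes = {}

    def add(self, ident, hashes):
        if ident in self.ident_to_hashes:
            raise ValueError(f"{ident!r} is already in the database")
        hashes = frozenset(hashes)
        self.ident_to_hashes[ident] = hashes
        for h in hashes:
            self.hash_to_idents[h].add(ident)

    def remove(self, ident):
        # Raises KeyError if the identifier was never added.
        hashes = self.ident_to_hashes.pop(ident)
        for h in hashes:
            owners = self.hash_to_idents[h]
            owners.discard(ident)
            if not owners:
                # No remaining signature has this hash, so forget it.
                del self.hash_to_idents[h]
```

Keeping the reverse mapping (`ident_to_hashes`) is the design choice that makes removal cheap: only the hashes belonging to the removed signature are touched, rather than the whole table.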