Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enable the addition (and removal?) of signatures from LCA databases. #849

Closed
ctb opened this issue Jan 19, 2020 · 29 comments
Closed

enable the addition (and removal?) of signatures from LCA databases. #849

ctb opened this issue Jan 19, 2020 · 29 comments

Comments

@ctb
Copy link
Contributor

ctb commented Jan 19, 2020

for now, you have to rebuild sourmash lca databases from scratch if you want to change their content, but it should be straightforward to implement both an add/append and a remove.

then, separately, we would have to provide command line options for doing so.

ref #555 (comment).

@ctb
Copy link
Contributor Author

ctb commented May 3, 2020

insert was implemented in #946.

Maybe Index objects should support remove, and then we can add command line options for dealing with that.

ref #949 maybe? and/or have a general CLI submodule for dealing with databases.

@luizirber
Copy link
Member

Maybe Index objects should support remove, and then we can add command line options for dealing with that.

Intriguing. SBTs can support remove too (might need to think how to keep them dense, tho), and for the LinearIndex it's trivial. And for future indices that don't support it, throw an error?

ref #949 maybe? and/or have a general CLI submodule for dealing with databases.

I thought about using index as that CLI submodule, but it would complicate the current use case too much.

Other potential operations in SBTs that would fit a database CLI:

  • migrate (from going from a previous SBT version to a new one)
  • convert (from one database format to another)
  • prepare (from a leaf-only SBT to a full SBT)
  • rescale? (although we can pass a different --scaled during query)
  • for SBTs there is also the potential to reuse the tree structure for other k too. If it was indexed with k=21, all the internal nodes have those k, but the tree structure can be used for k=51 too (but need to rebuild all internal nodes)

@ctb
Copy link
Contributor Author

ctb commented May 8, 2020

could also provide a screen or mask option, which would be useful for benchmarking / hold-one-out studies (I kinda need this for charcoal). see also #985.

@ctb
Copy link
Contributor Author

ctb commented May 13, 2020

for testing/evaluation purposes where performance is not a big issue, using search --containment instead of gather and then removing matches based on md5sum should address my needs for e.g. hold-one-out.

@ctb
Copy link
Contributor Author

ctb commented Jul 8, 2020

random thoughts from today. for testing/evaluation --

  • lca databases can easily mask a particular idx! super easy.
  • in situations where you have a small set of signatures you want to hold out, you can build the SBT without those signatures and just include all but one on the command line.

@nmb85
Copy link

nmb85 commented Jul 31, 2020

This would be very helpful for me right now. I have a really strange issue with an sbt.zip db that I built: although I am only passing unique signatures to sourmash index (as confirmed with sort | uniq -c run on the input file that contains the paths to all the signatures: genomes/all/GCA/gca_latest_genomic_sig_minus1.txt), I get astronomical duplication of some signatures in my db:

With v3.4.1rc1, I set up my db as follows:
sourmash index -k 31 /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip /data/ncbi/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT/GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.sig --from-file genomes/all/GCA/gca_latest_genomic_sig_minus1.txt

After setting up my db and finding that sourmash gather works, I run:

sourmash sig describe /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip > gca_genbank.sbt.zip.manifest #put a list of the signatures into a text file

grep "^signature:" gca_genbank.sbt.zip.manifest | awk '{print $2}' | sort | uniq -c | sort -nr > gca_accession.count #get a count of duplicated accession numbers from signature names

(I did the above, because while trying to build an lca db from the signatures in the sbt.zip, sourmash lca index complained that I had signatures with duplicate names and failed.)

And if I head the file with counts of the accessions (calculated by uniq -c), I get this:

   1261 MG522850.1
   1231 KY084478.1
    713 MF677845.1
    317 KX266637.1
    268 ODAX01000001.1
    161 MRLE01000001.1
    160 AAITDJ010000001.1
    132 AAJDZY010000001.1
    107 CP018035.1
     98 CP013906.1

That is, in two cases - accession # MG522850.1 and KY084478.1 - more than 1,000 replicated signatures. There are 14,680 signatures that are at least duplicated, and the remaining 707,167 are unique signatures, for a total of 721,847 signatures (from the NCBI genomes GCA* db).

So, it would be really handy to be able to remove signatures, especially en masse or with reg-ex after loading the sbt.zip db into memory once. Can you please suggest an efficient way to do this with the Python API? Thanks!

@ctb
Copy link
Contributor Author

ctb commented Jul 31, 2020

first of all - wow, that's a big database 👍

second of all - where did the duplicate accessions come from!? either sbt index is adding signatures multiple times, or the duplicate accessions are in the input signatures... you checked the filenames, but do you know if there are different files with the same accession?

third, a stopgap measure could be to do sourmash sig cat --unique which will eliminate duplicate signatures based on their hash content.

@luizirber
Copy link
Member

And if I head the file with counts of the accessions (calculated by uniq -c), I get this:

   1261 MG522850.1
   1231 KY084478.1
    713 MF677845.1
    317 KX266637.1
    268 ODAX01000001.1
    161 MRLE01000001.1
    160 AAITDJ010000001.1
    132 AAJDZY010000001.1
    107 CP018035.1
     98 CP013906.1

That is, in two cases - accession # MG522850.1 and KY084478.1 - more than 1,000 replicated signatures. There are 14,680 signatures that are at least duplicated, and the remaining 707,167 are unique signatures, for a total of 721,847 signatures (from the NCBI genomes GCA* db).

Looking at these names... are you using --name-from-first to build the signatures? I moved away from that in sourmash_databases because it generated a lot of confusion...

For debugging purposes, can you also check the head -1 for all the FASTA files, and see if they are all unique? And, if they aren't, if they match the numbers above?

@nmb85
Copy link

nmb85 commented Jul 31, 2020

Hi, so overnight I ran this:

time while read P; do zcat $P | head -1; done < gca_latest_genomic_fna.txt > gca_latest_genomic_fna_accessions.txt

where gca_latest_genomic_fna.txt is a textfile containing >750k fna.gz files over which I ran sourmash compute. After 3 hours that finished and I looked at the contents of the result this morning and got:

cut -c 2- gca_latest_genomic_fna_accessions.txt | awk '{print $1}' | sort | uniq -c | sort -nr | head -100
      3 EU339936.1
      2 X68295.1
      2 MUGM01000001.1
      2 MK380015.1
      2 MK380014.1
      2 MK072507.1
      2 MK072498.1
      2 MK072489.1
      2 MK072437.1
      2 MK072383.1
      2 MK072332.1
      2 MK072243.1
      2 MK072199.1
      2 MK072132.1
      2 MK072066.1
      2 MK072042.1
      2 MK071998.1
      2 MK071979.1
      2 MH614262.1
      2 MH614261.1
      2 MH171300.1
      2 MG747435.1
      2 MF477236.1
      2 LYWJ01000001.1
      2 LN879430.1
      2 LN850107.1
      2 LN813019.1
      2 KJ944830.1
      2 JX181822.1
      2 JX181814.1
      2 DQ347950.1
      2 CP009805.1
      2 CP009075.1
      2 CP004500.1
      2 CP004495.1
      2 CP004493.1
      2 CP004492.1
      2 CP004484.1
      2 CP004479.1
      2 CP004472.1
      2 CP004468.1
      2 CP004466.1
      2 CP004431.1
      2 CM000462.1
      2 AP018280.1
      1 Z98198.1
      1 Z97873.1
      1 Z86099.2
      1 Z73124.1
      1 Z69910.1
      1 Z69620.1
      1 Z66548.1
      1 Z48731.1
      1 Z48630.1
      1 Z48506.1
      1 Z48182.1
      1 Z47794.1
      1 Z46351.1
      1 Z26920.1
      1 Z25771.1
      1 Z24758.1
      1 Z21647.1
      1 Z18946.1
      1 Z11866.1
      1 Z11128.1
      1 Y18263.1
      1 Y17873.1
      1 Y17017.1
      1 Y17014.1
      1 Y16780.1
      1 Y16627.1
      1 Y16104.1
      1 Y15937.2
      1 Y15936.2
      1 Y15934.1
      1 Y15176.1
      1 Y15175.1
      1 Y15174.1
      1 Y15173.1
      1 Y15034.1
      1 Y14874.1
      1 Y14700.1
      1 Y14570.1
      1 Y14365.1
      1 Y13918.1
      1 Y13463.1
      1 Y13184.1
      1 Y13051.1
      1 Y12083.1
      1 Y11604.1
      1 Y11530.1
      1 Y11099.1
      1 Y11097.1
      1 Y11023.2
      1 Y10973.1
      1 Y10237.1
      1 Y09921.1
      1 Y09854.1
      1 Y09762.1
      1 Y09598.1

So there are a few duplicated accession numbers in the fna.gz files, but not thousands of them.

Next step is running sourmash sig describe over the individual sigs computed by sourmash compute on each of the individual fna.gz files. That will take a few hours and I may be away from my local server for a week before I can report results from that check.

@nmb85
Copy link

nmb85 commented Jul 31, 2020

Re:

Looking at these names... are you using --name-from-first to build the signatures? I moved away from that in sourmash_databases because it generated a lot of confusion...

Yes, I ran with --name-from-first.

Here is my sourmash compute command:
cat /data/ncbi/genomes/all/GCA/gca_latest_genomic_fna.txt | parallel -j 60 'sourmash compute -k 21,31,51 -f --name-from-first --track-abundance --scaled 1000 --output "{= s:\.[^.]+$::;s:\.[^.]+$::; =}".sig {}'

@nmb85
Copy link

nmb85 commented Jul 31, 2020

Well, one thing seems clear; I'm not using current best practices. A lot of what I've done is based on what I read in the docs here:
https://sourmash.readthedocs.io/en/latest/

and the walkthrough here:
https://github.com/dib-lab/2018-ncbi-lineages

but it seems like I should be following:
https://github.com/dib-lab/sourmash_databases

Does that seem accurate?

@luizirber
Copy link
Member

This is not your fault... We didn't move the new way from sourmash_databases into the docs, and the PR isn't even merged there...

But most of the changes in the PR are because --name-from-first makes it hard to figure out taxonomy later. I modified sourmash_databases to set a custom name based on "Accession, name, strain, assembly" because then we have consistent accessions to map to taxonomy IDs later (and the other info is useful for display). And because it is now based on the assembly_summary.txt it is also easier to rebuild/update later (because you just need an assembly_summary.txt to get all the info/download all the genomes)

@nmb85
Copy link

nmb85 commented Jul 31, 2020

Re:

And because it is now based on the assembly_summary.txt it is also easier to rebuild/update later (because you just need an assembly_summary.txt to get all the info/download all the genomes)

This is an excellent idea. I started using that file myself to build my own mapping file rather than the 2018-ncbi... repository walkthrough, because all the data is right there in a single file, separate columns: accession, taxid, etc

Re:

This is not your fault... We didn't move the new way from sourmash_databases into the docs, and the PR isn't even merged there...

I don't mind one way or the other if it's my fault; I make lots of mistakes and I'm happy to throw more onto the pile, haha. I just wanted you to have some idea of what "fresh eyes" see when they come to sourmash. It might help with docs, etc. I will start working from sourmash_databases and see what I can manage...

Thanks for your help!!

@nmb85
Copy link

nmb85 commented Aug 11, 2020

Hello, I have the results from doing the following (from above):

Next step is running sourmash sig describe over the individual sigs computed by sourmash compute on each of the individual fna.gz files. That will take a few hours and I may be away from my local server for a week before I can report results from that check.

3       EU339936.1 Rhynchosia golden mosaic virus strain 1045 segment DNA-A, complete sequence
2       X68295.1 Mycobacteriophage I3 gene for structural protein (70 KD)
2       MUGM01000001.1 Calypte anna isolate BGI_N300 000239F, whole genome shotgun sequence
2       MK380015.1 Klebsiella phage Kund-ULIP47, complete genome
2       MK380014.1 Klebsiella phage K1-ULIP33, complete genome
2       MK072507.1 Sylvanvirus sp. clone Sylvanvirus_1 genomic sequence
2       MK072498.1 Solumvirus sp. clone Solumvirus_1 genomic sequence
2       MK072489.1 Solivirus sp. clone Solivirus_1 genomic sequence
2       MK072437.1 Satyrvirus sp. clone Satyrvirus_1 genomic sequence
2       MK072383.1 Hyperionvirus sp. clone Hyperionvirus_1 genomic sequence
2       MK072332.1 Homavirus sp. clone Homavirus_1 genomic sequence
2       MK072243.1 Harvfovirus sp. clone Harvfovirus_1 genomic sequence
2       MK072199.1 Gaeavirus sp. clone Gaeavirus_1 genomic sequence
2       MK072132.1 Faunusvirus sp. clone Faunusvirus_1 genomic sequence
2       MK072066.1 Edafosvirus sp. clone Edafosvirus_1 genomic sequence
2       MK072042.1 Dasosvirus sp. clone Dasosvirus_1 genomic sequence
2       MK071998.1 Barrevirus sp. clone Barrevirus_1 genomic sequence
2       MK071979.1 Terrestrivirus sp. clone Terrestrivirus_1 genomic sequence
2       MH614262.1 Bird's-foot trefoil nucleorhabdovirus isolate LC, complete genome
2       MH614261.1 Bird's-foot trefoil enamovirus 1 isolate LC, complete genome
2       MH171300.1 Marine RNA virus BC-4 non-structural polyprotein and structural polyprotein genes, complete cds
2       MF477236.1 Nocardia phage NTR1, complete genome
2       LYWJ01000001.1 Homo sapiens isolate AK1 A_00590001_001, whole genome shotgun sequence
2       LN879430.1 Herbinix sp. SD1D genome assembly SD1D, chromosome : I
2       LN850107.1 Alloactinosynnema sp. L-07 genome assembly Alloactinosynnema sp. L-07, chromosome : I
2       LN813019.1 Halomonas sp. R57-5 genome assembly HalomonasR57-5, chromosome : I
2       KJ944830.1 Pseudoalteromonas phage B8b, partial genome
2       JX181822.1 Salmonella phage SPT-1, partial genome
2       JX181814.1 Salmonella phage SBA-1781, partial genome
2       DQ347950.1 Rhynchosia golden mosaic virus isolate Soybean 1068 segment DNA-A, complete sequence
2       CP009805.1 Botrytis cinerea B05.10 chromosome BCIN01, complete sequence
2       CP009075.1 Verticillium dahliae JR2 chromosome 1, complete sequence
2       CP004500.1 Saccharomyces cerevisiae YJM1592 chromosome I genomic sequence
2       CP004495.1 Saccharomyces cerevisiae YJM1526 chromosome I genomic sequence
2       CP004493.1 Saccharomyces cerevisiae YJM1478 chromosome I genomic sequence
2       CP004492.1 Saccharomyces cerevisiae YJM1477 chromosome I genomic sequence
2       CP004484.1 Saccharomyces cerevisiae YJM1434 chromosome I genomic sequence
2       CP004479.1 Saccharomyces cerevisiae YJM1415 chromosome I genomic sequence
2       CP004472.1 Saccharomyces cerevisiae YJM1387 chromosome I genomic sequence
2       CP004468.1 Saccharomyces cerevisiae YJM1381 chromosome I genomic sequence
2       CP004466.1 Saccharomyces cerevisiae YJM1355 chromosome I genomic sequence
2       CP004431.1 Saccharomyces cerevisiae YJM683 chromosome I genomic sequence
2       CM000462.1 Homo sapiens chromosome 1, whole genome shotgun sequence
2       AP018280.1 Calothrix sp. NIES-4101 DNA, complete genome
1       Z98198.1 Peanut stunt virus satellite RNA (P4-satRNA)
1       Z97873.1 Beet soil-borne virus RNA1 for 145 kDa protein, and 204 kDa protein
1       Z86099.2 Herpes simplex virus type 2 (strain HG52), complete genome
1       Z73124.1 Sweetpotato mild mottle virus mRNA for polyprotein
1       Z69910.1 Groundnut rosette virus complete genome, strain MC1
1       Z69620.1 European brown hare syndrome virus RNA
1       Z66548.1 Puumala virus segment L, genomic RNA, strain Sotkamo
1       Z48731.1 Human immunodeficiency virus type 2 gag, pol, vif, vpx, vpr, tat, rev, nef and env genes
1       Z48630.1 Cocksfoot Mottle Virus genes for polyprotein, RNA dependent RNA polymerase and coat protein
1       Z48506.1 Brome streak mosaic rymovirus polyprotein RNA
1       Z48182.1 Tomato leaf curl virus - Bangalore I V1, V2, C1, C2, C3 and C4 genes
1       Z47794.1 Streptococcus phage Cp-1 DNA, complete genome
1       Z46351.1 Lychnis ringspot virus RNA for beta-A, beta-B, beta-C, beta-D proteins
1       Z26920.1 Johnson grass mosaic virus gene for protease 1 and 3, helper component 6K protein, coat protein, nuclear inclusion proteins
1       Z25771.1 Human astrovirus type 1 genes for capsid protein and nonstructural protein
1       Z24758.1 Indian cassava mosaic virus encoding AR0 complete CDS
1       Z21647.1 P.asiatica mosaic virus genomic RNA
1       Z18946.1 Mycobacterium phage L5 complete genome
1       Z11866.1 Cladosporium fulvum T-1 virus LTR-retrotransposon encoding homologues to retroviral gag, pol, and env genes
1       Z11128.1 Friend murine leukemia virus FB29 complete genome
1       Y18263.1 Viral hemorrhagic septicemia virus Fil3 complete genome
1       Y17873.1 Thogoto virus genomic RNA for PB2 polymerase subunit
1       Y17017.1 Human T-cell lymphotropic virus type 1 DNA, long terminal repeat (patient Lib2)
1       Y17014.1 Human T-cell lymphotropic virus type 1 DNA, long terminal repeat (patient Efe1)
1       Y16780.1 variola minor virus complete genome
1       Y16627.1 Ovine enzootic nasal tumour virus complete sequence
1       Y16104.1 Physalis mottle tymovirus OP, RP, CP genes
1       Y15937.2 Sheep astrovirus, complete genome, genomic RNA
1       Y15936.2 Avastrovirus 1 complete genome, genomic RNA

The signature counts from the individual signatures computed with sourmash compute match the counts of the fna files from which they were computed. So it's when I run sourmash index that the signatures are massively replicated:

sourmash index -k 31 /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip /data/ncbi/genomes/all/GCA/000/001/215/GCA_000001215.4_Release_6_plus_ISO1_MT/GCA_000001215.4_Release_6_plus_ISO1_MT_genomic.sig --from-file genomes/all/GCA/gca_latest_genomic_sig_minus1.txt

Thanks again for helping to solve this. Regarding using sourmash sig cat --unique, that is a great idea (and a handy cli tool), but I'm afraid that the process is killed before it completes. I'll try another approach if you can suggest one, try to figure something out with the Python API, or I'll wait to see what happens with the sourmash index command in future releases.

@nmb85
Copy link

nmb85 commented Aug 11, 2020

I actually decided to try running sourmash compute without --name-from-first to see if @luizirber's instinct is correct. I will try to index the signatures named for the fna files from which they were computed next.

@luizirber
Copy link
Member

Thanks for digging down! I will check the DBs I built in https://github.com/dib-lab/sourmash_databases and see if they have duplicates too... If they do, there is a pretty large bug somewhere 😨

@nmb85
Copy link

nmb85 commented Aug 14, 2020

Hi, @luizirber, I'm happy to help where I can. So, I've gotten the results from running sourmash compute without the --name-from-first option over the original fasta files for all genbank assemblies using v3.4.1rc1 as in all above comments.
Here is that command:

cat /data/ncbi/genomes/all/GCA/gca_latest_genomic_fna.txt | parallel -j 60 'sourmash compute -k 21,31,51 --track-abundance --scaled 1000 --output "{= s:\.[^.]+$::;s:\.[^.]+$::; =}"_acc.sig {}'

Then I indexed those signatures (each named for the fna.gz filename from which the signature was computed) as follows:

time sourmash index -k 31 /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip /data/ncbi/genomes/all/GCA/013/183/675/GCA_013183675.1_ASM1318367v1/GCA_013183675.1_ASM1318367v1_genomic_acc.sig --from-file /data/ncbi/genomes/all/GCA/filename_sig_list_minus1.txt

And 9 hours after the indexing finishes, I run sourmash sig describe to get a manifest of the signatures in the db and then run a shell pipeline to count the number of signatures:

sourmash sig describe /data/ncbi/genomes/all/GCA/gca_genbank.sbt.zip > filename_sig.sbt.zip.manifest

grep "^signature:" filename_sig.sbt.zip.manifest | awk '{print $2}' | sort | uniq -c | sort -nr > new_gca_accession.count #get counts of signatures

And the result (according to the counts, anyhow) are identical to what I saw when I ran sourmash compute with the --name-from-first option and then indexed those signatures. So I don't believe that the --name-from-first option is responsible for this behavior. Here are the first few lines from the signature count file, new_gca_accession.count:

1261 /data/ncbi/genomes/all/GCA/008/800/355/GCA_008800355.1_ASM880035v1/GCA_008800355.1_ASM880035v1_genomic.fna.gz
   1231 /data/ncbi/genomes/all/GCA/004/066/835/GCA_004066835.1_ASM406683v1/GCA_004066835.1_ASM406683v1_genomic.fna.gz
    713 /data/ncbi/genomes/all/GCA/004/066/735/GCA_004066735.1_ASM406673v1/GCA_004066735.1_ASM406673v1_genomic.fna.gz
    317 /data/ncbi/genomes/all/GCA/002/536/245/GCA_002536245.1_ASM253624v1/GCA_002536245.1_ASM253624v1_genomic.fna.gz
    268 /data/ncbi/genomes/all/GCA/900/219/735/GCA_900219735.1_SRS053214/GCA_900219735.1_SRS053214_genomic.fna.gz
    161 /data/ncbi/genomes/all/GCA/010/078/015/GCA_010078015.1_ASM1007801v1/GCA_010078015.1_ASM1007801v1_genomic.fna.gz
    160 /data/ncbi/genomes/all/GCA/008/759/475/GCA_008759475.1_PDT000593948.1/GCA_008759475.1_PDT000593948.1_genomic.fna.gz
    132 /data/ncbi/genomes/all/GCA/006/946/905/GCA_006946905.1_PDT000192888.2/GCA_006946905.1_PDT000192888.2_genomic.fna.gz
    107 /data/ncbi/genomes/all/GCA/008/821/865/GCA_008821865.1_ASM882186v1/GCA_008821865.1_ASM882186v1_genomic.fna.gz
     98 /data/ncbi/genomes/all/GCA/008/821/745/GCA_008821745.1_ASM882174v1/GCA_008821745.1_ASM882174v1_genomic.fna.gz
     98 /data/ncbi/genomes/all/GCA/008/378/965/GCA_008378965.1_PDT000583170.1/GCA_008378965.1_PDT000583170.1_genomic.fna.gz
     72 /data/ncbi/genomes/all/GCA/006/441/735/GCA_006441735.1_ASM644173v1/GCA_006441735.1_ASM644173v1_genomic.fna.gz
     64 /data/ncbi/genomes/all/GCA/010/406/965/GCA_010406965.1_PDT000256503.2/GCA_010406965.1_PDT000256503.2_genomic.fna.gz
     60 /data/ncbi/genomes/all/GCA/902/719/495/GCA_902719495.1_12083_7_46/GCA_902719495.1_12083_7_46_genomic.fna.gz
     59 /data/ncbi/genomes/all/GCA/009/948/425/GCA_009948425.1_ASM994842v1/GCA_009948425.1_ASM994842v1_genomic.fna.gz
     57 /data/ncbi/genomes/all/GCA/006/164/165/GCA_006164165.1_PDT000241615.2/GCA_006164165.1_PDT000241615.2_genomic.fna.gz
     55 /data/ncbi/genomes/all/GCA/005/536/435/GCA_005536435.1_PDT000333183.1/GCA_005536435.1_PDT000333183.1_genomic.fna.gz
     53 /data/ncbi/genomes/all/GCA/006/701/085/GCA_006701085.1_PDT000317479.1/GCA_006701085.1_PDT000317479.1_genomic.fna.gz
     52 /data/ncbi/genomes/all/GCA/008/896/965/GCA_008896965.1_PDT000600819.1/GCA_008896965.1_PDT000600819.1_genomic.fna.gz
     52 /data/ncbi/genomes/all/GCA/008/845/205/GCA_008845205.1_PDT000599973.1/GCA_008845205.1_PDT000599973.1_genomic.fna.gz
     52 /data/ncbi/genomes/all/GCA/004/036/575/GCA_004036575.1_ASM403657v1/GCA_004036575.1_ASM403657v1_genomic.fna.gz
     46 /data/ncbi/genomes/all/GCA/001/166/405/GCA_001166405.1_8525_3_42/GCA_001166405.1_8525_3_42_genomic.fna.gz
     44 /data/ncbi/genomes/all/GCA/008/354/005/GCA_008354005.1_PDT000581113.1/GCA_008354005.1_PDT000581113.1_genomic.fna.gz
     43 /data/ncbi/genomes/all/GCA/004/084/815/GCA_004084815.1_ASM408481v1/GCA_004084815.1_ASM408481v1_genomic.fna.gz
     42 /data/ncbi/genomes/all/GCA/009/903/655/GCA_009903655.1_ASM990365v1/GCA_009903655.1_ASM990365v1_genomic.fna.gz
     42 /data/ncbi/genomes/all/GCA/004/066/535/GCA_004066535.1_ASM406653v1/GCA_004066535.1_ASM406653v1_genomic.fna.gz
     41 /data/ncbi/genomes/all/GCA/008/721/655/GCA_008721655.1_PDT000592674.1/GCA_008721655.1_PDT000592674.1_genomic.fna.gz
     39 /data/ncbi/genomes/all/GCA/012/004/665/GCA_012004665.1_PDT000717166.1/GCA_012004665.1_PDT000717166.1_genomic.fna.gz
     38 /data/ncbi/genomes/all/GCA/900/052/195/GCA_900052195.1_10678_7_85/GCA_900052195.1_10678_7_85_genomic.fna.gz
     38 /data/ncbi/genomes/all/GCA/008/612/285/GCA_008612285.1_PDT000070042.2/GCA_008612285.1_PDT000070042.2_genomic.fna.gz
     38 /data/ncbi/genomes/all/GCA/008/001/925/GCA_008001925.1_PDT000049072.2/GCA_008001925.1_PDT000049072.2_genomic.fna.gz
     38 /data/ncbi/genomes/all/GCA/006/051/815/GCA_006051815.1_PDT000207254.2/GCA_006051815.1_PDT000207254.2_genomic.fna.gz
     38 /data/ncbi/genomes/all/GCA/001/117/225/GCA_001117225.1_8447_8_32/GCA_001117225.1_8447_8_32_genomic.fna.gz
     37 /data/ncbi/genomes/all/GCA/004/647/945/GCA_004647945.1_PDT000102292.2/GCA_004647945.1_PDT000102292.2_genomic.fna.gz
     37 /data/ncbi/genomes/all/GCA/002/653/475/GCA_002653475.1_ASM265347v1/GCA_002653475.1_ASM265347v1_genomic.fna.gz
     36 /data/ncbi/genomes/all/GCA/900/052/335/GCA_900052335.1_10625_3_19/GCA_900052335.1_10625_3_19_genomic.fna.gz
     36 /data/ncbi/genomes/all/GCA/006/309/005/GCA_006309005.1_PDT000274133.2/GCA_006309005.1_PDT000274133.2_genomic.fna.gz
     35 /data/ncbi/genomes/all/GCA/006/441/215/GCA_006441215.1_ASM644121v1/GCA_006441215.1_ASM644121v1_genomic.fna.gz
     34 /data/ncbi/genomes/all/GCA/005/696/255/GCA_005696255.1_PDT000392782.1/GCA_005696255.1_PDT000392782.1_genomic.fna.gz
     34 /data/ncbi/genomes/all/GCA/002/596/765/GCA_002596765.1_ASM259676v1/GCA_002596765.1_ASM259676v1_genomic.fna.gz
     32 /data/ncbi/genomes/all/GCA/001/364/375/GCA_001364375.1_10608_2_26/GCA_001364375.1_10608_2_26_genomic.fna.gz
     31 /data/ncbi/genomes/all/GCA/010/236/905/GCA_010236905.1_PDT000445952.1/GCA_010236905.1_PDT000445952.1_genomic.fna.gz
     31 /data/ncbi/genomes/all/GCA/009/628/015/GCA_009628015.1_PDT000030926.2/GCA_009628015.1_PDT000030926.2_genomic.fna.gz
     31 /data/ncbi/genomes/all/GCA/008/845/845/GCA_008845845.1_PDT000114462.2/GCA_008845845.1_PDT000114462.2_genomic.fna.gz
     31 /data/ncbi/genomes/all/GCA/008/685/245/GCA_008685245.1_ASM868524v1/GCA_008685245.1_ASM868524v1_genomic.fna.gz
     31 /data/ncbi/genomes/all/GCA/008/258/605/GCA_008258605.1_PDT000576072.1/GCA_008258605.1_PDT000576072.1_genomic.fna.gz
     30 /data/ncbi/genomes/all/GCA/010/400/065/GCA_010400065.1_PDT000256434.2/GCA_010400065.1_PDT000256434.2_genomic.fna.gz
     30 /data/ncbi/genomes/all/GCA/008/942/165/GCA_008942165.1_PDT000144708.2/GCA_008942165.1_PDT000144708.2_genomic.fna.gz
     30 /data/ncbi/genomes/all/GCA/008/605/865/GCA_008605865.1_ASM860586v1/GCA_008605865.1_ASM860586v1_genomic.fna.gz
     30 /data/ncbi/genomes/all/GCA/005/567/835/GCA_005567835.1_ASM556783v1/GCA_005567835.1_ASM556783v1_genomic.fna.gz
     30 /data/ncbi/genomes/all/GCA/004/391/645/GCA_004391645.1_PDT000004314.4/GCA_004391645.1_PDT000004314.4_genomic.fna.gz
     29 /data/ncbi/genomes/all/GCA/902/719/935/GCA_902719935.1_12083_7_86/GCA_902719935.1_12083_7_86_genomic.fna.gz
     29 /data/ncbi/genomes/all/GCA/900/052/495/GCA_900052495.1_10678_6_12/GCA_900052495.1_10678_6_12_genomic.fna.gz
     29 /data/ncbi/genomes/all/GCA/008/513/255/GCA_008513255.1_PDT000586050.1/GCA_008513255.1_PDT000586050.1_genomic.fna.gz
     29 /data/ncbi/genomes/all/GCA/005/408/385/GCA_005408385.1_ASM540838v1/GCA_005408385.1_ASM540838v1_genomic.fna.gz
     29 /data/ncbi/genomes/all/GCA/003/109/665/GCA_003109665.1_ASM310966v1/GCA_003109665.1_ASM310966v1_genomic.fna.gz
     28 /data/ncbi/genomes/all/GCA/006/498/755/GCA_006498755.1_PDT000161244.2/GCA_006498755.1_PDT000161244.2_genomic.fna.gz
     27 /data/ncbi/genomes/all/GCA/010/105/735/GCA_010105735.1_PDT000153679.2/GCA_010105735.1_PDT000153679.2_genomic.fna.gz
     27 /data/ncbi/genomes/all/GCA/008/032/415/GCA_008032415.1_PDT000041808.3/GCA_008032415.1_PDT000041808.3_genomic.fna.gz
     27 /data/ncbi/genomes/all/GCA/006/874/325/GCA_006874325.1_PDT000065880.2/GCA_006874325.1_PDT000065880.2_genomic.fna.gz
     27 /data/ncbi/genomes/all/GCA/006/179/705/GCA_006179705.1_PDT000267758.2/GCA_006179705.1_PDT000267758.2_genomic.fna.gz
     27 /data/ncbi/genomes/all/GCA/004/647/905/GCA_004647905.1_PDT000102281.2/GCA_004647905.1_PDT000102281.2_genomic.fna.gz
     26 /data/ncbi/genomes/all/GCA/010/055/505/GCA_010055505.1_PDT000658019.1/GCA_010055505.1_PDT000658019.1_genomic.fna.gz
     26 /data/ncbi/genomes/all/GCA/008/832/605/GCA_008832605.1_PDT000119810.2/GCA_008832605.1_PDT000119810.2_genomic.fna.gz
     26 /data/ncbi/genomes/all/GCA/005/163/945/GCA_005163945.1_PDT000181854.2/GCA_005163945.1_PDT000181854.2_genomic.fna.gz
     26 /data/ncbi/genomes/all/GCA/003/109/885/GCA_003109885.1_ASM310988v1/GCA_003109885.1_ASM310988v1_genomic.fna.gz
     26 /data/ncbi/genomes/all/GCA/002/650/675/GCA_002650675.1_ASM265067v1/GCA_002650675.1_ASM265067v1_genomic.fna.gz
     25 /data/ncbi/genomes/all/GCA/900/052/795/GCA_900052795.1_10678_6_38/GCA_900052795.1_10678_6_38_genomic.fna.gz
     25 /data/ncbi/genomes/all/GCA/009/937/945/GCA_009937945.1_ASM993794v1/GCA_009937945.1_ASM993794v1_genomic.fna.gz
     25 /data/ncbi/genomes/all/GCA/008/746/075/GCA_008746075.1_PDT000593690.1/GCA_008746075.1_PDT000593690.1_genomic.fna.gz
     25 /data/ncbi/genomes/all/GCA/006/843/125/GCA_006843125.1_PDT000324448.1/GCA_006843125.1_PDT000324448.1_genomic.fna.gz
     25 /data/ncbi/genomes/all/GCA/006/498/895/GCA_006498895.1_PDT000161392.2/GCA_006498895.1_PDT000161392.2_genomic.fna.gz
     25 /data/ncbi/genomes/all/GCA/002/648/635/GCA_002648635.1_ASM264863v1/GCA_002648635.1_ASM264863v1_genomic.fna.gz
     24 /data/ncbi/genomes/all/GCA/008/911/545/GCA_008911545.1_PDT000134773.2/GCA_008911545.1_PDT000134773.2_genomic.fna.gz
     24 /data/ncbi/genomes/all/GCA/008/821/265/GCA_008821265.1_ASM882126v1/GCA_008821265.1_ASM882126v1_genomic.fna.gz
     24 /data/ncbi/genomes/all/GCA/008/456/745/GCA_008456745.1_PDT000359439.1/GCA_008456745.1_PDT000359439.1_genomic.fna.gz
     24 /data/ncbi/genomes/all/GCA/006/282/275/GCA_006282275.1_PDT000309175.2/GCA_006282275.1_PDT000309175.2_genomic.fna.gz
     24 /data/ncbi/genomes/all/GCA/006/203/745/GCA_006203745.1_PDT000297560.2/GCA_006203745.1_PDT000297560.2_genomic.fna.gz
     23 /data/ncbi/genomes/all/GCA/902/724/675/GCA_902724675.1_TRF.01.04_HeklaHavn_GL_AD1895/GCA_902724675.1_TRF.01.04_HeklaHavn_GL_AD1895_genomic.fna.gz
     23 /data/ncbi/genomes/all/GCA/900/005/085/GCA_900005085.1_A_England_690_2010/GCA_900005085.1_A_England_690_2010_genomic.fna.gz
     23 /data/ncbi/genomes/all/GCA/010/036/025/GCA_010036025.1_PDT000657209.1/GCA_010036025.1_PDT000657209.1_genomic.fna.gz
     23 /data/ncbi/genomes/all/GCA/008/822/925/GCA_008822925.1_ASM882292v1/GCA_008822925.1_ASM882292v1_genomic.fna.gz
     23 /data/ncbi/genomes/all/GCA/008/721/475/GCA_008721475.1_PDT000592656.1/GCA_008721475.1_PDT000592656.1_genomic.fna.gz
     23 /data/ncbi/genomes/all/GCA/008/441/765/GCA_008441765.1_PDT000313113.2/GCA_008441765.1_PDT000313113.2_genomic.fna.gz
     23 /data/ncbi/genomes/all/GCA/006/766/685/GCA_006766685.1_PDT000332507.1/GCA_006766685.1_PDT000332507.1_genomic.fna.gz
     22 /data/ncbi/genomes/all/GCA/902/719/535/GCA_902719535.1_12083_7_43/GCA_902719535.1_12083_7_43_genomic.fna.gz
     22 /data/ncbi/genomes/all/GCA/902/719/445/GCA_902719445.1_12083_7_27/GCA_902719445.1_12083_7_27_genomic.fna.gz
     22 /data/ncbi/genomes/all/GCA/010/476/085/GCA_010476085.1_PDT000646810.1/GCA_010476085.1_PDT000646810.1_genomic.fna.gz
     22 /data/ncbi/genomes/all/GCA/008/685/755/GCA_008685755.1_ASM868575v1/GCA_008685755.1_ASM868575v1_genomic.fna.gz
     22 /data/ncbi/genomes/all/GCA/006/864/315/GCA_006864315.1_ASM686431v1/GCA_006864315.1_ASM686431v1_genomic.fna.gz
     22 /data/ncbi/genomes/all/GCA/006/426/015/GCA_006426015.1_ASM642601v1/GCA_006426015.1_ASM642601v1_genomic.fna.gz
     22 /data/ncbi/genomes/all/GCA/006/051/595/GCA_006051595.1_PDT000211028.2/GCA_006051595.1_PDT000211028.2_genomic.fna.gz
     22 /data/ncbi/genomes/all/GCA/004/089/355/GCA_004089355.1_ASM408935v1/GCA_004089355.1_ASM408935v1_genomic.fna.gz
     21 /data/ncbi/genomes/all/GCA/008/855/825/GCA_008855825.1_PDT000126325.2/GCA_008855825.1_PDT000126325.2_genomic.fna.gz
     21 /data/ncbi/genomes/all/GCA/002/633/125/GCA_002633125.1_ASM263312v1/GCA_002633125.1_ASM263312v1_genomic.fna.gz
     21 /data/ncbi/genomes/all/GCA/002/536/055/GCA_002536055.1_ASM253605v1/GCA_002536055.1_ASM253605v1_genomic.fna.gz
     21 /data/ncbi/genomes/all/GCA/001/177/465/GCA_001177465.1_6903_5_90/GCA_001177465.1_6903_5_90_genomic.fna.gz
     20 /data/ncbi/genomes/all/GCA/900/197/415/GCA_900197415.1_PEDV_GER_L01059-K07_15-01_2015/GCA_900197415.1_PEDV_GER_L01059-K07_15-01_2015_genomic.fna.gz
     20 /data/ncbi/genomes/all/GCA/900/006/035/GCA_900006035.1_A_England_283_2010/GCA_900006035.1_A_England_283_2010_genomic.fna.gz
     20 /data/ncbi/genomes/all/GCA/008/596/585/GCA_008596585.1_PDT000587798.1/GCA_008596585.1_PDT000587798.1_genomic.fna.gz

To speed up my troubleshooting, I subset the 14,000+ individual signature files that were giving replicated signatures in the sbt db and then ran sourmash index and sourmash sig describe and counted the signatures as above, to see if I could replicate the behavior with a more manageable dataset, but I could not. The 14,000+ signatures that were replicated when I ran sourmash index on the 750,000+ signatures from the genbank assembly database were not replicated when I ran sourmash index on just that subset. So, something about running sourmash index on the full 750,000+ signatures is causing this replication.

Any recommendations? Should I properly script up a test to subset the signatures @ 375,000 signatures etc, to see if there is a threshold number of signatures that causes this behavior?

@luizirber
Copy link
Member

luizirber commented Aug 14, 2020

Yup, seems like we have a huge bug =(

Any recommendations?

Thanks for posting the commands, I'm also counting the number of signatures in the largest SBT I have at the moment. While that's running, some potential debugging ideas...

  • creating some very small signatures and internal nodes (like, one hash per sig, -x 1 for internal nodes). The insertion process doesn't really care about the content of the signatures, or the size of the internal nodes. (of course, we can't debug search or gather, but since we are only checking the SBT structure...)

  • Write a Python script that creates an SBT and insert the signatures, avoiding the CLI/all the other processing. (this one is for figuring out if the problem is in the SBT code, or the 'supporting' code around it)

Should I properly script up a test to subset the signatures @ 375,000 signatures etc, to see if there is a threshold number of signatures that causes this behavior?

That might be the case, but I don't remember any conditions that would trigger after a specific number of sigs is inserted... But as we saw in this issue, my gut feeling is probably wrong =P

UPDATE: yup, also seeing duplicated sigs in the DB I built. sigh.

@luizirber
Copy link
Member

I started https://github.com/luizirber/2020-08-14-debug-sbt for tracking this. Now building a mock SBT with sig names from genbank-bacteria, with 652k genomes.

@nmb85
Copy link

nmb85 commented Aug 14, 2020

Re:

UPDATE: yup, also seeing duplicated sigs in the DB I built. sigh.

and

I started https://github.com/luizirber/2020-08-14-debug-sbt for tracking this. Now building a mock SBT with sig names from genbank-bacteria, with 652k genomes.

Thanks for tackling this. It's probably not your favorite task :/

Re:

creating some very small signatures and internal nodes (like, one hash per sig, -x 1 for internal nodes). The insertion process doesn't really care about the content of the signatures, or the size of the internal nodes. (of course, we can't debug search or gather, but since we are only checking the SBT structure...)

I am generating single-hash signatures for my whole dataset now. Will try running sourmash index probably by this evening and get results tomorrow morning.

Will move future comments to https://github.com/luizirber/2020-08-14-debug-sbt

@luizirber
Copy link
Member

Thanks for tackling this. It's probably not your favorite task :/

More like "not the task I should be doing" 😬
(it is actually quite fun, 🕵️ work!)

I am generating single-hash signatures for my whole dataset now. Will try running sourmash index probably by this evening and get results tomorrow morning.

It is probably easier to generate with the API (like I did in https://github.com/luizirber/2020-08-14-debug-sbt), but the result of that is... no duplicates 👀

So! There is something weird going on with the --from-file parsing, what is being fed to the SBT construction...

@nmb85
Copy link

nmb85 commented Aug 14, 2020

More like "not the task I should be doing" 😬

Thank you!!! 🤩

It is probably easier to generate with the API (like I did in https://github.com/luizirber/2020-08-14-debug-sbt), but the result of that is... no duplicates 👀

Okay, I'll use the Python API to build my SBT, following your example.

So! There is something weird going on with the --from-file parsing, what is being fed to the SBT construction...

Yes, the parsing for --from-file is ... awkward ... besides that: #1066

@ctb

This comment was marked as resolved.

@ctb
Copy link
Contributor Author

ctb commented Aug 14, 2020

please file updates over in #1171!

@nmb85
Copy link

nmb85 commented Aug 14, 2020

Righto 😊

@ctb
Copy link
Contributor Author

ctb commented Aug 14, 2020 via email

@ctb
Copy link
Contributor Author

ctb commented Apr 21, 2021

#1477 could add support for "masking" arbitrary signatures from search and gather.

@ctb
Copy link
Contributor Author

ctb commented Apr 23, 2021

see also #433.

@ctb
Copy link
Contributor Author

ctb commented May 4, 2022

masking of signatures was added at CLI in #1871.

Most of the other things we discuss in here have also been resolved elsewhere 🎉

@ctb ctb closed this as completed May 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants