longest run of taxonomically confused hashes per taxonomy #4

Open
ctb opened this issue Nov 22, 2022 · 2 comments

@ctb
Owner

ctb commented Nov 22, 2022

I haven't figured out what to call this, but the table below is an incomplete answer to the question:

what’s the largest collection of hashes present in a single genome that leaves you in doubt as to what taxonomic unit it comes from, per given taxon?

For example, from the table below:

  • there is a set of 364 hashes at scaled=10,000 (so, roughly 3.64 Mbp!?) that belongs to 'GCA_001894475.1 Escherichia coli strain=687, ASM189447v1'.
  • these 364 hashes belong to both this genome and other genome(s).
  • based on just these hashes in combination, within the GTDB taxonomy you cannot distinguish anything more than that these hashes are from within d__Bacteria.
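
The size estimate in the first bullet can be sketched as simple arithmetic. This assumes that, at scaled=10,000, each retained hash stands in for roughly 10,000 bp of sequence on average; the function name `approx_bp_covered` is made up for illustration and is not part of any library.

```python
def approx_bp_covered(n_hashes: int, scaled: int) -> int:
    """Rough amount of sequence (in bp) represented by n_hashes.

    Assumption: at a given scaled value, about 1 in `scaled` k-mers is
    kept, so each hash represents roughly `scaled` bp on average.
    """
    return n_hashes * scaled


# The two rows discussed in this comment.
ecoli_bp = approx_bp_covered(364, 10_000)
woese_bp = approx_bp_covered(54, 10_000)
print(f"{ecoli_bp / 1e6:.2f} Mbp")  # 364 hashes
print(f"{woese_bp / 1e3:.0f} kbp")  # 54 hashes
```

This is only an order-of-magnitude estimate, since the hashes need not come from one contiguous stretch of the genome.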

I actually can't figure out which partner is in a different class than E. coli, so let me use a different row to illustrate the partner aspect:

  • there is a set of 54 hashes at scaled=10,000 (roughly 540 kbp!) that belongs to 'GCA_018658425.1 Candidatus Woesearchaeota archaeon, ASM1865842v1', an archaeon.
  • all 54 hashes belong both to d__Archaea and d__Bacteria.
  • for example, 14 of these hashes belong to 'GCA_018663345.1 bacterium, ASM1866334v1'
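
To make the "cannot distinguish anything more than" idea concrete, here is a minimal sketch that takes the lineages of all genomes sharing a set of hashes and returns the deepest lineage they agree on. It is plain Python written for illustration, not the code behind the table; the function names are hypothetical, and the Archaea/Bacteria lineages in the second call are truncated to the domain rank.

```python
from typing import Iterable, Tuple


def parse_lineage(lin: str) -> Tuple[str, ...]:
    """Split a GTDB-style lineage string into its ranks."""
    return tuple(part for part in lin.split(";") if part)


def lowest_common_lineage(lineages: Iterable[str]) -> str:
    """Return the longest rank prefix shared by every lineage."""
    parsed = [parse_lineage(lin) for lin in lineages]
    if not parsed:
        return ""
    common = []
    # Walk the ranks in parallel and stop at the first disagreement.
    for ranks in zip(*parsed):
        if len(set(ranks)) != 1:
            break
        common.append(ranks[0])
    return ";".join(common)


ecoli = ("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;"
         "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
         "s__Escherichia coli")
flavo = ("d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;"
         "f__Flavobacteriaceae;g__Flavobacterium;"
         "s__Flavobacterium odoratimimum")

print(lowest_common_lineage([ecoli, flavo]))                  # d__Bacteria
print(repr(lowest_common_lineage(["d__Archaea", "d__Bacteria"])))  # ''
```

When the sharing genomes sit in different domains, as with the Woesearchaeota row, no rank is shared at all and the common lineage comes back empty.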

In this case I'd guess it's contamination, but some of the others in the table below might not be.

Anyway, enjoy!

| | overlap | lin | name |
|---|---|---|---|
| 0 | 364 | d__Bacteria | GCA_001894475.1 Escherichia coli strain=687, ASM189447v1 |
| 2 | 261 | d__Bacteria;p__Proteobacteria | GCF_005503355.1 Sphingomonas sp. 1F27F7B strain=1F27F7B, ASM550335v1 |
| 1 | 251 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria | GCF_003669905.1 Pseudomonas aeruginosa strain=Pa1810, ASM366990v1 |
| 4 | 159 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales | GCF_018929655.1 Vibrio cholerae O1 strain=11_Lusaka_2018, ASM1892965v1 |
| 7 | 119 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales | GCA_003472185.1 Parabacteroides merdae strain=AM14-15, ASM347218v1 |
| 14 | 103 | d__Bacteria;p__Firmicutes_A;c__Clostridia | GCA_900553485.1 uncultured Clostridium sp., UMGS1619 |
| 26 | 88 | d__Bacteria;p__Marinisomatota;c__Marinisomatia;o__Marinisomatales | GCA_018698165.1 Candidatus Marinimicrobia bacterium, ASM1869816v1 |
| 16 | 77 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia | GCF_005889725.1 Nonomuraea zeae strain=DSM 100528, ASM588972v1 |
| 11 | 70 | d__Bacteria;p__Firmicutes;c__Bacilli | GCF_905311015.1 Bacillus subtilis, NRS6094 |
| 15 | 70 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales | GCA_900766145.1 uncultured Oscillospiraceae bacterium, SRS295027_34 |
| 9 | 65 | d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria | GCA_014359905.1 Hoeflea sp., ASM1435990v1 |
| 46 | 62 | d__Archaea;p__Halobacteriota;c__Halobacteria;o__Halobacteriales | GCA_005954745.1 Halostella pelagica strain=DL-M4, ASM595474v1 |
| 3 | 54 | | GCA_018658425.1 Candidatus Woesearchaeota archaeon, ASM1865842v1 |
| 6 | 49 | d__Bacteria;p__Bacteroidota;c__Bacteroidia | GCA_002256395.1 Bacteroidetes bacterium B1(2017), ASM225639v1 |
| 19 | 38 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Xanthomonadales | GCF_001314305.1 Stenotrophomonas acidaminiphila strain=ZAC14D2_NAIMI4_2, ASM131430v1 |
| 8 | 37 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Burkholderiales | GCA_903833455.1 uncultured proteobacterium, freshwater MAG --- MJ120716B_bin-425 |
| 5 | 34 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales | GCA_002389265.1 Gammaproteobacteria bacterium UBA4475, ASM238926v1 |
| 24 | 34 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales | GCF_008764375.1 Bacillus safensis strain=DE0105, FS22 |
| 56 | 31 | d__Bacteria;p__Desulfobacterota | GCA_009993185.1 Deltaproteobacteria bacterium, ASM999318v1 |
| 40 | 31 | d__Bacteria;p__Bacteroidota | GCA_903878245.1 uncultured Bacteroidales bacterium, freshwater MAG --- Ja1_bin-1678 |
| 10 | 31 | d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales | GCF_018129525.1 Bradyrhizobium denitrificans strain=SZCCT0094, ASM1812952v1 |
| 25 | 29 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Actinomycetales | GCA_012927515.1 Cellulomonas sp., ASM1292751v1 |
| 81 | 26 | d__Bacteria;p__Verrucomicrobiota;c__Kiritimatiellae;o__RFP12 | GCA_017509565.1 Kiritimatiellae bacterium, ASM1750956v1 |
| 17 | 24 | d__Bacteria;p__Actinobacteriota | GCF_015560095.1 Bifidobacterium adolescentis strain=1001270J_160509_E8, ASM1556009v1 |
| 58 | 21 | d__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhodospirillales | GCA_018654935.1 Rhodospirillales bacterium, ASM1865493v1 |
| 29 | 21 | d__Bacteria;p__Verrucomicrobiota | GCA_903961625.1 uncultured Victivallales bacterium, freshwater MAG --- Loc090907-8-6m_bin-024 |
| 39 | 20 | d__Bacteria;p__Acidobacteriota;c__Acidobacteriae | GCA_003224475.1 Acidobacteria bacterium, ASM322447v1 |
| 82 | 19 | d__Bacteria;p__Chloroflexota;c__Dehalococcoidia | GCA_002720365.1 Chloroflexi bacterium, ASM272036v1 |
| 96 | 19 | d__Bacteria;p__Proteobacteria;c__Magnetococcia;o__Magnetococcales | GCA_015231925.1 Magnetococcales bacterium, ASM1523192v1 |
| 20 | 18 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Mycobacteriales | GCA_902805565.1 uncultured Corynebacteriales bacterium, AVDCRST-MAG41 |
| 36 | 18 | d__Bacteria;p__Actinobacteriota;c__Coriobacteriia;o__Coriobacteriales | GCA_900548495.1 uncultured Collinsella sp., UMGS1095 |
| 64 | 17 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Acetivibrionales | GCF_000015865.1 Hungateiclostridium thermocellum ATCC 27405 strain=ATCC 27405, ASM1586v1 |
| 28 | 17 | d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiae | GCA_018667255.1 Opitutae bacterium, ASM1866725v1 |
| 86 | 16 | d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiae;o__Pedosphaerales | GCA_016235585.1 Verrucomicrobia bacterium, ASM1623558v1 |
| 94 | 15 | d__Bacteria;p__Methylomirabilota;c__Methylomirabilia | GCA_016187735.1 candidate division NC10 bacterium, ASM1618773v1 |
| 87 | 14 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Paenibacillales | GCF_013337105.1 Paenibacillus sp. JMULE4 strain=JMULE4, ASM1333710v1 |
| 21 | 14 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Erysipelotrichales | GCA_900555595.1 uncultured Solobacterium sp., UMGS1844 |
| 83 | 14 | d__Bacteria;p__Verrucomicrobiota;c__Verrucomicrobiae;o__Opitutales | GCA_018667255.1 Opitutae bacterium, ASM1866725v1 |
| 97 | 13 | d__Bacteria;p__Cyanobacteria;c__Cyanobacteriia;o__Cyanobacteriales | GCA_004294125.1 Oscillatoriales cyanobacterium, ASM429412v1 |
| 23 | 12 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Christensenellales | GCA_017394825.1 Clostridia bacterium, ASM1739482v1 |
| 32 | 12 | d__Bacteria;p__Actinobacteriota;c__Actinomycetia;o__Streptomycetales | GCF_015356865.1 Catenulispora pinisilvae strain=NH11, ASM1535686v1 |
| 101 | 12 | d__Bacteria;p__Nitrospirota;c__Thermodesulfovibrionia;o__UBA6902 | GCA_011040095.1 Nitrospirae bacterium, ASM1104009v1 |
| 13 | 11 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales | GCF_018195745.1 Lactococcus lactis subsp. lactis strain=LEY6, ASM1819574v1 |
| 35 | 11 | d__Bacteria;p__Acidobacteriota | GCA_002328195.1 Acidobacteria bacterium UBA2167, ASM232819v1 |

@ctb
Owner Author

ctb commented Nov 22, 2022

Maybe: "longest hash chain" at that taxon?

@ctb
Owner Author

ctb commented Nov 22, 2022

OK, I've got a better way to break down the longest hash chain for specific taxa.

Per the table above, for d__Bacteria the longest hash chain is 364 hashes.

This hash chain lies entirely within GCA_001894475 (d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli), which shares it across 165 partners. The top 10 partners and their overlap are broken down below.

| | partner_ident | partner_lin | n_hashes |
|---|---|---|---|
| 0 | GCF_001481655 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__Flavobacteriaceae;g__Flavobacterium;s__Flavobacterium odoratimimum | 46 |
| 1 | GCF_012102505 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Vagococcaceae;g__Vagococcus;s__Vagococcus fluvialis | 20 |
| 2 | GCF_003039915 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus cohnii | 15 |
| 3 | GCF_009020275 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides uniformis | 9 |
| 4 | GCA_900758605 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Bacteroides;s__Bacteroides sp900552405 | 9 |
| 5 | GCF_013009155 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus suis_W | 8 |
| 6 | GCF_003311455 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus | 7 |
| 7 | GCF_007293315 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales_H;f__Bacillaceae_D;g__Alkalihalobacillus_A;s__Alkalihalobacillus_A sp007293315 | 7 |
| 8 | GCF_001865835 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Flavobacteriales;f__Flavobacteriaceae;g__Flavobacterium;s__Flavobacterium odoratimimum | 6 |
| 9 | GCF_009648365 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus epidermidis | 6 |
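
A partner breakdown like the one above can be sketched as a simple count: for each hash in the chain, look up which other genomes contain it and tally per genome. This is an illustrative sketch rather than the code that produced the table; `partner_breakdown`, the `hash_to_idents` mapping, and the toy hashes and identifiers are all made up.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def partner_breakdown(
    chain_hashes: Iterable[int],
    hash_to_idents: Dict[int, Set[str]],
    owner_ident: str,
    top_n: int = 10,
) -> List[Tuple[str, int]]:
    """Count how many chain hashes each partner genome shares with the owner."""
    counts: Counter = Counter()
    for h in chain_hashes:
        for ident in hash_to_idents.get(h, ()):
            if ident != owner_ident:  # the owner is not its own partner
                counts[ident] += 1
    return counts.most_common(top_n)


# Toy data: four hashes, one owner genome, two partner genomes.
toy_index = {
    1: {"owner", "partner_A"},
    2: {"owner", "partner_A", "partner_B"},
    3: {"owner", "partner_A"},
    4: {"owner"},
}
print(partner_breakdown([1, 2, 3, 4], toy_index, "owner"))
# [('partner_A', 3), ('partner_B', 1)]
```

Note that the per-partner counts can sum to more than the chain length, since one hash may be shared with several partners at once.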
