make .sbt.zip versions of genbank databases #1036

ctb · 2020-06-20T17:50:41Z

genbank, genbank-d2, etc. can be easily made into .sbt.zip versions of themselves.

ctb · 2020-06-22T13:03:08Z

On farm, in ~ctbrown/new-dbs/ -

-r--r--r-- 1 ctbrown ctbrown 3.8G Jun 21 17:36 genbank-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.8G Jun 22 07:14 genbank-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jun 21 20:29 genbank-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.2G Jun 21 19:27 refseq-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.2G Jun 22 07:56 refseq-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jun 21 22:15 refseq-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.8G Jun 21 18:21 microbe-genbank-sbt-k21-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.8G Jun 21 12:31 microbe-genbank-sbt-k31-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jun 21 21:13 microbe-genbank-sbt-k51-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.2G Jun 21 18:56 microbe-refseq-sbt-k21-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.2G Jun 21 12:06 microbe-refseq-sbt-k31-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jun 21 21:44 microbe-refseq-sbt-k51-2017.05.09.sbt.zip

ctb · 2020-06-22T15:05:26Z

To construct these, I did:

sourmash index -k KSIZE --traverse-directory NAME.sbt.zip old/.sbt.NAME/ -f
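
For anyone wanting to script this over several databases, here is a minimal sketch that only assembles the command line quoted above for each NAME/KSIZE pair. The `index_command` helper is hypothetical (not part of sourmash), and the `old/.sbt.NAME/` layout is simply copied from the command template; nothing is executed.

```python
import shlex


def index_command(name, ksize):
    """Build the argument list for the `sourmash index` call quoted above.

    `name` stands in for NAME in the template (e.g. "genbank-d2-k21");
    the old/.sbt.NAME/ directory layout is assumed from that template.
    """
    return [
        "sourmash", "index",
        "-k", str(ksize),
        "--traverse-directory",
        f"{name}.sbt.zip",
        f"old/.sbt.{name}/",
        "-f",
    ]


# Print one command per ksize for a single database family.
for ksize in (21, 31, 51):
    print(shlex.join(index_command(f"genbank-d2-k{ksize}", ksize)))
```

Each printed line can be reviewed before being pasted into a shell or handed to `subprocess.run`.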

ctb · 2020-06-22T15:08:01Z

The last six in the list above are (as noted) the older databases from 2017.05.09, per the old databases.md.

The ones without dates are from 2018.03.29, per current databases.md.

ctb · 2020-06-22T15:10:05Z

I think one next step is to produce and/or verify a catalog of their contents from the sbt.json file, maybe using the code here #993

A different question is, once we think they're "correct", where shall we host them?

ctb · 2020-06-22T15:26:20Z

Incidentally, these databases are nice and small, and fairly fast to search! This resolves the issues in #646 quite nicely.

luizirber · 2020-06-27T18:22:32Z

> I think one next step is to produce and/or verify a catalog of their contents from the sbt.json file, maybe using the code here #993
>
> A different question is, once we think they're "correct", where shall we host them?

The official links in https://sourmash.readthedocs.io/en/latest/databases.html point to S3 buckets.
They are also available in https://osf.io/t3fqa/ (hosted in google drive).

The first one ends up being expensive, but convenient.
The second one is free, but wonkier (it might go away if something happens with google drive or the linking with OSF).

I think they are too big to put directly in OSF or in Zenodo (~90GB for all current DBs), but could someone check that info?

For now, I think putting in both S3 and OSF+gdrive, and start changing docs to point to OSF link?

ctb · 2020-07-04T14:43:16Z

I think we should supply signature catalogs (generated with sig describe) along with each file, and (for LCA db) maybe even lineage spreadsheets (per #1080).

ctb · 2020-07-04T16:03:22Z

OK. For both the old and new databases, I generated signature catalogs.

For a subset, I have everything - here's a minireport.

The old databases are consistent in number of signatures,

   91298  1040307 10417610 refseq-d2-k21.sbt.json.describe.txt.list
   91298  1040975 10433600 refseq-d2-k31.sbt.json.describe.txt.list
   91298  1040848 10420714 refseq-d2-k51.sbt.json.describe.txt.list

but have a number of differences:

% diff refseq-d2-k21.sbt.json.describe.txt.list refseq-d2-k31.sbt.json.describe.txt.list | diffstat
 unknown |14410 ++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 7205 insertions(+), 7205 deletions(-)

diff gazing suggests that most of the differences are in strain content in some way, that is, the accessions are different but the names are the same.

The new databases (constructed using sourmash index <name> <.sbt.json directory> --traverse-dir -f) have quite varying size, and contain many fewer signatures than the original databases:

   84254   955692  9564653 refseq-d2-k21.sbt.zip.describe.txt.list
   85927   976171  9780698 refseq-d2-k31.sbt.zip.describe.txt.list
   87617   997050  9992048 refseq-d2-k51.sbt.zip.describe.txt.list

and are also somewhat different --

% diff refseq-d2-k21.sbt.zip.describe.txt.list refseq-d2-k31.sbt.zip.describe.txt.list | diffstat
 unknown | 5707 +++++++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 3690 insertions(+), 2017 deletions(-)

I wonder if this is related to #994 in some way?

Also, interestingly, the old databases have duplicate signature names:

% uniq -d refseq-d2-k21.sbt.json.describe.txt.list | wc -l
2855

and likewise have duplicate md5s in them,

% grep ^md5: refseq-d2-k21.sbt.json.describe.txt | sort | uniq -d | wc -l
2855

but the new databases do not have duplicate names or md5s.
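
The same duplicate count can be reproduced without the shell. This is a rough Python equivalent of `sort | uniq -d | wc -l`, counting how many distinct lines occur more than once; `count_duplicated` is a hypothetical helper and the sample lines are made up for illustration.

```python
from collections import Counter


def count_duplicated(lines):
    """Count distinct lines that appear more than once.

    This mirrors `sort | uniq -d | wc -l`: each duplicated value is
    counted once, however many copies of it there are.
    """
    counts = Counter(line.rstrip("\n") for line in lines)
    return sum(1 for n in counts.values() if n > 1)


# Made-up md5 lines: "aaa" appears three times, "bbb" twice, "ccc" once.
example = ["md5: aaa", "md5: bbb", "md5: aaa", "md5: ccc", "md5: bbb", "md5: aaa"]
print(count_duplicated(example))  # 2
```

On a real catalog, the input would be the `md5:` or `signature:` lines pulled out of a describe.txt file.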

Now that #1059 and #994 are both merged, I'm going to try building new sbt.zip databases directly from the old sbt.json files and see what happens.

ctb · 2020-07-05T13:27:04Z

OK, did that; the results are in ~ctbrown/new-db-2:

% ls -lh *.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 16:14 genbank-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 16:46 genbank-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 17:19 genbank-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 17:55 genbank-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 18:30 genbank-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 19:08 genbank-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 19:37 refseq-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 20:05 refseq-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.4G Jul  4 20:34 refseq-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 21:07 refseq-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 21:37 refseq-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.4G Jul  4 22:08 refseq-k51.sbt.zip

These all have consistent numbers of signatures --

% for i in *.describe.txt; do grep ^signature: $i | sort > $i.list; done
% wc -l *.list
    98139 genbank-d2-k21.sbt.zip.describe.txt.list
    98139 genbank-d2-k31.sbt.zip.describe.txt.list
    98139 genbank-d2-k51.sbt.zip.describe.txt.list
    98139 genbank-k21.sbt.zip.describe.txt.list
    98139 genbank-k31.sbt.zip.describe.txt.list
    98139 genbank-k51.sbt.zip.describe.txt.list
    91298 refseq-d2-k21.sbt.zip.describe.txt.list
    91298 refseq-d2-k31.sbt.zip.describe.txt.list
    91298 refseq-d2-k51.sbt.zip.describe.txt.list
    91298 refseq-k21.sbt.zip.describe.txt.list
    91298 refseq-k31.sbt.zip.describe.txt.list
    91298 refseq-k51.sbt.zip.describe.txt.list

and of course have the same differences between them as the originals --

% diff refseq-d2-k21.sbt.zip.describe.txt.list refseq-d2-k31.sbt.zip.describe.txt.list | diffstat
 unknown |14410 ++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 7205 insertions(+), 7205 deletions(-)

and we see the same duplicate names and md5sums --

% uniq -d refseq-d2-k21.sbt.zip.describe.txt.list | wc -l
2855
% grep ^md5: refseq-d2-k21.sbt.zip.describe.txt | sort | uniq -d | wc -l
2855

so I think what happened here is some combination of the following --

the original database construction did something funky with either indexing multiple genomes more than once, or (quite likely) collapsing identical md5sum genomes to a single name.
the first time I rebuilt the databases, these duplicate signatures got lost; but once #994 ("Deal with duplicated MD5 in storages") was merged, we could faithfully represent these using modern SBTs.

in any case, the new .sbt.zip are consistent with the original .sbt.json so I think we're good to go.

All of the databases are now in /home/ctbrown/new-db-2, together with manifests constructed by sourmash sig describe.
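
For checking the manifests elsewhere, here is a small sketch of the `grep ^signature: | sort` step used above to build the .list files. It assumes only what the grep commands in this thread imply, namely that the describe output has lines starting with `signature:`. The `signature_lines` helper and the sample text are hypothetical, and Python's `sorted` may order lines differently from a locale-aware `sort`.

```python
def signature_lines(describe_text):
    """Return the sorted lines of describe output that start with 'signature:'."""
    return sorted(
        line for line in describe_text.splitlines()
        if line.startswith("signature:")
    )


# Made-up stand-in for the contents of a *.describe.txt file.
sample = """\
signature: genome B
md5: 222
signature: genome A
md5: 111
"""

names = signature_lines(sample)
print(len(names))
print(names)
```

Comparing `len(signature_lines(...))` across the k=21, k=31 and k=51 catalogs gives the same consistency check as `wc -l *.list`.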

ctb · 2020-07-05T13:27:55Z

next steps: per luiz,

> I think they are too big to put directly in OSF or in Zenodo (~90GB for all current DBs), but could someone check that info?
>
> For now, I think putting in both S3 and OSF+gdrive, and start changing docs to point to OSF link?

ctb · 2020-07-06T18:35:46Z

I added

/home/ctbrown/new-db-2/almeida-mags-k31.sbt.zip
/home/ctbrown/new-db-2/nayfach-k31.sbt.zip
/home/ctbrown/new-db-2/pasolli-mags-k31.sbt.zip

cc @taylorreiter

luizirber · 2020-07-07T01:28:55Z

> % ls -lh *.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 16:14 genbank-d2-k21.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 16:46 genbank-d2-k31.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 17:19 genbank-d2-k51.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 17:55 genbank-k21.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 18:30 genbank-k31.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 19:08 genbank-k51.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 19:37 refseq-d2-k21.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 20:05 refseq-d2-k31.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.4G Jul  4 20:34 refseq-d2-k51.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 21:07 refseq-k21.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 21:37 refseq-k31.sbt.zip
> -r--r--r-- 1 ctbrown ctbrown 3.4G Jul  4 22:08 refseq-k51.sbt.zip

Just to be clear, what is the difference between genbank-d2-k51.sbt.zip and genbank-k51.sbt.zip? Each pair has very similar sizes, are they calculated from the same source?

> % diff refseq-d2-k21.sbt.zip.describe.txt.list refseq-d2-k31.sbt.zip.describe.txt.list | diffstat
>  unknown |14410 ++++++++++++++++++++++++++++++++--------------------------------
>  1 file changed, 7205 insertions(+), 7205 deletions(-)

(you're diffing two different ksizes here, is that on purpose?)

> All of the databases are now in /home/ctbrown/new-db-2, together with manifests constructed by sourmash sig describe.

I copied them to S3 and gdrive, under a new zip directory.
https://osf.io/t3fqa/

ctb · 2020-07-07T02:52:28Z

> I copied them to S3 and gdrive, under a new zip directory.
> https://osf.io/t3fqa/

thank you!

> Just to be clear, what is the difference between genbank-d2-k51.sbt.zip and genbank-k51.sbt.zip? Each pair has very similar sizes, are they calculated from the same source?

see #1036 (comment) - the genbank-d2 are the more recent ones, and were challenging for me to use on my laptop before the sbt zip files (as opposed to the older genbank ones).

> (you're diffing two different ksizes here, is that on purpose?)

yes, I was thinking that the databases should have the same contents independent of ksize; the .list files are just the signature names...

ctb · 2020-07-07T13:27:25Z

cool, looks like contents of genbank and genbank-d2 are the same!

>>> x = [ x.strip() for x in open('genbank-d2-k31.sbt.zip.describe.txt.list', 'rt') ]
>>> list1 = [ x.strip() for x in open('genbank-d2-k31.sbt.zip.describe.txt.list', 'rt') ]
>>> list2 = [ x.strip() for x in open('genbank-k31.sbt.zip.describe.txt.list', 'rt') ]
>>> list1 = set(list1)
>>> list2 = set(list2)
>>> len(list1.intersection(list2))
93249
>>> len(list1.union(list2))
93249
>>> list1 = [ x.strip() for x in open('genbank-d2-k51.sbt.zip.describe.txt.list', 'rt') ]
>>> list2 = [ x.strip() for x in open('genbank-k51.sbt.zip.describe.txt.list', 'rt') ]
>>> list1 = set(list1)
>>> list2 = set(list2)
>>> len(list1.union(list2))
94976
>>> len(list1.intersection(list2))
94976

or at least their Jaccard similarities are 1.0 😉
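
The union/intersection check above can be wrapped up as a small function. This is a sketch of plain Jaccard similarity on sets of signature names, not anything sourmash-specific; `jaccard` is a hypothetical helper and the sample names are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty collections are treated as identical here (a convention).
        return 1.0
    return len(a & b) / len(union)


# Duplicates and ordering do not matter once the lists become sets.
list1 = ["sig A", "sig B", "sig C", "sig A"]
list2 = ["sig C", "sig B", "sig A"]
print(jaccard(list1, list2))                 # 1.0
print(jaccard(["sig A"], ["sig A", "sig B"]))  # 0.5
```

Because the lists are converted to sets first, duplicate names collapse, which is why the set sizes in the session above are smaller than the `wc -l` counts.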

luizirber · 2020-07-07T16:31:11Z

> (you're diffing two different ksizes here, is that on purpose?)
>
> yes, I was thinking that the databases should have the same contents independent of ksize; the .list files are just the signature names...

I suspected, but didn't check the content of the .list files 🙈

> cool, looks like contents of genbank and genbank-d2 are the same!

makes sense, I think they are the same SBT but the newer one is v4, the older one is v3. Now they are both v6 =]

I'll keep only one of each, and save some storage space.

luizirber · 2020-07-17T15:43:36Z

Fixed in #1084

luizirber mentioned this issue Jul 7, 2020

[MRG] Update DB links in docs #1084

Merged


luizirber closed this as completed Jul 17, 2020
