make .sbt.zip versions of genbank databases #1036

Closed
ctb opened this issue Jun 20, 2020 · 16 comments

Comments

@ctb
Contributor

ctb commented Jun 20, 2020

genbank, genbank-d2, etc. can be easily made into .sbt.zip versions of themselves.

@ctb
Contributor Author

ctb commented Jun 22, 2020

On farm, in ~ctbrown/new-dbs/ -

-r--r--r-- 1 ctbrown ctbrown 3.8G Jun 21 17:36 genbank-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.8G Jun 22 07:14 genbank-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jun 21 20:29 genbank-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.2G Jun 21 19:27 refseq-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.2G Jun 22 07:56 refseq-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jun 21 22:15 refseq-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.8G Jun 21 18:21 microbe-genbank-sbt-k21-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.8G Jun 21 12:31 microbe-genbank-sbt-k31-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jun 21 21:13 microbe-genbank-sbt-k51-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.2G Jun 21 18:56 microbe-refseq-sbt-k21-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.2G Jun 21 12:06 microbe-refseq-sbt-k31-2017.05.09.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jun 21 21:44 microbe-refseq-sbt-k51-2017.05.09.sbt.zip

@ctb
Contributor Author

ctb commented Jun 22, 2020

To construct these, I did:

sourmash index -k KSIZE --traverse-directory NAME.sbt.zip old/.sbt.NAME/ -f
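
A minimal sketch of how the same command line could be generated per k-size from Python. The flags and argument order mirror the command above; the helper name `build_index_cmd`, the example database names, and the assumption that NAME includes the k-size are illustrative only. Nothing is executed here; the list would need to be passed to something like `subprocess.run` to actually run it.

```python
def build_index_cmd(name, ksize, old_dir=None):
    """Return the argv list for rebuilding one database as NAME.sbt.zip.

    Mirrors the command quoted above. The old/.sbt.NAME/ layout is taken
    from that command and is an assumption on any other system.
    """
    if old_dir is None:
        old_dir = f"old/.sbt.{name}/"
    return [
        "sourmash", "index",
        "-k", str(ksize),
        "--traverse-directory",
        f"{name}.sbt.zip",
        old_dir,
        "-f",
    ]


# Hypothetical names; k-sizes 21, 31 and 51 match the file listing above.
for k in (21, 31, 51):
    print(" ".join(build_index_cmd(f"refseq-d2-k{k}", k)))
```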

@ctb
Contributor Author

ctb commented Jun 22, 2020

The last six in the list above are (as noted) the older databases from 2017.05.09, per the old databases.md.

The ones without dates are from 2018.03.29, per current databases.md.

@ctb
Contributor Author

ctb commented Jun 22, 2020

I think one next step is to produce and/or verify a catalog of their contents from the sbt.json file, maybe using the code here #993

A different question is, once we think they're "correct", where shall we host them?

@ctb
Contributor Author

ctb commented Jun 22, 2020

(incidentally, these databases are nice and small, and fairly fast to search! Resolves issues in #646 quite nicely.)

@luizirber
Member

I think one next step is to produce and/or verify a catalog of their contents from the sbt.json file, maybe using the code here #993

A different question is, once we think they're "correct", where shall we host them?

The official links in https://sourmash.readthedocs.io/en/latest/databases.html point to S3 buckets.
They are also available in https://osf.io/t3fqa/ (hosted in google drive).

The first one ends up being expensive, but convenient.
The second one is free, but wonkier (it might go away if something happens with google drive or the linking with OSF).

I think they are too big to put directly in OSF or in Zenodo (~90GB for all current DBs), but could someone check that info?

For now, I think we should put them in both S3 and OSF+gdrive, and start changing the docs to point to the OSF link?

@ctb
Contributor Author

ctb commented Jul 4, 2020

I think we should supply signature catalogs (generated with sig describe) along with each file, and (for LCA db) maybe even lineage spreadsheets (per #1080).

@ctb
Contributor Author

ctb commented Jul 4, 2020

OK. For both the old and new databases, I generated signature catalogs.

For a subset, I have everything - here's a minireport.

The old databases are consistent in number of signatures,

   91298  1040307 10417610 refseq-d2-k21.sbt.json.describe.txt.list
   91298  1040975 10433600 refseq-d2-k31.sbt.json.describe.txt.list
   91298  1040848 10420714 refseq-d2-k51.sbt.json.describe.txt.list

but have a number of differences:

% diff refseq-d2-k21.sbt.json.describe.txt.list refseq-d2-k31.sbt.json.describe.txt.list | diffstat
 unknown |14410 ++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 7205 insertions(+), 7205 deletions(-)

diff gazing suggests that most of the differences are in strain content in some way, that is, the accessions are different but the names are the same.

The new databases (constructed using sourmash index <name> <.sbt.json directory> --traverse-dir -f) vary quite a bit in size, and contain many fewer signatures than the original databases:

   84254   955692  9564653 refseq-d2-k21.sbt.zip.describe.txt.list
   85927   976171  9780698 refseq-d2-k31.sbt.zip.describe.txt.list
   87617   997050  9992048 refseq-d2-k51.sbt.zip.describe.txt.list

and are also somewhat different --

diff refseq-d2-k21.sbt.zip.describe.txt.list refseq-d2-k31.sbt.zip.describe.txt.list | diffstat
 unknown | 5707 +++++++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 3690 insertions(+), 2017 deletions(-)

I wonder if this is related to #994 in some way?


Also, interestingly, the old databases have duplicate signature names:

% uniq -d refseq-d2-k21.sbt.json.describe.txt.list | wc -l
2855

and likewise have duplicate md5s in them,

% grep ^md5: refseq-d2-k21.sbt.json.describe.txt | sort | uniq -d | wc -l
2855

but the new databases do not have duplicate names or md5s.
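
For reference, a rough Python equivalent of the `grep ^md5: ... | sort | uniq -d | wc -l` check above. It assumes only what the shell commands already rely on, namely that the describe output has lines starting with `signature:` and `md5:`. The function names and the sample text are made up for illustration.

```python
from collections import Counter


def lines_with_prefix(text, prefix):
    """Return the lines of `text` that start with `prefix` (like grep ^prefix)."""
    return [line for line in text.splitlines() if line.startswith(prefix)]


def count_duplicated(values):
    """Count distinct values seen more than once (like sort | uniq -d | wc -l)."""
    return sum(1 for n in Counter(values).values() if n > 1)


# Made-up sample in the "key: value" layout the grep commands above expect.
sample = """\
signature: genome A
md5: aaa
signature: genome B
md5: bbb
signature: genome A
md5: aaa
"""

print(count_duplicated(lines_with_prefix(sample, "md5:")))        # prints 1
print(count_duplicated(lines_with_prefix(sample, "signature:")))  # prints 1
```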


Now that #1059 and #994 are both merged, I'm going to try building new sbt.zip databases directly from the old sbt.json files and see what happens.

@ctb
Contributor Author

ctb commented Jul 5, 2020

ok, did that -- in ~ctbrown/new-db-2,

% ls -lh *.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 16:14 genbank-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 16:46 genbank-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 17:19 genbank-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 17:55 genbank-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 18:30 genbank-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 19:08 genbank-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 19:37 refseq-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 20:05 refseq-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.4G Jul  4 20:34 refseq-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 21:07 refseq-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 21:37 refseq-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.4G Jul  4 22:08 refseq-k51.sbt.zip

These all have consistent numbers of signatures --

% for i in *.describe.txt; do grep ^signature: $i | sort > $i.list; done
% wc -l *.list
    98139 genbank-d2-k21.sbt.zip.describe.txt.list
    98139 genbank-d2-k31.sbt.zip.describe.txt.list
    98139 genbank-d2-k51.sbt.zip.describe.txt.list
    98139 genbank-k21.sbt.zip.describe.txt.list
    98139 genbank-k31.sbt.zip.describe.txt.list
    98139 genbank-k51.sbt.zip.describe.txt.list
    91298 refseq-d2-k21.sbt.zip.describe.txt.list
    91298 refseq-d2-k31.sbt.zip.describe.txt.list
    91298 refseq-d2-k51.sbt.zip.describe.txt.list
    91298 refseq-k21.sbt.zip.describe.txt.list
    91298 refseq-k31.sbt.zip.describe.txt.list
    91298 refseq-k51.sbt.zip.describe.txt.list

and of course have the same differences between them as the originals --

% diff refseq-d2-k21.sbt.zip.describe.txt.list refseq-d2-k31.sbt.zip.describe.txt.list | diffstat
 unknown |14410 ++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 7205 insertions(+), 7205 deletions(-)

and we see the same duplicate names and md5sums --

uniq -d refseq-d2-k21.sbt.zip.describe.txt.list | wc -l
2855
% grep ^md5: refseq-d2-k21.sbt.zip.describe.txt | sort | uniq -d | wc -l
2855

so I think what happened here is some combination of the following --

  • the original database construction did something funky with either indexing multiple genomes more than once, or (quite likely) collapsing identical md5sum genomes to a single name.
  • the first time I rebuilt the databases, these duplicate signatures got lost; but once #994 (Deal with duplicated MD5 in storages) was merged, we could faithfully represent these using modern SBTs.

in any case, the new .sbt.zip are consistent with the original .sbt.json so I think we're good to go.
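
One way to make "consistent" concrete is to compare the two catalogs as multisets, so that the duplicated names count too. This is only a sketch: `compare_catalogs` and the toy lists are hypothetical, and in practice the inputs would be the lines of the corresponding .list files.

```python
from collections import Counter


def compare_catalogs(old_names, new_names):
    """Compare two name lists as multisets, so duplicates are counted too.

    Returns (missing, extra): names that occur more often in the old list
    than in the new one, and vice versa.
    """
    old, new = Counter(old_names), Counter(new_names)
    missing = old - new   # Counter subtraction keeps only positive counts
    extra = new - old
    return missing, extra


old = ["sig A", "sig A", "sig B"]
lossy = ["sig A", "sig B"]              # a rebuild that collapsed a duplicate
faithful = ["sig A", "sig B", "sig A"]  # same entries, different order

missing, extra = compare_catalogs(old, lossy)
print(sum(missing.values()), sum(extra.values()))   # prints: 1 0
print(compare_catalogs(old, faithful))              # prints: (Counter(), Counter())
```

Two empty counters mean the catalogs hold the same names the same number of times, regardless of order.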

All of the databases are now in /home/ctbrown/new-db-2, together with manifests constructed by sourmash sig describe.

@ctb
Contributor Author

ctb commented Jul 5, 2020

next steps: per luiz,

I think they are too big to put directly in OSF or in Zenodo (~90GB for all current DBs), but could someone check that info?

For now, I think we should put them in both S3 and OSF+gdrive, and start changing the docs to point to the OSF link?

@ctb
Contributor Author

ctb commented Jul 6, 2020

I added

/home/ctbrown/new-db-2/almeida-mags-k31.sbt.zip
/home/ctbrown/new-db-2/nayfach-k31.sbt.zip
/home/ctbrown/new-db-2/pasolli-mags-k31.sbt.zip

cc @taylorreiter

@luizirber
Member

% ls -lh *.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 16:14 genbank-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 16:46 genbank-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 17:19 genbank-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 17:55 genbank-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 18:30 genbank-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.9G Jul  4 19:08 genbank-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 19:37 refseq-d2-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 20:05 refseq-d2-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.4G Jul  4 20:34 refseq-d2-k51.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 21:07 refseq-k21.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.3G Jul  4 21:37 refseq-k31.sbt.zip
-r--r--r-- 1 ctbrown ctbrown 3.4G Jul  4 22:08 refseq-k51.sbt.zip

Just to be clear, what is the difference between genbank-d2-k51.sbt.zip and genbank-k51.sbt.zip? Each pair has very similar sizes, are they calculated from the same source?

% diff refseq-d2-k21.sbt.zip.describe.txt.list refseq-d2-k31.sbt.zip.describe.txt.list | diffstat
 unknown |14410 ++++++++++++++++++++++++++++++++--------------------------------
 1 file changed, 7205 insertions(+), 7205 deletions(-)

(you're diffing two different ksizes here, is that on purpose?)

All of the databases are now in /home/ctbrown/new-db-2, together with manifests constructed by sourmash sig describe.

I copied them to S3 and gdrive, under a new zip directory.
https://osf.io/t3fqa/

@ctb
Contributor Author

ctb commented Jul 7, 2020

I copied them to S3 and gdrive, under a new zip directory.
https://osf.io/t3fqa/

thank you!

Just to be clear, what is the difference between genbank-d2-k51.sbt.zip and genbank-k51.sbt.zip? Each pair has very similar sizes, are they calculated from the same source?

see #1036 (comment) - the genbank-d2 are the more recent ones, and were challenging for me to use on my laptop before the sbt zip files (as opposed to the older genbank ones).

(you're diffing two different ksizes here, is that on purpose?)

yes, I was thinking that the databases should have the same contents independent of ksize; the .list files are just the signature names...

@ctb
Contributor Author

ctb commented Jul 7, 2020

cool, looks like contents of genbank and genbank-d2 are the same!

>>> x = [ x.strip() for x in open('genbank-d2-k31.sbt.zip.describe.txt.list', 'rt') ]
>>> list1 = [ x.strip() for x in open('genbank-d2-k31.sbt.zip.describe.txt.list', 'rt') ]
>>> list2 = [ x.strip() for x in open('genbank-k31.sbt.zip.describe.txt.list', 'rt') ]
>>> list1 = set(list1)
>>> list2 = set(list2)
>>> len(list1.intersection(list2))
93249
>>> len(list1.union(list2))
93249
>>> list1 = [ x.strip() for x in open('genbank-d2-k51.sbt.zip.describe.txt.list', 'rt') ]
>>> list2 = [ x.strip() for x in open('genbank-k51.sbt.zip.describe.txt.list', 'rt') ]
>>> list1 = set(list1)
>>> list2 = set(list2)
>>> len(list1.union(list2))
94976
>>> len(list1.intersection(list2))
94976

or at least their Jaccard similarities are 1.0 😉
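
The intersection/union check in the session above can be wrapped in a small helper. This is a sketch with illustrative names (`jaccard`, `read_names`); returning 1.0 for two empty inputs is a convention picked here, not something from the thread.

```python
def jaccard(a, b):
    """Jaccard similarity of two iterables, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def read_names(path):
    """Read one stripped name per line from a .list file."""
    with open(path, "rt") as fp:
        return [line.strip() for line in fp]


print(jaccard(["x", "y", "y"], ["y", "x"]))  # prints 1.0
print(jaccard(["x", "y"], ["y", "z"]))       # one shared name out of three
```

Because sets drop repeated names, a similarity of 1.0 shows that both lists contain the same distinct names; it does not by itself show that duplicated entries occur the same number of times.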

@luizirber
Member

(you're diffing two different ksizes here, is that on purpose?)

yes, I was thinking that the databases should have the same contents independent of ksize; the .list files are just the signature names...

I suspected, but didn't check the content of the .list files 🙈

cool, looks like contents of genbank and genbank-d2 are the same!

makes sense, I think they are the same SBT but the newer one is v4, the older one is v3. Now they are both v6 =]

I'll keep only one of each, and save some storage space.

@luizirber
Member

Fixed in #1084

2 participants