make .sbt.zip versions of genbank databases #1036
Comments
On farm, in
|
To construct these, I did:
|
The last six in the list above are (as noted) the older databases from 2017.05.09, per the old databases.md. The ones without dates are from 2018.03.29, per current databases.md. |
I think one next step is to produce and/or verify a catalog of their contents from the sbt.json file, maybe using the code in #993. A different question: once we think they're "correct", where shall we host them? |
(incidentally, these databases are nice and small, and fairly fast to search! Resolves issues in #646 quite nicely.) |
The official links in https://sourmash.readthedocs.io/en/latest/databases.html point to S3 buckets. S3 ends up being expensive, but convenient. I think the databases are too big to put directly in OSF or in Zenodo (~90GB for all current DBs), but could someone check that? For now, how about putting them in both S3 and OSF+gdrive, and starting to change the docs to point to the OSF link? |
I think we should supply signature catalogs (generated with |
OK. For both the old and new databases, I generated signature catalogs. For a subset, I have everything - here's a minireport. The old databases are consistent in number of signatures,
but have a number of differences:
diff gazing suggests that most of the differences are in strain content in some way, that is, the accessions are different but the names are the same. The new databases (constructed using
and are also somewhat different --
I wonder if this is related to #994 in some way? Also, interestingly, the old databases have duplicate signature names:
and likewise have duplicate md5s in them,
but the new databases do not have duplicate names or md5s. Now that #1059 and #994 are both merged, I'm going to try building new sbt.zip databases with directly from the old sbt.json files and see what happens. |
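For anyone who wants to repeat the duplicate check, here is a minimal sketch. It assumes the catalog is a plain-text file with one signature name (or one md5) per line; `read_catalog` and `find_duplicates` are hypothetical helper names, not part of sourmash, and the sample names are made up.

```python
from collections import Counter


def read_catalog(path):
    """Read a plain-text catalog: one signature name (or md5) per line."""
    with open(path) as fp:
        return [line.strip() for line in fp if line.strip()]


def find_duplicates(entries):
    """Return {entry: count} for every entry that appears more than once."""
    counts = Counter(entries)
    return {entry: n for entry, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Made-up names for illustration only; a real run would use
    # read_catalog() on a catalog file instead.
    names = ["genome A", "genome B", "genome A", "genome C"]
    for name, n in find_duplicates(names).items():
        print(f"{name!r} appears {n} times")
```

An empty result from `find_duplicates` would correspond to the "no duplicate names or md5s" case seen for the new databases.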
ok, did that --in ~ctbrown/new-db-2,
These all have consistent numbers of signatures --
and of course have the same differences between them as the originals --
and we see the same duplicate names and md5sums --
so I think what happened here is some combination of the following --
in any case, the new .sbt.zip are consistent with the original .sbt.json so I think we're good to go. All of the databases are now in /home/ctbrown/new-db-2, together with manifests constructed by |
next steps: per luiz,
|
I added
|
Just to be clear, what is the difference between
(you're diffing two different
I copied them to S3 and gdrive, under a new |
thank you!
see #1036 (comment) - the genbank-d2 are the more recent ones, and were challenging for me to use on my laptop before the sbt zip files (as opposed to the older genbank ones).
yes, I was thinking that the databases should have the same contents independent of ksize; the .list files are just the signature names... |
cool, looks like contents of genbank and genbank-d2 are the same!
or at least their Jaccard similarities are 1.0 😉 |
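A rough sketch of that kind of comparison, assuming each .list file holds one signature name per line (as described above). `jaccard` and `compare_catalogs` are hypothetical helpers, not sourmash functions. Note that sets collapse duplicates, so a Jaccard similarity of 1.0 only says the two lists contain the same set of names, not that they have the same number of entries.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty catalogs are treated as identical
    return len(a & b) / len(a | b)


def compare_catalogs(names_a, names_b):
    """Summarize how two lists of signature names differ."""
    set_a, set_b = set(names_a), set(names_b)
    return {
        "only_in_a": sorted(set_a - set_b),
        "only_in_b": sorted(set_b - set_a),
        "jaccard": jaccard(set_a, set_b),
    }


if __name__ == "__main__":
    # Made-up names for illustration only.
    old = ["genome A", "genome B", "genome C"]
    new = ["genome C", "genome A", "genome B", "genome B"]
    print(compare_catalogs(old, new))
```

The `only_in_a` and `only_in_b` lists are what you would read through when the similarity is below 1.0.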
I suspected, but didn't check the content of the .list files 🙈
makes sense, I think they are the same SBT but the newer one is v4, the older one is v3. Now they are both v6 =] I'll keep only one of each, and save some storage space. |
Fixed in #1084 |
genbank, genbank-d2, etc. can be easily made into .sbt.zip versions of themselves.
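As a quick sanity check on a converted file, something like the sketch below can list what is inside it. It assumes a .sbt.zip is an ordinary zip archive whose index is stored as a member ending in `.sbt.json`; that layout is my assumption here, and `find_sbt_index` is a hypothetical helper, not a sourmash function.

```python
import zipfile


def find_sbt_index(zip_path):
    """Return (index_members, total_members) for a zip archive.

    index_members lists every member whose name ends in ".sbt.json";
    total_members is the number of entries in the archive.
    """
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    index_members = [name for name in names if name.endswith(".sbt.json")]
    return index_members, len(names)


# Usage (the path is a placeholder):
#   index_members, total = find_sbt_index("some-database.sbt.zip")
#   print(index_members, total)
```

Under that assumption, an archive with no `.sbt.json` member would be worth a closer look before uploading it anywhere.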