[MRG] support local genome collections (including private genomes) #130

Merged
66 commits merged on Jan 17, 2022

ctb commented Jan 6, 2022

This PR enables the use of collections of local genomes with genome-grist. See the documentation for details!

Documentation is being written here, and is usually synced to this branch from hackmd. There is new mkdocs-formatted documentation available on github pages at dib-lab.github.io/genome-grist/.

There are many changes and incompatibilities with previous versions of genome-grist. Read on!

major changes and incompatibilities

  • sourmash databases are now specified as a list in the config, via sourmash_databases:, instead of sourmash_database_glob_pattern:;
  • sample must be changed to samples;
  • database_taxonomy must be changed to taxonomies
  • many top level rules have changed & the CLI has changed significantly - document, CTB!
  • most intermediate outputs need to be regenerated, including any CSV files and downstream handlers - in particular, acc has been changed to ident and ncbi_tax_name has been changed to display_name;
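A minimal sketch of how an existing config could be checked against the key renames listed above. The helper name `find_deprecated_keys` and the message wording are hypothetical and not part of genome-grist; only the key names are taken from this PR description.

```python
# Hypothetical helper, not part of genome-grist: flag config keys that this
# PR renames. Only the key names themselves come from the PR description.
RENAMED_KEYS = {
    "sourmash_database_glob_pattern": "sourmash_databases",
    "sample": "samples",
    "database_taxonomy": "taxonomies",
}


def find_deprecated_keys(config):
    """Return one message per old-style key found in a loaded config dict."""
    messages = []
    for old, new in RENAMED_KEYS.items():
        if old in config:
            messages.append(f"'{old}' is no longer supported; use '{new}' instead")
    return messages


if __name__ == "__main__":
    # Example config dict using one of the old key names.
    old_config = {"sample": ["podar"], "outdir": "outputs.private/"}
    for message in find_deprecated_keys(old_config):
        print(message)
```

Renaming a key is not always enough on its own: `sourmash_databases` takes a list of databases rather than a glob pattern, so the value has to be rewritten by hand as well.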

minor changes that don't require explicit intervention:

  • the first-round mapping directory name has been changed from minimap/ to mapping/;
  • many internal rule names have changed to eliminate genbank;
  • the genbank/ output subdir has been renamed to gather/;
  • the gathertax/ output subdirectory has been removed;

This PR also:

  • removes the process CLI entry point
  • makes the genbank genomes cache configurable, and renames it to genbank_cache
  • adds an intermediate genomes/ output subdirectory that holds the private+genbank genomes and genome info CSVs.

related issues

Fixes #91 - supports custom genomes.
Fixes #13 - enabling private identifiers for genomes.
Fixes #9 - specifying genome list
Fixes #79 - genbank_genomes directory is now configurable
Fixes #132 - adds picklist to the gather step

notes to self and checklists

This means:

  • allowing custom sourmash databases (easy!)
  • supporting non-genbank genomes (...medium hard)
  • supporting custom genome information files (...tricky)
  • supporting custom taxonomies (...straightforward, once the hard stuff above is done)

TODO:

  • integrate the private genome copying in appropriately
  • assert that the old genbank glob stuff is Not Allowed in the config file
  • provide podar sample subset and instructions using podar-ref
  • update the generate info script/split it in two so that (1) info is generated and then (2) info.csv files are produced
  • maybe change acc to identifier, and ncbi_tax_name to name.
  • update genbank databases glob pattern in orig config files
  • check for old database_taxonomy and replace with taxonomies list
  • write tests for private genome collections
  • write tests for picklists
  • write tax test of some kind!
  • write docs about configuring private genome collections
  • add documentation about the updated CLI rule names
  • link docs together/be more coherent
  • update old README and docs; see also [WIP] genome-grist docs on GitHub Pages #97
  • add badges to README
  • revisit - do we even need to distinguish between genbank and private sourmash databases?

cc @jessicalumian


ctb commented Jan 8, 2022

update: it's aliiiiiive!

I can now successfully run the summarize_gather step with the following config file, conf-private.yml:

sample:
- podar
outdir: outputs.private/
metagenome_trim_memory: 0

genbank_databases:
- tests/test-data/SRR5950647.x.gtdb-rs202.matches.zip

private_databases:
- databases/podar-ref.zip

private_databases_info:
- databases/podar-ref.info.csv

taxonomies:
- databases/podar-ref.tax.csv
- ../sourmash/gtdb-rs202.taxonomy.v2.csv

(taxonomies don't yet work, but I believe that will be easy.)


ctb commented Jan 8, 2022

Whew, this is converging on working. And to assuage @taylorreiter there is now a lot less /genbank/ in everything.

Entertainingly, the only part that went super smoothly was the inclusion of private taxonomies. @bluegenes sourmash tax FTW!


ctb commented Jan 8, 2022

🎉 and tests pass!

* change column names

* remove old notebooks

* fix mistake

ctb commented Jan 16, 2022

So it turns out that there's no current need for separate private_databases and genbank_databases in the config file; they're treated the same and just combined 😆 . It's the identifiers that matter - if an identifier is in the private_databases_info file, it overrides the genbank lookup, which is otherwise the default.
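The override rule described here can be sketched roughly as follows. This is an illustration under stated assumptions, not genome-grist's implementation: `resolve_genome_info`, `local_info`, and `genbank_lookup` are hypothetical names, and the identifiers in the usage section are placeholders.

```python
# Sketch only: `local_info` and `genbank_lookup` are hypothetical stand-ins,
# not genome-grist's actual data structures or functions.
def resolve_genome_info(ident, local_info, genbank_lookup):
    """Prefer a local info entry for `ident`; otherwise fall back to GenBank."""
    if ident in local_info:
        return local_info[ident]
    return genbank_lookup(ident)


if __name__ == "__main__":
    # Placeholder identifiers and records, for illustration only.
    local_info = {"private_genome_1": {"source": "local"}}

    def fake_genbank_lookup(ident):
        # Stands in for a real GenBank query.
        return {"source": "genbank"}

    print(resolve_genome_info("private_genome_1", local_info, fake_genbank_lookup))
    print(resolve_genome_info("GCF_000000000.1", local_info, fake_genbank_lookup))
```

Because the decision depends only on the identifier, which database a match came from does not affect the outcome.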

I could leave it as it is, or I could switch things up and do something like sourmash_databases for the databases, and maybe make it identifier_info_file or something (to avoid the "private databases" moniker). Any thoughts either way?

ctb changed the title from "[WIP] support private genome collections" to "[WIP] support local genome collections (including private genomes)" on Jan 16, 2022

ctb commented Jan 16, 2022

update: I switched things around and I think it looks good this way.

(There's now simply sourmash_databases and local_databases_info config options.)
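As a rough sketch, a config written with the interim key names from earlier in this thread could be rewritten to the final names like this. The function `to_final_config` is hypothetical, and it assumes that `local_databases_info` is a list in the same way `private_databases_info` was; check the documentation before relying on that.

```python
# Hypothetical sketch, not part of genome-grist: rewrite the interim key
# names used earlier in this thread into the final names.
INTERIM_KEYS = ("genbank_databases", "private_databases", "private_databases_info")


def to_final_config(interim):
    """Return a new config dict that uses the final key names."""
    # Copy every key that is not being renamed.
    final = {key: value for key, value in interim.items() if key not in INTERIM_KEYS}

    # Both database lists are treated the same, so they are simply combined.
    databases = list(interim.get("genbank_databases", []))
    databases += list(interim.get("private_databases", []))
    if databases:
        final["sourmash_databases"] = databases

    # Assumption: the info files stay a list under the new name.
    if "private_databases_info" in interim:
        final["local_databases_info"] = list(interim["private_databases_info"])
    return final


if __name__ == "__main__":
    interim = {
        "outdir": "outputs.private/",
        "genbank_databases": ["tests/test-data/SRR5950647.x.gtdb-rs202.matches.zip"],
        "private_databases": ["databases/podar-ref.zip"],
        "private_databases_info": ["databases/podar-ref.info.csv"],
    }
    print(to_final_config(interim))
```

Keys that are not renamed by this step pass through unchanged.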


ctb commented Jan 16, 2022

New! Shiny mkdocs docs.

https://dib-lab.github.io/genome-grist/

ctb changed the title from "[WIP] support local genome collections (including private genomes)" to "[MRG] support local genome collections (including private genomes)" on Jan 17, 2022

ctb commented Jan 17, 2022

🎉 merging!
