Refactor genome_metadata dump module #453

Open: wants to merge 14 commits into base: main

@ens-LCampbell (Member) commented Nov 8, 2024

Since we already have a metadata dump module, it made more sense to expand its functionality to support customised metadata dumping.

This update expands the dump module so that users can supply an input meta JSON file (a filter) listing the meta_keys they want. By default no filter is applied, and an internally defined set of meta_key:value pairs is dumped.

Users can also append the database name to the meta JSON output. I included this because it was useful in a Nextflow context, for instance when processing multiple databases.

Example input JSON filter:

{
  "database": {
    "name": "str"
  },
  "assembly": {
    "accession": "str"
  },
  "species": {
    "annotation_source": "str",
    "production_name": "str",
    "scientific_name": "str",
    "taxonomy_id": "int"
  }
}
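The filtering idea can be sketched roughly as follows. This is only an illustration, not the code in this PR: `filter_meta` is a hypothetical helper name, and the sketch assumes the type strings in the filter ("str", "int") are descriptive only, so values are copied through unchanged rather than cast.

```python
import json
from typing import Any


def filter_meta(meta: dict[str, Any], meta_filter: dict[str, Any]) -> dict[str, Any]:
    """Keep only the keys of `meta` that also appear in `meta_filter`, recursively.

    Hypothetical helper for illustration; the filter's leaf values
    (e.g. "str", "int") are ignored here and no type casting is done.
    """
    filtered: dict[str, Any] = {}
    for key, wanted in meta_filter.items():
        if key not in meta:
            continue  # Requested key is absent from the dump: skip it
        value = meta[key]
        if isinstance(wanted, dict) and isinstance(value, dict):
            nested = filter_meta(value, wanted)
            if nested:  # Do not emit empty sections
                filtered[key] = nested
        else:
            filtered[key] = value
    return filtered


# Trimmed-down metadata and filter, based on the examples in this description.
meta = {
    "assembly": {"accession": "GCA_903994105.1", "version": 1},
    "genebuild": {"method": "import"},
    "species": {"scientific_name": "Bemisia tabaci", "taxonomy_id": 7038},
}
meta_filter = {
    "assembly": {"accession": "str"},
    "species": {"scientific_name": "str", "taxonomy_id": "int"},
}
result = filter_meta(meta, meta_filter)
print(json.dumps(result, indent=2, sort_keys=True))
```

In this sketch, sections that are not named in the filter (here `genebuild`) are dropped entirely, and keys that the filter requests but the dump lacks are silently skipped.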

Optional parameters:
--metafilter PATH   JSON file of nested meta_key:meta_value to filter dump output.
--meta_update       Perform assembly and genebuild 'version' metadata checks & update if needed. (default: False)
--append_db         Append core database name to output JSON. (default: False)
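For readers who want to see how three options of this shape behave, here is a minimal stand-in using plain `argparse`. It is an assumption for illustration only: the module's real parser is not shown in this description, and `build_parser` is a hypothetical name. Only the flag names, help text and defaults are taken from the list above.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Plain-argparse stand-in for the three optional parameters (illustrative only)."""
    parser = argparse.ArgumentParser(description="Dump genome metadata to JSON.")
    parser.add_argument(
        "--metafilter",
        type=Path,
        default=None,
        metavar="PATH",
        help="JSON file of nested meta_key:meta_value to filter dump output.",
    )
    parser.add_argument(
        "--meta_update",
        action="store_true",  # Off unless the flag is given (default: False)
        help="Perform assembly and genebuild 'version' metadata checks & update if needed.",
    )
    parser.add_argument(
        "--append_db",
        action="store_true",  # Off unless the flag is given (default: False)
        help="Append core database name to output JSON.",
    )
    return parser


# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(["--metafilter", "input_meta_filter.json", "--append_db"])
print(args.metafilter, args.meta_update, args.append_db)
```

With `store_true` flags, omitting an option leaves it at `False`, which matches the defaults listed above.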

Default meta output (as defined internally, i.e. with no user meta filter applied):

{
  "annotation": {
    "provider_name": "Ensembl",
    "provider_url": "www.ensembl.org"
  },
  "assembly": {
    "accession": "GCA_903994105.1",
    "name": "B.tabaci_ASIAII5_Canu_n227_616Mb",
    "provider_name": "Ensembl Metazoa",
    "provider_url": "www.metazoa.ensembl.org",
    "version": 1
  },
  "database": {
    "name": "nftest_bemisia_tabaci_gca903994105v1_core_110_1"
  },
  "genebuild": {
    "method": "import",
    "method_display": "Import",
    "start_date": "2019-12-ACWP",
    "version": "BtabASIAII5_1.0"
  },
  "species": {
    "annotation_source": "Ensembl",
    "display_name": "Bemisia tabaci (sweet potato whitefly) - GCA_903994105.1 [Ensembl annotation]",
    "division": "EnsemblMetazoa",
    "production_name": "bemisia_tabaci_gca903994105v1",
    "scientific_name": "Bemisia tabaci",
    "strain": "Asia II-5",
    "taxonomy_id": 7038
  }
}

Output with the example filter above applied, and deactivating metadata updating (which is done by default):
--metafilter input_meta_filter.json --append_db --meta_update

{
  "assembly": {
    "accession": "GCA_903994105.1"
  },
  "database": {
    "name": "nftest_bemisia_tabaci_gca903994105v1_core_110_1"
  },
  "species": {
    "annotation_source": "Ensembl",
    "production_name": "bemisia_tabaci_gca903994105v1",
    "scientific_name": "Bemisia tabaci",
    "taxonomy_id": "7038"
  }
}

PR also includes:

  • Refactored and expanded the test_dump module to account for changes to the dumping module.
  • Updated docstrings in the database/meta_getter and database/factory modules.
