Refactor genome_metadata dump module #453
Open
+366
−39
Given that we already have a metadata dump module, I decided it made more sense to expand its functionality to support customised metadata dumping.
This update expands the dump module so that users can supply an input meta JSON (filter) listing the meta_keys they want. By default no filter is applied, and an internally defined meta_key:value set is dumped.
Users can also append the database name to the meta JSON output. I included this because it was useful in a Nextflow context, for instance when processing multiple dbs.
Example input JSON filter:
{
  "database": {
    "name": "str"
  },
  "assembly": {
    "accession": "str"
  },
  "species": {
    "annotation_source": "str",
    "production_name": "str",
    "scientific_name": "str",
    "taxonomy_id": "int"
  }
}
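As a rough illustration of how a nested filter like the one above could be applied, here is a minimal sketch. The helper name `filter_meta` and the sample data are hypothetical and not taken from the PR. This sketch only selects the listed section/key pairs and ignores the "str"/"int" type labels; the module itself may handle them differently.

```python
from typing import Any, Dict


def filter_meta(
    meta: Dict[str, Dict[str, Any]], meta_filter: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Any]]:
    """Keep only the section/key pairs named in meta_filter (hypothetical helper)."""
    filtered: Dict[str, Dict[str, Any]] = {}
    for section, keys in meta_filter.items():
        if section not in meta:
            # Section requested by the filter but absent from the metadata: skip it.
            continue
        kept = {key: meta[section][key] for key in keys if key in meta[section]}
        if kept:
            filtered[section] = kept
    return filtered


# Small sample of unfiltered metadata (subset of the default output in this PR).
meta = {
    "assembly": {"accession": "GCA_903994105.1", "version": 1},
    "species": {
        "scientific_name": "Bemisia tabaci",
        "strain": "Asia II-5",
        "taxonomy_id": 7038,
    },
}

# Filter in the same nested shape as the example above; type labels are not used here.
meta_filter = {
    "assembly": {"accession": "str"},
    "species": {"scientific_name": "str", "taxonomy_id": "int"},
}

result = filter_meta(meta, meta_filter)
print(result)
```

Keys that the filter names but the metadata lacks are silently dropped in this sketch; a real implementation might prefer to warn about them.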
Optional params are as follows:
--metafilter PATH JSON file of nested meta_key:meta_value to filter dump output.
--meta_update Perform assembly and genebuild 'version' metadata checks & update if needed. (default: False)
--append_db Append core database name to output JSON. (default: False)
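For readers who want to see how the optional parameters listed above could be declared, here is a minimal `argparse` sketch. It is an assumption about the shape of the command-line interface, not the module's actual parser: the function name `build_parser` is hypothetical, and only the three flags and their help text come from this description.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser with the three optional flags (hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Dump genome metadata (illustrative sketch)."
    )
    parser.add_argument(
        "--metafilter",
        metavar="PATH",
        default=None,
        help="JSON file of nested meta_key:meta_value to filter dump output.",
    )
    parser.add_argument(
        "--meta_update",
        action="store_true",
        help="Perform assembly and genebuild 'version' metadata checks & update if needed.",
    )
    parser.add_argument(
        "--append_db",
        action="store_true",
        help="Append core database name to output JSON.",
    )
    return parser


# Example invocation: filter supplied, database name appended, no metadata update.
args = build_parser().parse_args(
    ["--metafilter", "input_meta_filter.json", "--append_db"]
)
print(args.metafilter, args.meta_update, args.append_db)
```

Using `store_true` gives both boolean flags a default of False, which matches the defaults stated in the option list.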
Default meta output (as defined internally, i.e. with no user meta filter applied):
{
  "annotation": {
    "provider_name": "Ensembl",
    "provider_url": "www.ensembl.org"
  },
  "assembly": {
    "accession": "GCA_903994105.1",
    "name": "B.tabaci_ASIAII5_Canu_n227_616Mb",
    "provider_name": "Ensembl Metazoa",
    "provider_url": "www.metazoa.ensembl.org",
    "version": 1
  },
  "database": {
    "name": "nftest_bemisia_tabaci_gca903994105v1_core_110_1"
  },
  "genebuild": {
    "method": "import",
    "method_display": "Import",
    "start_date": "2019-12-ACWP",
    "version": "BtabASIAII5_1.0"
  },
  "species": {
    "annotation_source": "Ensembl",
    "display_name": "Bemisia tabaci (sweet potato whitefly) - GCA_903994105.1 [Ensembl annotation]",
    "division": "EnsemblMetazoa",
    "production_name": "bemisia_tabaci_gca903994105v1",
    "scientific_name": "Bemisia tabaci",
    "strain": "Asia II-5",
    "taxonomy_id": 7038
  }
}
Output with the example filter above applied, and with metadata updating deactivated (it is done by default):
--metafilter input_meta_filter.json --append_db --meta_update
{
  "assembly": {
    "accession": "GCA_903994105.1"
  },
  "database": {
    "name": "nftest_bemisia_tabaci_gca903994105v1_core_110_1"
  },
  "species": {
    "annotation_source": "Ensembl",
    "production_name": "bemisia_tabaci_gca903994105v1",
    "scientific_name": "Bemisia tabaci",
    "taxonomy_id": "7038"
  }
}
PR also includes: