Refactor genome_metadata dump module #453

ens-LCampbell · 2024-11-08T17:37:28Z

Given that we already have a metadata dump module, I decided it made more sense to expand its functionality to support customised metadata dumping.

This update expands the dump module, allowing users to supply an input meta JSON (filter) listing the desired meta_keys. By default no filter is applied, and an internally defined set of meta_key:value pairs is dumped.

Users also have the option to append the database name to the meta JSON output. I included this because it was useful in a Nextflow context, for instance when processing multiple dbs.

Example of an input JSON filter:
{ "database": { "name": "str" }, "assembly": { "accession": "str" }, "species": { "annotation_source": "str", "production_name": "str", "scientific_name": "str", "taxonomy_id": "int" } }
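To illustrate how a nested filter like this can select keys from a larger meta dictionary, here is a minimal Python sketch. It is not the code in this PR: `filter_meta` is a hypothetical helper, and it assumes the filter values ("str", "int") are type hints that do not affect which keys are kept, so they are ignored here.

```python
import json
from typing import Any


def filter_meta(meta: dict[str, Any], meta_filter: dict[str, Any]) -> dict[str, Any]:
    """Keep only the sections and keys of `meta` that are named in `meta_filter`.

    Hypothetical helper for illustration; the filter's values are ignored.
    """
    filtered: dict[str, Any] = {}
    for section, keys in meta_filter.items():
        if section not in meta:
            # Section requested by the filter but absent from the dumped meta data.
            continue
        wanted = {key: meta[section][key] for key in keys if key in meta[section]}
        if wanted:
            filtered[section] = wanted
    return filtered


# Small subset of the default meta output, used as sample input.
meta = {
    "assembly": {"accession": "GCA_903994105.1", "version": 1},
    "species": {"scientific_name": "Bemisia tabaci", "taxonomy_id": 7038},
}
meta_filter = {"assembly": {"accession": "str"}, "species": {"taxonomy_id": "int"}}

print(json.dumps(filter_meta(meta, meta_filter), sort_keys=True))
```

In this sketch, sections or keys named in the filter but missing from the meta data are silently skipped; the real module may well handle that case differently.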

Optional params are as follows:
--metafilter PATH JSON file of nested meta_key:meta_value to filter dump output.
--meta_update Perform assembly and genebuild 'version' metadata checks & update if needed. (default: False)
--append_db Append core database name to output JSON. (default: False)
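For reference, the three optional parameters above could be declared roughly as follows. This is only a sketch using the standard library `argparse`; the module's actual parser, its required database arguments, and the helper name `build_parser` are not shown in this description and are assumptions here.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser covering only the three optional params listed above."""
    parser = argparse.ArgumentParser(description="Dump genome meta data to JSON.")
    parser.add_argument(
        "--metafilter",
        type=Path,
        default=None,
        help="JSON file of nested meta_key:meta_value to filter dump output.",
    )
    parser.add_argument(
        "--meta_update",
        action="store_true",
        help="Perform assembly and genebuild 'version' metadata checks & update if needed.",
    )
    parser.add_argument(
        "--append_db",
        action="store_true",
        help="Append core database name to output JSON.",
    )
    return parser


# Both flags default to False and are switched on only when passed explicitly.
args = build_parser().parse_args(["--metafilter", "input_meta_filter.json", "--append_db"])
print(args.metafilter, args.meta_update, args.append_db)
```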

Default meta output (as defined internally i.e. no user meta filter applied):

{ "annotation": { "provider_name": "Ensembl", "provider_url": "www.ensembl.org" }, "assembly": { "accession": "GCA_903994105.1", "name": "B.tabaci_ASIAII5_Canu_n227_616Mb", "provider_name": "Ensembl Metazoa", "provider_url": "www.metazoa.ensembl.org", "version": 1 }, "database": { "name": "nftest_bemisia_tabaci_gca903994105v1_core_110_1" }, "genebuild": { "method": "import", "method_display": "Import", "start_date": "2019-12-ACWP", "version": "BtabASIAII5_1.0" }, "species": { "annotation_source": "Ensembl", "display_name": "Bemisia tabaci (sweet potato whitefly) - GCA_903994105.1 [Ensembl annotation]", "division": "EnsemblMetazoa", "production_name": "bemisia_tabaci_gca903994105v1", "scientific_name": "Bemisia tabaci", "strain": "Asia II-5", "taxonomy_id": 7038 } }

Output when using the example filter above, and deactivating meta data updating (which is done by default):
--metafilter input_meta_filter.json --append_db --meta_update

{ "assembly": { "accession": "GCA_903994105.1" }, "database": { "name": "nftest_bemisia_tabaci_gca903994105v1_core_110_1" }, "species": { "annotation_source": "Ensembl", "production_name": "bemisia_tabaci_gca903994105v1", "scientific_name": "Bemisia tabaci", "taxonomy_id": "7038" } }

PR also includes:

Refactored and expanded test_dump module to account for changes to dumping module.
Update of doc string in: database/meta_getter & database/factory modules.

ens-LCampbell added 3 commits November 8, 2024 17:17

Update doc string and main call

c700e45

Refactor dump, parse_args + add functionality

d76dd13

Add database name to default meta_data

da5f9e6

ens-LCampbell requested a review from JAlvarezJarreta November 8, 2024 17:37

ens-LCampbell self-assigned this Nov 8, 2024

ens-LCampbell requested review from vsitnik and Dishalodha November 8, 2024 17:39

ens-LCampbell added 2 commits November 8, 2024 17:45

Mypy Fix func type hint

7a05032

Fix pylint

ed14d5b

JAlvarezJarreta requested a review from shradhaebi November 11, 2024 09:07

ens-LCampbell added 9 commits November 11, 2024 18:46

Polish and refactor/update dump tests

399a06f

Update vars in test

3ab16b9

Update --meta_update description

4162068

Make pylint happy

10ca796

Make black happy

3c9ac32

black

3fbf7ad

Update factory main() docstring

01de7fe

black on database/factory

660a428

Update argparse help info

4bd2206
