Using genome grist with custom genomes #91

smb20200615 · 2021-06-18T11:44:30Z

Hello,

I was wondering if there is a way to use genome-grist with custom genomes. I think that the genbank_genomes directory and the info.csv are not automatically generated.

Thank you!

ctb · 2021-06-18T12:49:40Z

Hi, sorry, not yet :). Definitely something we want to support, though!

ctb · 2022-01-12T14:24:53Z

This is being tackled in #130, which is now working!! This issue will be closed when that's merged, and I'll cut a new release soon after.

ctb · 2022-01-15T19:23:44Z

Here are my notes from the design phase. They are not necessarily very coherent but might be useful even after #130 is merged and this issue is closed -

supporting private genomes in genome-grist

need to provide a way to specify list of private genomes corresponding to database.
...should we build database from private genomes automatically? probably not, too big :(.
...what do we do when private genomes are missing? good error messages? check up front?
...do we formally separate private genomes from genbank downloadable? or integrate retrieval sections of Snakefile for generalized purposes later?
...maybe provide a database type? genbank, private, ...? allows for img later.

getting started:

gather for podar, podar-ref works
summarize_gather - needs info.csv?
summarize_tax - needs

TODO:

document filename patterns for adding own samples: sigs/{sample}.abundtrim.sig

maybe:

change genbank_genomes to something else - configurable?
support multiple genomes dir? caching? project-specific directories?
support multiple taxonomy files
support multiple search databases
change database glob pattern to databases list
remove genbank from filenames, rule names, directory names...

notes to self

genome info.csv files

the genbank_genomes/{acc}.info.csv file contains

acc
genome_url
assembly_report_url
ncbi_tax_name

of these, only acc and ncbi_tax_name are used by the downstream notebooks - acc is split off to get genome_id which joins to gather & mapping result display, while ncbi_tax_name is used in names_df in the notebooks to display names.

So, we could just provide those two things for private databases.

More - these files end up getting concatenated into genbank/{sample}.genomes.info.csv with all fields. I think we can restrict it to just the two fields in that file.

So then we just want the files to be genomes/{acc}.info.csv.
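As an illustration of the reduced file described above, here is a minimal Python sketch that writes a `genomes/{acc}.info.csv` holding only the two fields the notebooks use. The column names `acc` and `ncbi_tax_name` are taken from these notes; the helper name `write_info_csv` and the example values are hypothetical, and the format that ends up in a release may differ.

```python
import csv
from pathlib import Path


def write_info_csv(genomes_dir, acc, tax_name):
    """Write {genomes_dir}/{acc}.info.csv with just acc and ncbi_tax_name."""
    genomes_dir = Path(genomes_dir)
    genomes_dir.mkdir(parents=True, exist_ok=True)
    out_path = genomes_dir / f"{acc}.info.csv"
    with open(out_path, "w", newline="") as fp:
        # Only the two columns the downstream notebooks are said to need.
        writer = csv.DictWriter(fp, fieldnames=["acc", "ncbi_tax_name"])
        writer.writeheader()
        writer.writerow({"acc": acc, "ncbi_tax_name": tax_name})
    return out_path
```

Something like `write_info_csv("genomes", "myGenome1", "Escherichia coli (private isolate)")` would then produce `genomes/myGenome1.info.csv`.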

gather (and prefetch) output

the name in the gather CSV output and match_name in the prefetch CSV output are split to generate {acc}, which then gets normalized to make {genome_id}, which is used to connect with other data frames, including names.
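The split-then-normalize step above could look roughly like the following sketch. Both rules here are assumptions made for illustration, not the actual genome-grist logic: the accession is taken to be the first whitespace-delimited token of the name, and normalization is taken to mean dropping a trailing numeric version suffix. The function names are hypothetical.

```python
def name_to_acc(name):
    """Split a gather `name` / prefetch `match_name` down to its accession."""
    # Assumption: the accession is everything before the first space.
    return name.split(" ", 1)[0]


def acc_to_genome_id(acc):
    """Normalize an accession into the key used to join data frames."""
    # Assumption: normalization removes a trailing ".N" version suffix.
    head, sep, tail = acc.rpartition(".")
    if sep and tail.isdigit():
        return head
    return acc
```

Whatever the real rules are, private accessions need to survive both steps unchanged or at least predictably, since the resulting `{genome_id}` is the join key.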

mapping output

mapping is summarized using the mapping filename, which includes {acc}. This is automatically dumped into the resulting summary.csv file, which is joined with the relevant dataframes in the notebooks. So I think we're good there.

database config options

planning on going with the second option below, for now; it's simpler.

note to self: don't do namespaces for accessions, just order things so that private DBs override public, in order.

Q: do we do symbolic linking of genomes into common namespace under project? that seems sensible.

then genbank_genomes is treated like a cache, as is the private genome directory. and... we have a special rule that links the genomes into the outputs directory?
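A rough Python sketch of the "order things so private overrides public, then symlink into a common directory" idea. It assumes that files are matched by file name and that directories listed later take priority; the function name `link_genomes` and the directory layout are hypothetical, not what genome-grist does.

```python
from pathlib import Path


def link_genomes(source_dirs, project_genomes_dir):
    """Symlink files from source_dirs into one directory.

    Directories later in source_dirs win on file name clashes, so passing
    [genbank_cache, private_dir] lets private genomes override public ones.
    """
    chosen = {}
    for src in source_dirs:
        for path in sorted(Path(src).iterdir()):
            if path.is_file():
                chosen[path.name] = path.resolve()  # later dirs overwrite

    dest = Path(project_genomes_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name, target in chosen.items():
        link = dest / name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace stale links from an earlier run
        link.symlink_to(target)
    return chosen
```

With this layout the source directories act purely as caches, and everything downstream only ever looks in the project's own genomes directory.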

# option 1:

databases:
    name1:
       - type: genbank
       - filenames:
          - /path/one
          - /path/two
       - taxonomy: /path/three
    name2:
       - type: private
       - filenames:
          - /path/four
       - taxonomy: /path/five

# option 2:

genbank_databases:
- /path/one
- /path/two

private_databases:
- /path/three
- /path/four

private_database_info:
- /path/seven.csv
- /path/eight.csv

taxonomies:
- /path/five
- /path/six
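To go with the earlier "check up front?" question, here is a minimal sketch of validating an option 2 style config before any rules run. It assumes the config has already been loaded into a plain dict with the four keys shown in option 2; the function name `find_missing_paths` is hypothetical.

```python
from pathlib import Path

# Keys from "option 2" above; each is expected to hold a list of file paths.
PATH_LIST_KEYS = (
    "genbank_databases",
    "private_databases",
    "private_database_info",
    "taxonomies",
)


def find_missing_paths(config):
    """Return (key, path) pairs for configured files that do not exist."""
    missing = []
    for key in PATH_LIST_KEYS:
        for path in config.get(key) or []:  # tolerate absent or empty keys
            if not Path(path).exists():
                missing.append((key, path))
    return missing
```

Reporting every missing file together with the config key it came from should give a clearer error than a failure partway through the workflow.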

splitting the rules - genbank vs private

so we kinda want the genbank genomes and their info CSVs to be downloaded via snakemake - parallelization!

we don't really need that to happen for the private database info, I guess, but it might not hurt.

So how do we do that?

we could generate a list of targets dynamically, based on a checkpoint? will that even work?

The problem is that we want them all to end up in a single place, under {outdir}/genomes/, with the same naming scheme. ...or do we? An alternative would be to use a checkpoint class to generate two sets of inputs... hmm. like in make_combined_info_csv.

so, wait -

have a rule that downloads all of the genbank_genomes
have another rule that makes a list of all of the locations of all of the genbank files plus all of the private files. then run cp on that.

OR:

have a checkpoint rule that downloads all of the genbank genomes
have another rule that depends on that, and generates a list of all genome files from both private and genbank
then run a copy on that

then the error message will be appropriate ("file does not exist", etc. etc.)

so...

two checkpoints
one that makes just the genbank genomes paths
another that makes just the private genome paths

documentation

need:

private genomes organized in a certain way, with info.csv files
- can be generated using scripts, etc.
taxonomy organized as per sourmash taxonomy input

limitations:

private "accessions" must be unique, have no underscores
private sample names must have no periods in them
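The limitations above are easy to check mechanically. The following is a small sketch, with a hypothetical function name and hypothetical error messages, that reports every violation rather than stopping at the first one.

```python
def check_names(accessions, sample_names):
    """Return a list of messages for names that break the stated limitations."""
    errors = []
    seen = set()
    for acc in accessions:
        if acc in seen:
            errors.append(f"duplicate private accession: {acc}")
        seen.add(acc)
        if "_" in acc:
            errors.append(f"private accession contains an underscore: {acc}")
    for sample in sample_names:
        if "." in sample:
            errors.append(f"sample name contains a period: {sample}")
    return errors
```

An empty list means the names satisfy all three constraints.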

ctb · 2022-01-17T16:22:01Z

@smb20200615 this functionality is now released in genome-grist v0.8. Let me know if you end up using it; any feedback is most welcome!

ctb mentioned this issue Jan 6, 2022

[MRG] support local genome collections (including private genomes) #130

Merged

20 tasks

ctb closed this as completed in #130 Jan 17, 2022
