
Using genome grist with custom genomes #91

Closed
smb20200615 opened this issue Jun 18, 2021 · 4 comments · Fixed by #130

Comments

@smb20200615

Hello,

I was wondering if there is a way to use genome-grist with custom genomes. I think that the genbank_genomes directory and the info.csv are not automatically generated.

Thank you!

@ctb
Member

ctb commented Jun 18, 2021 via email

@ctb
Member

ctb commented Jan 12, 2022

This is being tackled in #130, which is now working!! This issue will be closed when that's merged, and I'll cut a new release soon after.

@ctb
Member

ctb commented Jan 15, 2022

Here are my notes from the design phase. They are not necessarily very coherent but might be useful even after #130 is merged and this issue is closed -

supporting private genomes in genome-grist

  • need to provide a way to specify list of private genomes corresponding to database.
  • ...should we build database from private genomes automatically? probably not, too big :(.
  • ...what do we do when private genomes are missing? good error messages? check up front?
  • ...do we formally separate private genomes from genbank downloadable? or integrate retrieval sections of Snakefile for generalized purposes later?
  • ...maybe provide a database type? genbank, private, ...? allows for img later.

getting started:

  • gather for podar, podar-ref works
  • summarize_gather - needs info.csv?
  • summarize_tax - needs

TODO:

  • document filename patterns for adding own samples: sigs/{sample}.abundtrim.sig
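As a rough illustration of that filename pattern, here is a small Python sketch that recovers sample names from a directory of signature files. Only the `sigs/{sample}.abundtrim.sig` pattern comes from the notes; the helper name `find_samples` is hypothetical and not part of genome-grist.

```python
import glob
import os

SIG_SUFFIX = ".abundtrim.sig"  # pattern from the notes: sigs/{sample}.abundtrim.sig


def find_samples(sigs_dir="sigs"):
    """Return sorted sample names for files matching sigs_dir/{sample}.abundtrim.sig.

    Hypothetical helper for illustration only.
    """
    pattern = os.path.join(sigs_dir, "*" + SIG_SUFFIX)
    names = []
    for path in glob.glob(pattern):
        filename = os.path.basename(path)
        # strip the fixed suffix to recover {sample}
        names.append(filename[: -len(SIG_SUFFIX)])
    return sorted(names)
```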

maybe:

  • change genbank_genomes to something else - configurable?
  • support multiple genomes dir? caching? project-specific directories?
  • support multiple taxonomy files
  • support multiple search databases
  • change database glob pattern to databases list
  • remove genbank from filenames, rule names, directory names...

notes to self

genome info.csv files

the genbank_genomes/{acc}.info.csv file contains

  • acc
  • genome_url
  • assembly_report_url
  • ncbi_tax_name

of these, only acc and ncbi_tax_name are used by the downstream notebooks - acc is split off to get genome_id which joins to gather & mapping result display, while ncbi_tax_name is used in names_df in the notebooks to display names.

So, we could just provide those two things for private databases.

More - these files end up getting concatenated into genbank/{sample}.genomes.info.csv with all fields. I think we can restrict it to just the two fields in that file.

So then we just want the files to be genomes/{acc}.info.csv.
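A minimal sketch of generating such a two-field file for a private genome. The column names `acc` and `ncbi_tax_name` are the ones listed above; the function name `write_info_csv` and its arguments are hypothetical, not part of genome-grist.

```python
import csv
import os


def write_info_csv(genomes_dir, acc, tax_name):
    """Write genomes_dir/{acc}.info.csv containing only acc and ncbi_tax_name.

    Hypothetical helper; the two column names come from the notes above.
    """
    os.makedirs(genomes_dir, exist_ok=True)
    path = os.path.join(genomes_dir, f"{acc}.info.csv")
    with open(path, "w", newline="") as fp:
        writer = csv.DictWriter(fp, fieldnames=["acc", "ncbi_tax_name"])
        writer.writeheader()
        writer.writerow({"acc": acc, "ncbi_tax_name": tax_name})
    return path
```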

gather (and prefetch) output

the name in the gather CSV output and match_name in the prefetch CSV output are split to generate {acc}, which then gets normalized to make {genome_id}, which is used to connect with other data frames, including names.
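The split-then-normalize step could look roughly like the sketch below. Both rules are assumptions made for illustration: that {acc} is the first whitespace-delimited token of the name, and that normalizing means dropping a trailing numeric version suffix. Neither is confirmed here as what genome-grist actually does, and both function names are hypothetical.

```python
def name_to_acc(name):
    """Assumed rule: {acc} is the first whitespace-delimited token of the name."""
    return name.split()[0]


def acc_to_genome_id(acc):
    """Assumed normalization: drop a trailing '.N' version suffix, if present."""
    base, dot, version = acc.rpartition(".")
    if dot and version.isdigit():
        return base
    return acc
```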

mapping output

mapping is summarized using the mapping filename, which includes {acc}. This is automatically dumped into the resulting summary.csv file, which is joined with the relevant dataframes in the notebooks. So I think we're good there.

database config options

planning on going with the second option (option 2 below), for now; simpler.

note to self: don't do namespaces for accessions, just order things so that private DBs override public, in order.
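One way to read "order things so that private DBs override public" is as a plain dictionary merge in config order. The sketch below assumes each database contributes a mapping from accession to genome path, and that later private entries also win over earlier private ones; the function name and data shapes are hypothetical.

```python
def resolve_genomes(genbank_paths, private_paths_list):
    """Map accession -> genome path, with private entries overriding public ones.

    genbank_paths: dict of acc -> path for public genomes.
    private_paths_list: list of dicts (acc -> path), in config order.
    Assumption: no namespaces, later entries simply replace earlier ones.
    """
    resolved = dict(genbank_paths)
    for private_paths in private_paths_list:
        resolved.update(private_paths)
    return resolved
```

With this ordering, a private genome that reuses a public accession silently replaces the public one, which is the trade-off of skipping namespaces.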

Q: do we do symbolic linking of genomes into common namespace under project? that seems sensible.

then genbank_genomes is treated like a cache, as is private genome directory. and... we have a special rule that links the genomes in to outputs directory?
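A rough sketch of the linking idea in plain Python rather than as a Snakemake rule. The destination naming (`{acc}` plus a fixed suffix) and the function name `link_genomes` are assumptions for illustration; the point is that a missing source file produces a direct "does not exist" error before any link is made.

```python
import os


def link_genomes(acc_to_path, genomes_dir, suffix="_genomic.fna.gz"):
    """Symlink each genome into genomes_dir under a common naming scheme.

    acc_to_path: dict of acc -> source file (cache dir or private dir).
    suffix: assumed destination filename suffix; hypothetical.
    """
    os.makedirs(genomes_dir, exist_ok=True)
    linked = []
    for acc, src in sorted(acc_to_path.items()):
        if not os.path.exists(src):
            raise FileNotFoundError(f"genome file for '{acc}' does not exist: {src}")
        dest = os.path.join(genomes_dir, acc + suffix)
        if os.path.lexists(dest):
            os.remove(dest)  # replace a stale link from an earlier run
        os.symlink(os.path.abspath(src), dest)
        linked.append(dest)
    return linked
```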

# option 1:

databases:
    name1:
       - type: genbank
       - filenames:
          - /path/one
          - /path/two
       - taxonomy: /path/three
    name2:
       - type: private
       - filenames:
          - /path/four
       - taxonomy: /path/five

# option 2:

genbank_databases:
- /path/one
- /path/two

private_databases:
- /path/three
- /path/four

private_database_info:
- /path/seven.csv
- /path/eight.csv

taxonomies:
- /path/five
- /path/six

splitting the rules - genbank vs private

so we kinda want the genbank genomes and their info CSVs to be downloaded via snakemake - parallelization!

we don't really need that to happen for the private database info, I guess, but it might not hurt.

So how do we do that?

  • we could generate a list of targets dynamically, based on a checkpoint? will that even work?

The problem is that we want them all to end up in a single place, under {outdir}/genomes/, with the same naming scheme. ...or do we? An alternative would be to use a checkpoint class to generate two sets of inputs... hmm. like in make_combined_info_csv.

so, wait -

  • have a rule that downloads all of the genbank_genomes
  • have another rule that makes a list of all of the locations of all of the genbank files plus all of the private files. then run cp on that.

OR:

  • have a checkpoint rule that downloads all of the genbank genomes
  • have another rule that depends on that, and generates a list of all genome files from both private and genbank
  • then run a copy on that

then the error message will be appropriate ("file does not exist", etc. etc.)

so...

  • two checkpoints
  • one that makes just the genbank genomes paths
  • another that makes just the private genome paths

documentation

need:

  • private genomes organized in a certain way, with info.csv files
    • can be generated using scripts, etc.
  • taxonomy organized as per sourmash taxonomy input

limitations:

  • private "accessions" must be unique, have no underscores
  • private sample names must have no periods in them
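These limitations are easy to check up front. The sketch below is a hypothetical validator, not genome-grist code; it only encodes the rules listed above (unique accessions, no underscores in accessions, no periods in sample names).

```python
def check_private_names(accs, sample_names):
    """Return a list of problems found; an empty list means the names are usable."""
    problems = []
    seen = set()
    for acc in accs:
        if acc in seen:
            problems.append(f"duplicate accession: {acc}")
        seen.add(acc)
        if "_" in acc:
            problems.append(f"accession contains an underscore: {acc}")
    for sample in sample_names:
        if "." in sample:
            problems.append(f"sample name contains a period: {sample}")
    return problems
```

Returning all problems at once, instead of raising on the first, lets a user fix every name in one pass.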

@ctb ctb closed this as completed in #130 Jan 17, 2022
@ctb
Member

ctb commented Jan 17, 2022

@smb20200615 this functionality is now released in genome-grist v0.8. Let me know if you end up using it; any feedback is most welcome!
