[MRG] support local genome collections (including private genomes) #130

Merged Jan 17, 2022 (66 commits)
5e4d7d0
rename rule to sourmash_prefetch_wc
ctb Jan 6, 2022
858e966
start using {outdir}/genomes/
ctb Jan 6, 2022
238f194
swizzle up config to allow private_databases and genbank_databases, etc.
ctb Jan 7, 2022
c17a8b5
more progress: copying private genomes around
ctb Jan 7, 2022
692e4fd
combine listing private and genbank genomes - seems to work\!
ctb Jan 8, 2022
aa3641f
simplify the ListGenomes stuff
ctb Jan 8, 2022
5ef6e33
it's aliiiiiiiive
ctb Jan 8, 2022
5c70c56
remove genbank accession requirement
ctb Jan 8, 2022
766cdd5
remove genbank from most filenames, rules
ctb Jan 8, 2022
b3eb9d8
rename minimap to mapping; add clean_gather
ctb Jan 8, 2022
efc78c1
updated to properly (?) use checkpoints throughout
ctb Jan 8, 2022
4deab0a
tests pass locally
ctb Jan 8, 2022
cb8bf37
fix typo
ctb Jan 8, 2022
7b8e8d1
add the beginnings of testing for private databases
ctb Jan 8, 2022
ff04270
getting started
ctb Jan 10, 2022
80fa463
update all the things
ctb Jan 11, 2022
79626e5
[MRG] Change column names in intermediate CSVs. (#133)
ctb Jan 11, 2022
453244e
comments etc.
ctb Jan 12, 2022
c63442f
remove glob pattern, configure genbank_cache
ctb Jan 12, 2022
06210b9
remove 'process' command
ctb Jan 12, 2022
c8a9413
check for old config file params
ctb Jan 12, 2022
c729c71
add important comment
ctb Jan 12, 2022
6fc8d86
actually remove 'process'
ctb Jan 12, 2022
9c5429f
check for 'database_taxonomy' instead of 'taxonomies'
ctb Jan 12, 2022
d381f6d
add trailing / in Makefile
ctb Jan 12, 2022
ea66682
add default taxonomies file to system.conf
ctb Jan 12, 2022
1ac3c58
fix test files
ctb Jan 12, 2022
eb83f2c
fix conf-private.yml
ctb Jan 12, 2022
d0207ad
start of doc/ subdirectory
ctb Jan 14, 2022
a7e19cb
initial commit
hackmd-deploy Jan 13, 2022
e2104a2
add badge
hackmd-deploy Jan 14, 2022
786dcc9
compleat first draft
hackmd-deploy Jan 14, 2022
116702e
minor corrections
hackmd-deploy Jan 14, 2022
d4666fd
spell check
ctb Jan 14, 2022
28ea610
add picklists into the config (#136)
ctb Jan 15, 2022
91949fb
fix 'taxonomies' in test config; check that it's a list
ctb Jan 15, 2022
19692c7
add comment
ctb Jan 15, 2022
cd13ce5
swipe from #97
hackmd-deploy Jan 15, 2022
69f2ea4
Merge branch 'allow/private' of github.com:dib-lab/genome-grist into …
ctb Jan 15, 2022
883e4b1
swipe getting started from #97
ctb Jan 15, 2022
683ea66
update!
hackmd-deploy Jan 15, 2022
4d76d06
Apply suggestions from Taylor's docs review
ctb Jan 15, 2022
07786d4
more update in re taylor's suggestions
hackmd-deploy Jan 15, 2022
429d1ed
more more update
hackmd-deploy Jan 15, 2022
37f62f1
even more update
hackmd-deploy Jan 15, 2022
88ad402
more update
hackmd-deploy Jan 15, 2022
ee8e371
more update
hackmd-deploy Jan 15, 2022
459ee8b
fix help output for CLI
ctb Jan 15, 2022
2d82031
Merge branch 'allow/private' of github.com:dib-lab/genome-grist into …
ctb Jan 15, 2022
0ae63b7
configure mkdocs
ctb Jan 15, 2022
1f9aa1c
clean it out
hackmd-deploy Jan 15, 2022
b456353
update gitignore
ctb Jan 15, 2022
96769e5
add some figures
ctb Jan 15, 2022
7878268
upd
ctb Jan 15, 2022
7f743cc
more figure adjustment
ctb Jan 15, 2022
5bcc054
add badges
hackmd-deploy Jan 16, 2022
679c4ee
simplify to single sourmash_databases; use 'local' instead of 'private'
ctb Jan 16, 2022
8900a7d
update to 'local' instead of 'private'
hackmd-deploy Jan 16, 2022
ee552bf
fix extra backquote
hackmd-deploy Jan 16, 2022
0a0dc82
more fix?
hackmd-deploy Jan 16, 2022
03716aa
fix formatting
ctb Jan 16, 2022
c25c815
add tax test
ctb Jan 17, 2022
58bac94
add test for picklist
ctb Jan 17, 2022
d250a46
switch SRR5950647_subset over to use local_databases_info :tada:
ctb Jan 17, 2022
5515be8
cleanup & commenting
ctb Jan 17, 2022
367fbb1
add missing file
ctb Jan 17, 2022
5 changes: 5 additions & 0 deletions .gitignore
@@ -11,3 +11,8 @@ dist/
genome_grist.egg-info/
genome_grist/version.py
outputs.*
genbank_cache
*.yml
site
.DS_Store
bak
44 changes: 38 additions & 6 deletions Makefile
@@ -1,5 +1,11 @@
all: clean-test test

flakes:
flake8 --ignore=E501 genome_grist/ tests/

black:
black .

clean-test:
rm -fr outputs.test/

@@ -8,15 +14,41 @@ test:
genome-grist run tests/test-data/SRR5950647.conf summarize_mapping summarize_tax make_sgc_conf -j 8 -p

# try various targets to make sure they work
genome-grist run tests/test-data/SRR5950647.conf download_matching_genomes -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf download_matching_genomes_info -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf download_genbank_genomes -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf combine_genome_info -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf retrieve_genomes -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf estimate_distinct_kmers -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf count_trimmed_reads -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf summarize_sample_info -j 8 -p

### private/local genomes test stuff

flakes:
flake8 --ignore=E501 genome_grist/ tests/
test-private: outputs.private/abundtrim/podar.abundtrim.fq.gz \
databases/podar-ref.zip databases/podar-ref.info.csv \
databases/podar-ref.tax.csv
genome-grist run conf-private.yml summarize_gather summarize_mapping summarize_tax -j 4 -p

black:
black .
# download the (subsampled) reads for SRR606249
outputs.private/abundtrim/podar.abundtrim.fq.gz:
mkdir -p outputs.private/abundtrim
curl -L https://osf.io/ckbq3/download -o outputs.private/abundtrim/podar.abundtrim.fq.gz

# download the ref genomes
databases/podar-ref/:
mkdir -p databases/podar-ref
curl -L https://osf.io/vbhy5/download -o databases/podar-ref.tar.gz
cd databases/podar-ref/ && tar xzf ../podar-ref.tar.gz

# sketch the ref genomes
databases/podar-ref.zip: databases/podar-ref/
sourmash sketch dna -p k=31,scaled=1000 --name-from-first \
databases/podar-ref/*.fa -o databases/podar-ref.zip

# download taxonomy
databases/podar-ref.tax.csv:
curl -L https://osf.io/4yhjw/download -o databases/podar-ref.tax.csv

# create info file and genomes directory:
databases/podar-ref.info.csv:
python -m genome_grist.copy_local_genomes databases/podar-ref/*.fa -o databases/podar-ref.info.csv -d databases/podar-ref.d
python -m genome_grist.make_info_file databases/podar-ref.info.csv
163 changes: 17 additions & 146 deletions README.md
@@ -1,156 +1,27 @@
# genome-grist: a quickstart tutorial.
# genome-grist README

This quickstart tutorial will take about 30 minutes to run, and
requires 5 GB of disk space and 4 GB of RAM, as well as a fairly
good Internet connection.
<!-- CTB: this is /README.md in dib-lab/genome-grist -->

## What is genome-grist?
<a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a>
<img alt="License: 3-Clause BSD" src="https://img.shields.io/badge/License-BSD%203--Clause-blue.svg">

genome-grist is software that automates a number of tedious metagenome tasks related to reference-based analyses on Illumina metagenomes. Specifically, genome-grist will download public metagenomes from the SRA, preprocess them, and use `sourmash gather` to identify reference genomes for the metagenome. It will then download the reference genomes, map reads to them, and summarize the mapping.
genome-grist analyzes the strain composition of microbial metagenomes
using
[minimum metagenome covers](https://dib-lab.github.io/2020-paper-sourmash-gather/)
and produces a variety of compositional and taxonomic summaries.

## Installing genome-grist
Check out the
[quick start!](https://dib-lab.github.io/genome-grist/quickstart/) And
please also see
[the rest of the docs](https://dib-lab.github.io/genome-grist/) for
more information!

We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI (which is <a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a>).
## Example: the strain composition of a gut microbiome (iHMP)

```
conda create -y -n grist python=3.8 pip
conda activate grist
python -m pip install genome-grist
```
## Running genome-grist
This figure was autogenerated by genome-grist.

We currently recommend running genome-grist in its own directory, for several reasons that include software installation (genome-grist uses snakemake and conda to install software under this directory).

Within the current working directory, genome-grist will create an `inputs` subdir, a `genbank_genomes` subdir, and any `outputs.NAME` subdirectories required by the configuration; it should be straightforward to keep projects separate by configuring the output directories appropriately.

So, create a subdirectory and change into it:
```shell
mkdir grist/
cd grist/
```
Note, genome-grist does not rely on the directory name or location in any way; it works entirely within the current working directory.

### Download a small example database

Download the GTDB release 95 set of ~32k guide genomes, in a pre-prepared sourmash database format:
```
curl -L https://osf.io/4n3m5/download -o gtdb-r95.nucleotide-k31-scaled1000.sbt.zip
```
(Any sourmash database will do as long as the sequences are named so that the full GenBank accession is the first field in the name.)

### Make a configuration file

Put the following in a config file named `conf-tutorial.yml`:
```
sample:
- SRR5950647
outdir: outputs.tutorial/
metagenome_trim_memory: 1e9
sourmash_database_glob_pattern: gtdb-r95.nucleotide-k31-scaled1000.sbt.zip
```

Notes:
* you can list multiple sample IDs here in [YAML array format](https://www.cloudbees.com/blog/yaml-tutorial-everything-you-need-get-started/): put each one on its own line after a dash (`-`).
* if you have multiple databases you can specify them here with an appropriate wild card pattern, e.g. `db/*` will work.
* if you are running this on the farm HPC at UC Davis, you can search all of genbank by *omitting* the database configuration line. Currently these files are not yet publicly available, which is why this tutorial uses GTDB instead.

### Do your first real run!

Execute:
```
genome-grist run conf-tutorial.yml summarize_mapping
```

This will perform the following steps:
* download the [HSMA33MX metagenome](https://www.ncbi.nlm.nih.gov/sra/?term=HSMA33MX) from the Sequence Read Archive (target `download_reads`).
* preprocess it to remove adapters and low-abundance k-mers (target `trim_reads`).
* build a sourmash signature from the preprocessed reads (target `smash_reads`).
* perform a `sourmash gather` against the specified database (target `gather_genbank`).
* download the matching genomes from GenBank into `genbank_genomes/` (target `download_matching_genomes`).
* map the metagenome reads to the various genomes (target `map_reads`).
* produce a summary notebook (target `summarize_mapping`).

## Output files

The key output files under the outputs directory are:

* `genbank/{sample}.x.genbank.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `genbank/{sample}.x.genbank.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `genbank/{sample}.genomes.info.csv` - information about the matching genomes from genbank.
* `reports/report-{sample}.html` - a summary report.
* `abundtrim/{sample}.abundtrim.fq.gz` - trimmed and preprocessed reads.
* `sigs/HSMA33MX.abundtrim.sig` - sourmash signature for the preprocessed reads.

Note that `genome-grist run <config.yml> zip` will create a file named `transfer.zip` with the above files in it.

## Where to insert your own files

genome-grist is built on top of [the snakemake workflow](https://snakemake.readthedocs.io/en/stable/), which lets you substitute your own files in many places.

For example,
* you can put your own `SAMPLE_1.fastq.gz`, `SAMPLE_2.fastq.gz`, and `SAMPLE_unpaired.fastq.gz` files in `raw/` to have genome-grist process the reads for you;
* you can put your own interleaved reads file in `abundtrim/SAMPLE.abundtrim.fq.gz` to run genome-grist on a private or preprocessed set of reads;
* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/SAMPLE.abundtrim.sig` to have genome-grist do the database search for you.

Please see [the genome-grist Snakefile](https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile) for all the gory details.
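
As a rough illustration of the substitution points listed above, the following Python sketch builds the paths where user-supplied files could be placed. The helper name `substitution_points` is hypothetical, and placing `raw/`, `abundtrim/`, and `sigs/` under the configured `outdir` is an assumption; the Snakefile remains the authoritative source.

```python
from pathlib import Path


def substitution_points(outdir, sample):
    """Return the paths where user-supplied files for `sample` could be placed.

    Hypothetical helper: assumes raw/, abundtrim/, and sigs/ sit under `outdir`.
    """
    base = Path(outdir)
    return {
        # paired and unpaired raw reads, to be trimmed by genome-grist
        "raw_reads": [
            base / "raw" / f"{sample}_1.fastq.gz",
            base / "raw" / f"{sample}_2.fastq.gz",
            base / "raw" / f"{sample}_unpaired.fastq.gz",
        ],
        # interleaved, already-preprocessed reads
        "trimmed_reads": base / "abundtrim" / f"{sample}.abundtrim.fq.gz",
        # precomputed sourmash signature (k=31, scaled=1000)
        "signature": base / "sigs" / f"{sample}.abundtrim.sig",
    }


paths = substitution_points("outputs.tutorial", "SAMPLE")
print(paths["trimmed_reads"].as_posix())
```

Supplying a file at a later stage (for example, the signature) should let the earlier stages be skipped for that sample, since snakemake only rebuilds missing outputs.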

## Additional targets

Recommended targets:

* summarize_gather - produce summary reports on metagenome composition
* summarize_tax - produce summary reports on taxonomic composition
* summarize_mapping - produce summary reports on k-mer and read mapping

Note that 'summarize_mapping' includes 'summarize_gather'; reports are written to
{outdir}/reports, where 'outdir' is specified in the config file.

Additional intermediate targets:

* download_reads - download SRA metagenomes specified in conf file
* trim_reads - do basic read trimming/adapter removal for metagenome reads
* smash_reads - create sourmash signatures from metagenome reads
* summarize_sample_info - build an info.yaml summary file for each metagenome
* gather_genbank - run 'sourmash gather' on metagenomes against Genbank
* download_matching_genomes - download all matching Genbank genomes
* map_reads - map all metagenome reads to Genbank genomes
* make_sgc_conf - make a spacegraphcats config file

## Other information

### Resource requirements

**Disk space:** genome-grist makes about 4-5 copies of each SRA metagenome.

**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM (unless you adjust `metagenome_trim_memory` upwards, which may be needed for complex metagenomes).

**Time:** This depends largely on the size of the metagenome; 100 million reads typically take no more than a day or two. Multiple data sets can be processed in parallel with `-j`, although you will probably want to specify resource limits. For example, here is the command that Titus uses on farm:
```
genome-grist run <config> -k --resources mem_mb=145000 -j 16
```
to run in 150GB of RAM, which will run at most one genbank search at a time.
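
The arithmetic behind "at most one genbank search at a time" can be sketched as follows. This assumes each search job requests roughly 120 GB (taken from the memory estimate above; the amount the rule actually declares is not shown here) and uses one core. `max_concurrent_jobs` is a hypothetical helper, not part of genome-grist or snakemake.

```python
def max_concurrent_jobs(budget_mb, per_job_mb, cores):
    """Upper bound on same-sized, single-core jobs that fit the memory budget and core count."""
    return min(cores, budget_mb // per_job_mb)


# --resources mem_mb=145000 -j 16, with an assumed ~120 GB per genbank search
print(max_concurrent_jobs(145_000, 120_000, 16))  # prints 1

# the same budget with an assumed 10 GB per job (the other steps stay under 10 GB)
print(max_concurrent_jobs(145_000, 10_000, 16))  # prints 14
```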

### Installing unreleased versions.

You can run genome-grist from a git checkout directory by using pip to install it in editable mode:
```
pip install -e .
```

### Support

We like to support our software!

That having been said, genome-grist is early-stage beta-level software. Please be patient and kind :).

Please ask questions and add comments [on the github issue tracker for genome-grist](https://github.com/dib-lab/genome-grist/issues).

## Why the name `grist`?

'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.) See [Grist in Wikipedia](https://en.wikipedia.org/wiki/Grist).

(It is not the [computing grist](https://en.wikipedia.org/wiki/Grist_(computing))!)
![an example image made with genome-grist](doc/gather-vs-mapping.png)

---

[CTB](https://twitter.com/ctitusbrown/) Jan 27, 2021
[CTB](https://twitter.com/ctitusbrown/) 01/22
13 changes: 13 additions & 0 deletions conf-private.yml
@@ -0,0 +1,13 @@
samples:
- podar

outdir: outputs.private/

sourmash_databases:
- databases/podar-ref.zip

local_databases_info:
- databases/podar-ref.info.csv

taxonomies:
- databases/podar-ref.tax.csv
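
A minimal sketch of how a config like this one might be sanity-checked before a run, in the spirit of the commits "check for old config file params" and "check that it's a list". The function `check_config` and its messages are hypothetical, not genome-grist's actual code; treating `sourmash_database_glob_pattern` as retired is inferred from the "remove glob pattern" commit.

```python
# Keys that hold lists in conf-private.yml above.
LIST_KEYS = ("samples", "sourmash_databases", "local_databases_info", "taxonomies")

# Assumption: this key was retired by the "remove glob pattern" commit.
RETIRED_KEYS = ("sourmash_database_glob_pattern",)


def check_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    for key in RETIRED_KEYS:
        if key in config:
            problems.append(f"'{key}' is no longer used; list databases under 'sourmash_databases'")
    for key in LIST_KEYS:
        if key in config and not isinstance(config[key], list):
            problems.append(f"'{key}' must be a list, got {type(config[key]).__name__}")
    return problems


# conf-private.yml, written out as the dict a YAML parser would produce
conf_private = {
    "samples": ["podar"],
    "outdir": "outputs.private/",
    "sourmash_databases": ["databases/podar-ref.zip"],
    "local_databases_info": ["databases/podar-ref.info.csv"],
    "taxonomies": ["databases/podar-ref.tax.csv"],
}

for problem in check_config(conf_private):
    print(problem)
```

Returning a list of problems instead of raising on the first one lets a user fix an old config file in a single pass.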