[MRG] support local genome collections (including private genomes) #130

Merged
merged 66 commits into from
Jan 17, 2022
Changes from 43 commits
Commits
5e4d7d0
rename rule to sourmash_prefetch_wc
ctb Jan 6, 2022
858e966
start using {outdir}/genomes/
ctb Jan 6, 2022
238f194
swizzle up config to allow private_databases and genbank_databases, etc.
ctb Jan 7, 2022
c17a8b5
more progress: copying private genomes around
ctb Jan 7, 2022
692e4fd
combine listing private and genbank genomes - seems to work!
ctb Jan 8, 2022
aa3641f
simplify the ListGenomes stuff
ctb Jan 8, 2022
5ef6e33
it's aliiiiiiiive
ctb Jan 8, 2022
5c70c56
remove genbank accession requirement
ctb Jan 8, 2022
766cdd5
remove genbank from most filenames, rules
ctb Jan 8, 2022
b3eb9d8
rename minimap to mapping; add clean_gather
ctb Jan 8, 2022
efc78c1
updated to properly (?) use checkpoints throughout
ctb Jan 8, 2022
4deab0a
tests pass locally
ctb Jan 8, 2022
cb8bf37
fix typo
ctb Jan 8, 2022
7b8e8d1
add the beginnings of testing for private databases
ctb Jan 8, 2022
ff04270
getting started
ctb Jan 10, 2022
80fa463
update all the things
ctb Jan 11, 2022
79626e5
[MRG] Change column names in intermediate CSVs. (#133)
ctb Jan 11, 2022
453244e
comments etc.
ctb Jan 12, 2022
c63442f
remove glob pattern, configure genbank_cache
ctb Jan 12, 2022
06210b9
remove 'process' command
ctb Jan 12, 2022
c8a9413
check for old config file params
ctb Jan 12, 2022
c729c71
add important comment
ctb Jan 12, 2022
6fc8d86
actually remove 'process'
ctb Jan 12, 2022
9c5429f
check for 'database_taxonomy' instead of 'taxonomies'
ctb Jan 12, 2022
d381f6d
add trailing / in Makefile
ctb Jan 12, 2022
ea66682
add default taxonomies file to system.conf
ctb Jan 12, 2022
1ac3c58
fix test files
ctb Jan 12, 2022
eb83f2c
fix conf-private.yml
ctb Jan 12, 2022
d0207ad
start of doc/ subdirectory
ctb Jan 14, 2022
a7e19cb
initial commit
hackmd-deploy Jan 13, 2022
e2104a2
add badge
hackmd-deploy Jan 14, 2022
786dcc9
compleat first draft
hackmd-deploy Jan 14, 2022
116702e
minor corrections
hackmd-deploy Jan 14, 2022
d4666fd
spell check
ctb Jan 14, 2022
28ea610
add picklists into the config (#136)
ctb Jan 15, 2022
91949fb
fix 'taxonomies' in test config; check that it's a list
ctb Jan 15, 2022
19692c7
add comment
ctb Jan 15, 2022
cd13ce5
swipe from #97
hackmd-deploy Jan 15, 2022
69f2ea4
Merge branch 'allow/private' of github.com:dib-lab/genome-grist into …
ctb Jan 15, 2022
883e4b1
swipe getting started from #97
ctb Jan 15, 2022
683ea66
update!
hackmd-deploy Jan 15, 2022
4d76d06
Apply suggestions from Taylor's docs review
ctb Jan 15, 2022
07786d4
more update in re taylor's suggestions
hackmd-deploy Jan 15, 2022
429d1ed
more more update
hackmd-deploy Jan 15, 2022
37f62f1
even more update
hackmd-deploy Jan 15, 2022
88ad402
more update
hackmd-deploy Jan 15, 2022
ee8e371
more update
hackmd-deploy Jan 15, 2022
459ee8b
fix help output for CLI
ctb Jan 15, 2022
2d82031
Merge branch 'allow/private' of github.com:dib-lab/genome-grist into …
ctb Jan 15, 2022
0ae63b7
configure mkdocs
ctb Jan 15, 2022
1f9aa1c
clean it out
hackmd-deploy Jan 15, 2022
b456353
update gitignore
ctb Jan 15, 2022
96769e5
add some figures
ctb Jan 15, 2022
7878268
upd
ctb Jan 15, 2022
7f743cc
more figure adjustment
ctb Jan 15, 2022
5bcc054
add badges
hackmd-deploy Jan 16, 2022
679c4ee
simplify to single sourmash_dtabases; use 'local' instead of 'private'
ctb Jan 16, 2022
8900a7d
update to 'local' instead of 'private'
hackmd-deploy Jan 16, 2022
ee552bf
fix extra backquote
hackmd-deploy Jan 16, 2022
0a0dc82
more fix?
hackmd-deploy Jan 16, 2022
03716aa
fix formatting
ctb Jan 16, 2022
c25c815
add tax test
ctb Jan 17, 2022
58bac94
add test for picklist
ctb Jan 17, 2022
d250a46
switch SRR5950647_subset over to use local_databses_info :tada:
ctb Jan 17, 2022
5515be8
cleanup & commenting
ctb Jan 17, 2022
367fbb1
add missing file
ctb Jan 17, 2022
44 changes: 38 additions & 6 deletions Makefile
@@ -1,5 +1,11 @@
all: clean-test test

flakes:
	flake8 --ignore=E501 genome_grist/ tests/

black:
	black .

clean-test:
	rm -fr outputs.test/

@@ -8,15 +14,41 @@ test:
	genome-grist run tests/test-data/SRR5950647.conf summarize_mapping summarize_tax make_sgc_conf -j 8 -p

	# try various targets to make sure they work
	genome-grist run tests/test-data/SRR5950647.conf download_matching_genomes -j 8 -p
	genome-grist run tests/test-data/SRR5950647.conf download_matching_genomes_info -j 8 -p
	genome-grist run tests/test-data/SRR5950647.conf download_genbank_genomes -j 8 -p
	genome-grist run tests/test-data/SRR5950647.conf combine_genome_info -j 8 -p
	genome-grist run tests/test-data/SRR5950647.conf retrieve_genomes -j 8 -p
	genome-grist run tests/test-data/SRR5950647.conf estimate_distinct_kmers -j 8 -p
	genome-grist run tests/test-data/SRR5950647.conf count_trimmed_reads -j 8 -p
	genome-grist run tests/test-data/SRR5950647.conf summarize_sample_info -j 8 -p

### private genomes test stuff

test-private: outputs.private/abundtrim/podar.abundtrim.fq.gz \
		databases/podar-ref.zip databases/podar-ref.info.csv \
		databases/podar-ref.tax.csv
	genome-grist run conf-private.yml summarize_gather summarize_mapping summarize_tax -j 4 -p

# download the (subsampled) reads for SRR606249
outputs.private/abundtrim/podar.abundtrim.fq.gz:
	mkdir -p outputs.private/abundtrim
	curl -L https://osf.io/ckbq3/download -o outputs.private/abundtrim/podar.abundtrim.fq.gz

# download the ref genomes
databases/podar-ref/:
	mkdir -p databases/podar-ref
	curl -L https://osf.io/vbhy5/download -o databases/podar-ref.tar.gz
	cd databases/podar-ref/ && tar xzf ../podar-ref.tar.gz

# sketch the ref genomes
databases/podar-ref.zip: databases/podar-ref/
	sourmash sketch dna -p k=31,scaled=1000 --name-from-first \
		databases/podar-ref/*.fa -o databases/podar-ref.zip

# download taxonomy
databases/podar-ref.tax.csv:
	curl -L https://osf.io/4yhjw/download -o databases/podar-ref.tax.csv

# create info file and genomes directory:
databases/podar-ref.info.csv:
	python -m genome_grist.copy_private_genomes databases/podar-ref/*.fa -o databases/podar-ref.info.csv -d databases/podar-ref.d
	python -m genome_grist.make_info_file databases/podar-ref.info.csv
17 changes: 17 additions & 0 deletions conf-private.yml
@@ -0,0 +1,17 @@
samples:
- podar

outdir: outputs.private/

genbank_databases: []

private_databases:
- databases/podar-ref.zip

private_databases_info:
- databases/podar-ref.info.csv

taxonomies:
- databases/podar-ref.tax.csv

picklist: xyz.csv::manifest
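
The commit log for this PR mentions checking that `taxonomies` is a list. A minimal sketch of that kind of check, assuming the config has already been parsed into a Python dict. The helper name `non_list_keys` and the `LIST_KEYS` tuple are hypothetical and are not taken from genome-grist's code:

```python
# Hypothetical helper: NOT genome-grist's actual validation code.
# Keys that this config file writes as YAML lists.
LIST_KEYS = (
    "samples",
    "genbank_databases",
    "private_databases",
    "private_databases_info",
    "taxonomies",
)


def non_list_keys(config, keys=LIST_KEYS):
    """Return the keys that are present in `config` but are not lists."""
    return [k for k in keys if k in config and not isinstance(config[k], list)]


# Mirrors conf-private.yml above, as it would look after YAML parsing.
config = {
    "samples": ["podar"],
    "outdir": "outputs.private/",
    "genbank_databases": [],
    "private_databases": ["databases/podar-ref.zip"],
    "private_databases_info": ["databases/podar-ref.info.csv"],
    "taxonomies": ["databases/podar-ref.tax.csv"],
    "picklist": "xyz.csv::manifest",
}

problems = non_list_keys(config)
print(problems or "all list-valued keys look fine")
```

A config that wrote `taxonomies: databases/podar-ref.tax.csv` (a bare string rather than a list) would be reported by this check.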
387 changes: 387 additions & 0 deletions doc/configuring.md

Large diffs are not rendered by default.

71 changes: 71 additions & 0 deletions doc/index.md
@@ -0,0 +1,71 @@
# Welcome to genome-grist!

genome-grist is software that automates a number of tedious metagenome tasks related to reference-based analyses of Illumina metagenomes. Specifically, genome-grist will download public metagenomes from the SRA, preprocess them, and use [sourmash `gather`](https://sourmash.bio) to identify reference genomes for the metagenome. It will then download the reference genomes, map reads to them, and summarize the mapping in a variety of ways.

## Quickstart

@@ link

## Configuring genome-grist

@@ link

## Example figures and output

@@

## Other information

### Preprints and publications

genome-grist was used for many of the analyses and created a number of the figures in the preprint [Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers](https://dib-lab.github.io/2020-paper-sourmash-gather/), Irber et al., 2022.

This paper is the primary citation for genome-grist. Any use of genome-grist should be cited as follows:

> **Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers.**
>
> Luiz Carlos Irber, Phillip T Brooks, Taylor E Reiter, N Tessa Pierce-Ward, Mahmudur Rahman Hera, David Koslicki, C. Titus Brown.
>
> bioRxiv 2022.01.11.475838; doi: https://doi.org/10.1101/2022.01.11.475838

### Resource requirements

**Disk space:** genome-grist makes about 4-5 copies of each SRA metagenome.

**Memory:** the search step against all of GenBank takes ~120 GB of RAM; against GTDB it takes much less. All other steps need under 10 GB of RAM (unless you raise `metagenome_trim_memory`, which may be needed for complex metagenomes).

**Time:** This depends largely on the size of the metagenome; 100 million reads take a few hours. Multiple data sets can be processed in parallel with `-j`, although you will probably want to specify resource limits. For example, here is the command that we use on our HPC:
```
genome-grist run <config> -k --resources mem_mb=145000 -j 16
```
to run in under 150 GB of RAM, which allows at most one GenBank search at a time.
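
As a rough illustration of why this limit admits only one GenBank search at a time: snakemake's `--resources` caps the summed resources of concurrently running jobs. The sketch below assumes each search declares about 120 GB and each other step about 10 GB; both figures come from the memory estimates above, not from the Snakefile, and `max_concurrent` is a hypothetical helper.

```python
def max_concurrent(total_mem_mb, per_job_mem_mb, jobs):
    """Upper bound on simultaneous jobs under a memory budget and a -j limit."""
    if per_job_mem_mb <= 0:
        return jobs
    return min(jobs, total_mem_mb // per_job_mem_mb)


TOTAL_MEM_MB = 145000   # from --resources mem_mb=145000
JOBS = 16               # from -j 16
SEARCH_MEM_MB = 120000  # assumption: ~120 GB per GenBank search
OTHER_MEM_MB = 10000    # assumption: ~10 GB for any other step

print("GenBank searches at once:", max_concurrent(TOTAL_MEM_MB, SEARCH_MEM_MB, JOBS))
print("other steps at once:", max_concurrent(TOTAL_MEM_MB, OTHER_MEM_MB, JOBS))
```

Under these assumptions a second search never fits in the remaining budget, while the lighter steps can still run side by side.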

### Support and help

We like to support our software!

That having been said, genome-grist is still in beta. Please be patient and kind :).

Please ask questions and add comments [on the github issue tracker for genome-grist](https://github.com/dib-lab/genome-grist/issues).

## Why the name `grist`?

'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.). See [Grist in Wikipedia](https://en.wikipedia.org/wiki/Grist).

(It is not the [computing grist](https://en.wikipedia.org/wiki/Grist_(computing))!)

### Installing in developer mode

You can run genome-grist from a git checkout directory by using pip to install it in editable mode:
```
pip install -e .
```

Or you can pip install the latest version from GitHub:
```
pip install git+https://github.com/dib-lab/genome-grist.git
```

---

[CTB](https://twitter.com/ctitusbrown/) Jan 15, 2022
150 changes: 150 additions & 0 deletions doc/quickstart.md
@@ -0,0 +1,150 @@
---
sort: 1
---

# Get started

## Installation

We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI (which is <a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a>).


```sh
conda create -y -n grist python=3.8 pip
conda activate grist
python -m pip install genome-grist
```


## Running genome-grist

We currently recommend running genome-grist in its own directory, for several reasons that include software installation (genome-grist uses snakemake and conda to install software under this directory).

Within the current working directory, genome-grist will create an `inputs` subdir, a `genbank_genomes` subdir, and any `outputs.NAME` subdirectories required by the configuration; it should be straightforward to keep projects separate by configuring the output directories appropriately.

So, create a subdirectory and change into it:
```shell
mkdir grist/
cd grist/
```
Note that genome-grist does not rely on the directory name or location in any way; it works entirely within the current working directory.


### Download a small example database

Download the GTDB release 95 set of ~32k guide genomes, in a pre-prepared sourmash database format:
```
curl -L https://osf.io/4n3m5/download -o gtdb-r95.nucleotide-k31-scaled1000.sbt.zip
```
(Any sourmash database will do as long as the sequences are named so that the full GenBank accession is the first field in the name.)
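
To make the naming requirement concrete, here is a minimal sketch that pulls the accession out of a sequence name. It assumes "first field" means the first whitespace-separated token; `accession_from_name` is a hypothetical helper, not part of genome-grist or sourmash.

```python
def accession_from_name(name):
    """Return the first whitespace-separated field of a sequence name."""
    fields = name.split()
    return fields[0] if fields else ""


# Example name in the expected style: accession first, description after.
name = "GCF_000005845.2 Escherichia coli str. K-12 substr. MG1655"
print(accession_from_name(name))
```

A name that starts with free text rather than an accession would not satisfy the requirement stated above.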


### Make a configuration file

Put the following in a config file named `conf-tutorial.yml`:
```
sample:
- HSMA33MX
outdir: outputs.tutorial/
metagenome_trim_memory: 1e9
sourmash_database_glob_pattern: gtdb-r95.nucleotide-k31-scaled1000.sbt.zip
```

:information_source: Notes:
* you can put multiple sample IDs here, in a [YAML array format](https://www.cloudbees.com/blog/yaml-tutorial-everything-you-need-get-started/) - put each one on a new line after a dash (`-`).
* if you have multiple databases you can specify them here with an appropriate wild card pattern, e.g. `db/*` will work.
* if you are running this on the farm HPC at UC Davis, you can search all of genbank by *omitting* the database configuration line. Currently these files are not yet publicly available, which is why this tutorial uses GTDB instead.
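
As an illustration of the YAML array format mentioned in the notes, the following sketch renders the same config for more than one sample. `render_config` is a hypothetical helper written for this example; the keys are copied from `conf-tutorial.yml` above, and the second sample ID is the one used by the project's test config.

```python
def render_config(samples, outdir, database):
    """Render a config like conf-tutorial.yml, one dash-prefixed line per sample."""
    lines = ["sample:"]
    lines += [f"- {sample}" for sample in samples]
    lines.append(f"outdir: {outdir}")
    lines.append("metagenome_trim_memory: 1e9")
    lines.append(f"sourmash_database_glob_pattern: {database}")
    return "\n".join(lines) + "\n"


text = render_config(
    ["HSMA33MX", "SRR5950647"],
    "outputs.tutorial/",
    "gtdb-r95.nucleotide-k31-scaled1000.sbt.zip",
)
print(text, end="")

# To save it for use with `genome-grist run`:
# with open("conf-tutorial.yml", "w") as fp:
#     fp.write(text)
```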



### Do your first real run!

Execute:
```
genome-grist run conf-tutorial.yml summarize
```


This will perform the following steps:
* download the [HSMA33MX metagenome](https://www.ncbi.nlm.nih.gov/sra/?term=HSMA33MX) from the Sequence Read Archive (target `download_reads`).
* preprocess it to remove adapters and low-abundance k-mers (target `trim_reads`).
* build a sourmash signature from the preprocessed reads (target `smash_reads`).
* perform a `sourmash gather` against the specified database (target `gather_genbank`).
* download the matching genomes from GenBank into `genbank_genomes/` (target `download_matching_genomes`).
* map the metagenome reads to the various genomes (target `map_reads`).
* produce a summary notebook (target `summarize`).

The default target is `gather_genbank`, and you can put one or more targets on the command line as above with `summarize`.
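
If you drive genome-grist from a script, the command line with one or more targets can be assembled as in this sketch. `grist_command` is a hypothetical wrapper; it uses only the `genome-grist run <config> <targets> -j N` form shown in this documentation.

```python
def grist_command(config_file, targets=(), jobs=None):
    """Build the argument list for `genome-grist run` with optional targets."""
    cmd = ["genome-grist", "run", config_file, *targets]
    if jobs is not None:
        cmd += ["-j", str(jobs)]
    return cmd


# Run only the first two steps from the list above, with 4 parallel jobs.
cmd = grist_command("conf-tutorial.yml", ["download_reads", "trim_reads"], jobs=4)
print(" ".join(cmd))

# To execute it (requires genome-grist to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Leaving `targets` empty falls back to whatever default target genome-grist uses.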



## Output files

The key output files under the outputs directory are:

* `genbank/{sample}.x.genbank.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `genbank/{sample}.x.genbank.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `genbank/{sample}.genomes.info.csv` - information about the matching genomes from genbank.
* `reports/report-{sample}.html` - a summary report.
* `abundtrim/{sample}.abundtrim.fq.gz` - trimmed and preprocessed reads.
* `sigs/{sample}.abundtrim.sig` - sourmash signature for the preprocessed reads.

Note that `genome-grist run <config.yml> zip` will create a file named `transfer.zip` with the above files in it.
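
A small sketch for checking which of these files exist after a run. The patterns are copied from the list above (with the sample name generalized in the signature path); `expected_outputs` and `missing_outputs` are hypothetical helpers, and the output directory is whatever `outdir` is set to in your config.

```python
from pathlib import Path

# Path patterns relative to the output directory, as listed above.
OUTPUT_PATTERNS = [
    "genbank/{sample}.x.genbank.gather.out",
    "genbank/{sample}.x.genbank.gather.csv",
    "genbank/{sample}.genomes.info.csv",
    "reports/report-{sample}.html",
    "abundtrim/{sample}.abundtrim.fq.gz",
    "sigs/{sample}.abundtrim.sig",
]


def expected_outputs(outdir, sample):
    """Return the key output paths for one sample."""
    return [Path(outdir) / p.format(sample=sample) for p in OUTPUT_PATTERNS]


def missing_outputs(outdir, sample):
    """Return the expected output paths that do not exist on disk."""
    return [p for p in expected_outputs(outdir, sample) if not p.exists()]


for path in expected_outputs("outputs.tutorial", "HSMA33MX"):
    print(path.as_posix())
```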


## Where to insert your own files

genome-grist is built on top of [the snakemake workflow](https://snakemake.readthedocs.io/en/stable/), which lets you substitute your own files in many places.

For example,
* you can put your own `SAMPLE_1.fastq.gz`, `SAMPLE_2.fastq.gz`, and `SAMPLE_unpaired.fastq.gz` files in `raw/` to have genome-grist process reads for you;
* you can put your own interleaved reads file in `abundtrim/SAMPLE.abundtrim.fq.gz` to run genome-grist on a private or preprocessed set of reads;
* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/SAMPLE.abundtrim.sig` if you want to have it do the database search for you.

Please see [the genome-grist Snakefile](https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile) for all the gory details.
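
As one example of the substitutions above, this sketch copies an existing interleaved reads file into the `abundtrim/` location. It assumes these paths are relative to the configured output directory; `stage_reads` is a hypothetical helper and the file names in the usage comment are placeholders.

```python
import shutil
from pathlib import Path


def stage_reads(reads_file, outdir, sample):
    """Copy a reads file to <outdir>/abundtrim/<sample>.abundtrim.fq.gz."""
    dest = Path(outdir) / "abundtrim" / f"{sample}.abundtrim.fq.gz"
    dest.parent.mkdir(parents=True, exist_ok=True)  # create abundtrim/ if needed
    shutil.copyfile(reads_file, dest)
    return dest


# Usage (placeholder file names):
# stage_reads("my-reads.fq.gz", "outputs.tutorial", "SAMPLE")
```

The sample name used here must also appear in the config file so that genome-grist knows to process it.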


## Other information

### Resource requirements

**Disk space:** genome-grist makes about 4-5 copies of each SRA metagenome.

**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM (unless you adjust `metagenome_trim_memory` upwards, which may be needed for complex metagenomes).

**Time:** This depends largely on the size of the metagenome; 100 million reads typically take less than a day or two. Multiple data sets can be processed in parallel with `-j`, although you will probably want to specify resource limits. For example, here is the command that Titus uses on farm:
```
genome-grist run <config> -k --resources mem_mb=145000 -j 16
```
to run in under 150 GB of RAM, which allows at most one GenBank search at a time.


### Installing unreleased versions

You can run genome-grist from a git checkout directory by using pip to install it in editable mode:
```
pip install -e .
```

Or you can pip install the latest version from GitHub:
```
pip install git+https://github.com/dib-lab/genome-grist.git
```

### Support

We like to support our software!

That having been said, genome-grist is early-stage beta-level software. Please be patient and kind :).

Please ask questions and add comments [on the github issue tracker for genome-grist](https://github.com/dib-lab/genome-grist/issues).

## Why the name `grist`?

'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.). See [Grist in Wikipedia](https://en.wikipedia.org/wiki/Grist).

(It is not the [computing grist](https://en.wikipedia.org/wiki/Grist_(computing))!)

---

[CTB](https://twitter.com/ctitusbrown/) Jan 27, 2021