Nextstrain build pipeline for the WestNile 4K Project
This is the repository used to build nextstrain.org/WNV/NA
This repository contains the steps to use augur to build the WNV/NA dataset.
- Install conda
- Install augur (and its dependencies) into a conda environment
git clone git@github.com:nextstrain/augur.git # the nextstrain bioinformatics toolkit
cd augur
conda env create -f environment.yml
export NCBI_EMAIL=<YOUR_EMAIL_HERE>
This creates the conda environment augur, which must be active for all remaining steps.
- Enable the conda environment
source activate augur
- Install auspice
conda install -c conda-forge nodejs
npm install --global auspice
- Clone this repository
git clone git@github.com:grubaughlab/WNV-nextstrain.git
cd WNV-nextstrain
- Check augur & auspice are installed:
augur -h
auspice -h
- Snakefile: contains the augur / WNV-custom steps to run the build. Each snakemake command can be run as a bash command on its own, but we use snakemake to simplify things.
- ./data/*: the input files (private, and not committed to GitHub). You are responsible for creating the two required files here: ./data/full_dataset.fasta and ./data/headers.csv (these are referenced in the Snakefile).
- ./scripts/*: custom WNV scripts, called by commands in the Snakefile.
- ./results/: augur will produce a number of (intermediate) files here, including the alignment, newick trees etc. Not committed to GitHub.
- ./auspice/: will contain the JSONs necessary for visualisation by auspice.
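Since the two input files in ./data/ are not committed, it can help to check they are in place before starting a 40-minute build. The following is only a minimal sketch, not part of this repository: the helper name `missing_inputs` is hypothetical, and it assumes it is called with the repository root.

```python
from pathlib import Path

# The two required input files, relative to the repository root.
REQUIRED_INPUTS = ("data/full_dataset.fasta", "data/headers.csv")


def missing_inputs(repo_root="."):
    """Return the required input files that do not exist under repo_root."""
    root = Path(repo_root)
    return [name for name in REQUIRED_INPUTS if not (root / name).is_file()]
```

Calling `missing_inputs()` from the repository root returns an empty list when both files are present, so a non-empty result is a cue to create them before running snakemake.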
The Snakefile details each step in the build (see that file for the specifics).
As such, it should be as simple as running
snakemake clean # remove any files from a previous build
snakemake # run the build pipeline. Takes about 40min
and the entire build will run through.
It's worth explaining some of the commands here, many of which are quick and can be re-run on their own to change the output. (For instance, changing colours doesn't require you to re-run the tree building steps.)
The commands listed will re-run just those steps -- so it's best to have run through the entire Snakefile
before tweaking steps. Note that you'll also have to run snakemake --printshellcmds --force export
to regenerate the auspice JSONs for viewing.
snakemake clean
snakemake --printshellcmds --force parse
snakemake --printshellcmds --force add_authors
snakemake export # will run all the remaining steps
The parse step reads the input CSV + FASTA -- this involves parsing the dates, interpreting the header of the CSV, etc.
The authors are added by a mixture of pattern matching strain names, as well as querying entrez for author information.
The latter step is slow, and so a cache is created at ./results/author_cache.tsv
so that repeating this step can run faster.
See ./scripts/add_authors.py
for more information.
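To illustrate why the cache makes repeat runs faster, here is a rough sketch of the general pattern. It is not the code in ./scripts/add_authors.py: the function names, the two-column (strain, authors) TSV layout, and the `lookup` callable standing in for the slow entrez query are all assumptions made for the example.

```python
import csv
from pathlib import Path


def load_cache(path):
    """Read a two-column TSV (strain, authors) into a dict; empty if absent."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        return {row[0]: row[1] for row in rows if len(row) >= 2}


def authors_with_cache(strains, lookup, cache_path):
    """Return {strain: authors}, calling the slow `lookup` only on cache misses."""
    cache = load_cache(cache_path)
    dirty = False
    for strain in strains:
        if strain not in cache:
            cache[strain] = lookup(strain)  # the slow step (hypothetical callable)
            dirty = True
    if dirty:
        # Rewrite the cache so the next run can skip these lookups.
        path = Path(cache_path)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("w", newline="") as handle:
            csv.writer(handle, delimiter="\t").writerows(sorted(cache.items()))
    return {strain: cache[strain] for strain in strains}
```

On a second run with the same strains, `lookup` is never called, which is the behaviour the cache file is there to provide.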
snakemake --printshellcmds --force create_colors
snakemake --printshellcmds --force export
This uses the ./scripts/make_colors.py
script to dynamically generate a colour palette.
Please edit this file to make changes to the colour scheme.
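As a rough idea of what "dynamically generating" a palette can look like, the sketch below assigns evenly spaced hues to however many distinct values appear in the metadata. This is an assumption-laden illustration and not the logic of ./scripts/make_colors.py; the function name `make_palette` and the fixed saturation and brightness values are invented for the example.

```python
import colorsys


def make_palette(values):
    """Map each distinct value to a hex colour with evenly spaced hues."""
    unique = sorted(set(values))
    palette = {}
    for index, value in enumerate(unique):
        hue = index / len(unique)  # spread hues evenly around the colour wheel
        red, green, blue = colorsys.hsv_to_rgb(hue, 0.65, 0.85)
        palette[value] = "#{:02x}{:02x}{:02x}".format(
            round(red * 255), round(green * 255), round(blue * 255)
        )
    return palette
```

Because the number of colours follows the data, adding a new state to the metadata changes the palette without any hand-edited colour list.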
snakemake --printshellcmds --force create_lat_longs
snakemake --printshellcmds --force export
This uses the ./scripts/create_lat_longs.py
script to dynamically generate the lat-longs based on the contents of the metadata file.
Currently all the states are hardcoded here (though only those present in the metadata are actually exported), and the divisions are created dynamically by averaging the GPS values provided for each sample. The latter approach could be improved.
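The averaging of GPS values per division can be sketched as follows. This is a simplified illustration rather than the code in ./scripts/create_lat_longs.py: the function name and the (division, latitude, longitude) tuple layout are assumptions.

```python
from collections import defaultdict


def average_lat_longs(samples):
    """Average (latitude, longitude) per division.

    `samples` is an iterable of (division, latitude, longitude) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # lat sum, long sum, count
    for division, latitude, longitude in samples:
        entry = totals[division]
        entry[0] += latitude
        entry[1] += longitude
        entry[2] += 1
    return {
        division: (lat_sum / count, lon_sum / count)
        for division, (lat_sum, lon_sum, count) in totals.items()
    }
```

A plain arithmetic mean like this places each division at the centre of its sampled points, not at its geographic centre, so heavily sampled locations pull the marker towards them.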
From within the current directory, simply run auspice view --datasetDir ./auspice
and then load http://localhost:4000/ in a browser to see the results 🎉
Currently this has to be done from the Bedford lab.