HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK generates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.
It works by using count-min sketching to create a k-mer spectrum from a data stream. After some reads have been added to the k-mer spectrum, HULK begins to process the counter frequencies and populates a histosketch. Like MinHash sketches, histosketches can be used to estimate similarity between microbiome samples.
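To make the count-min step concrete, here is a minimal Python sketch of counting k-mers from a stream of reads in a fixed-size table. It illustrates the general count-min technique only and is not HULK's Go implementation; the class name, the `width` and `depth` parameters and the choice of BLAKE2b as the hash function are all assumptions made for this example.

```python
import hashlib


def kmers(read, k):
    """Yield every k-mer (substring of length k) in a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


class CountMinSketch:
    """Fixed-size approximate counter: `depth` rows of `width` counters each."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting the hash with the row number gives one hash function per row.
        digest = hashlib.blake2b(
            item.encode(), digest_size=8, salt=bytes([row])
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate (never below the true count).
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


# Stream two toy reads into the sketch, one k-mer at a time.
cms = CountMinSketch()
for read in ["ACGTACGT", "CGTACGTA"]:
    for kmer in kmers(read, 4):
        cms.add(kmer)

print(cms.estimate("ACGT"))  # at least the true count of 3
```

Memory use is fixed by `width * depth` regardless of how many reads are streamed, which is the property that lets a sketch be built incrementally without holding the full data set.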
The advantages of HULK include:
- it's fast and can run on a laptop in minutes
- hulk sketches are compact and have a fixed size
- it works on data streams and does not require complete data instances
- it can account for concept drift when histosketching
- you get to type hulk smash into the command line...
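The concept-drift item in the list above can be illustrated with gradual forgetting, where older observations are down-weighted so that a streaming histogram tracks recent data. The snippet below is a simplified Python illustration using exponential decay on plain counts; the function name and the `decay_rate` parameter are hypothetical, and this is not a description of how HULK implements it.

```python
import math


def decayed_histogram(stream, decay_rate=0.1):
    """Histogram in which an item seen `age` steps ago weighs exp(-decay_rate * age)."""
    hist = {}
    factor = math.exp(-decay_rate)
    for item in stream:
        # Age every existing bin by one step...
        for key in hist:
            hist[key] *= factor
        # ...then add the new observation at full weight.
        hist[item] = hist.get(item, 0.0) + 1.0
    return hist


# "A" and "B" each occur three times, but "B" is more recent.
hist = decayed_histogram(["A", "A", "A", "B", "B", "B"], decay_rate=0.5)
print(hist)
```

With a decay rate of zero this reduces to ordinary counting; larger values make the histogram forget old items faster.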
Finally, you can use hulk sketches with a machine learning classifier to bin microbiome samples (see BANNER). More info on this coming soon...
Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.
conda install hulk
Note: if using Conda, make sure you have added the Bioconda channel first.
HULK is written in Go (v1.10). To compile from source, you will first need the Go toolchain. Once you have it, try something like this to compile:
# Clone this repository
git clone https://github.com/will-rowe/hulk.git
# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...
# Run the unit tests
go test -v ./...
# Compile the program
go build ./
# Call the program
./hulk --help
HULK is called by typing hulk, followed by the subcommand you wish to run. There are three main subcommands: sketch, distance and smash. This quick start shows how to get things running, but following the full documentation is recommended.
# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -p 8 -o sampleA
# Get similarity measures between two hulk sketches
hulk distance -1 sampleA.sketch -2 sampleB.sketch
# Get a pairwise Jaccard Similarity matrix for a set of hulk sketches
hulk smash --jsMatrix -d ./dir-with-sketches-in/ -o my-jsMatrix
# Create a sketch matrix to train a Random Forest Classifier (see banner)
## smash all the sketches from one sample type (labeled 0)
hulk smash --bannerMatrix -o abx-treated -l 0
## smash all the sketches from another sample type (labeled 1), this time recursively
hulk smash --bannerMatrix --sketchDir ./no-abx-sketches --recursive -o no-abx -l 1
# join both samples into one matrix
cat abx-treated.banner-matrix.csv no-abx.banner-matrix.csv > training.csv
# Train a Random Forest Classifier (make sure you have banner)
conda install banner
banner train --matrix training.csv
# Predict!
hulk sketch -f mystery-sample.fastq --stream -p 8 | banner predict -m banner.rfc
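To show what a pairwise Jaccard similarity matrix over a set of sketches contains, here is a small Python illustration. It assumes each sketch is a fixed-length list of values and estimates similarity as the fraction of slots that agree, as is done for MinHash-style sketches. The toy sketch values and function names are made up for this example, and the code does not read HULK's sketch files or reproduce its output format.

```python
def sketch_similarity(a, b):
    """Estimate Jaccard similarity as the fraction of sketch slots that agree."""
    if not a or len(a) != len(b):
        raise ValueError("sketches must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_matrix(sketches):
    """Return sample names (sorted) and the all-vs-all similarity matrix."""
    names = sorted(sketches)
    matrix = [
        [sketch_similarity(sketches[row], sketches[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Three toy sketches of length 4 (hypothetical values).
sketches = {
    "sampleA": [3, 17, 42, 8],
    "sampleB": [3, 17, 40, 8],
    "sampleC": [5, 11, 42, 9],
}

names, matrix = similarity_matrix(sketches)
for name, row in zip(names, matrix):
    print(name, row)
```

Because every sketch has the same fixed size, each comparison costs the same no matter how large the original sequencing runs were.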
Please see the documentation on Read the Docs for more detail; a tutorial will be forthcoming.
A paper describing HULK is published in Microbiome:
Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019.