
stfc/histogramSketcher

 
 


Histosketching Using Little Kmers



Overview

HULK is a tool that creates small, fixed-size sketches from streaming microbiome sequencing data, enabling rapid metagenomic dissimilarity analysis. HULK generates a k-mer spectrum from a FASTQ data stream, incrementally sketches it and makes similarity search queries against other microbiome sketches.

It works by using count-min sketching to create a k-mer spectrum from a data stream. After some reads have been added to a k-mer spectrum, HULK begins to process the counter frequencies and populates a histosketch. Similarly to MinHash sketches, histosketches can be used to estimate similarity between microbiome samples.
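As a rough illustration of the count-min step described above, the following Go sketch counts overlapping k-mers from a read in a small count-min structure. It is a minimal sketch under stated assumptions, not HULK's own code: the `CountMin` type, its methods, the FNV-based row hashing and the table dimensions are all hypothetical choices made for this example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// CountMin is a hypothetical, minimal count-min sketch:
// depth rows, each holding width counters. depth must be at least 1.
type CountMin struct {
	width, depth uint32
	rows         [][]uint32
}

// NewCountMin allocates a sketch with the given dimensions.
func NewCountMin(width, depth uint32) *CountMin {
	rows := make([][]uint32, depth)
	for i := range rows {
		rows[i] = make([]uint32, width)
	}
	return &CountMin{width: width, depth: depth, rows: rows}
}

// index maps a k-mer to a counter in the given row. Prefixing the
// row number to the hashed bytes gives each row a different hash
// function (an illustrative choice, not HULK's hashing scheme).
func (c *CountMin) index(kmer string, row uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte{byte(row)})
	h.Write([]byte(kmer))
	return h.Sum32() % c.width
}

// Add increments one counter per row for the k-mer.
func (c *CountMin) Add(kmer string) {
	for r := uint32(0); r < c.depth; r++ {
		c.rows[r][c.index(kmer, r)]++
	}
}

// Estimate returns the minimum counter across rows. Collisions can
// only inflate counters, so this never underestimates the true count.
func (c *CountMin) Estimate(kmer string) uint32 {
	est := c.rows[0][c.index(kmer, 0)]
	for r := uint32(1); r < c.depth; r++ {
		if v := c.rows[r][c.index(kmer, r)]; v < est {
			est = v
		}
	}
	return est
}

// AddRead splits a read into overlapping k-mers and counts each one.
// Reads shorter than k contribute nothing.
func (c *CountMin) AddRead(read string, k int) {
	for i := 0; i+k <= len(read); i++ {
		c.Add(read[i : i+k])
	}
}

func main() {
	cm := NewCountMin(1024, 4)
	cm.AddRead("ACGTACGTAC", 4)
	// "ACGT" occurs twice in the read, so the estimate is at least 2.
	fmt.Println(cm.Estimate("ACGT"))
}
```

The memory used depends only on `width` and `depth`, not on the number of reads seen, which is what makes this kind of structure suitable for a data stream.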

The advantages of HULK include:

  • it's fast and can run on a laptop in minutes
  • hulk sketches are compact and of a fixed size
  • it works on data streams and does not require complete data instances
  • it can account for concept drift when histosketching
  • you get to type hulk smash into the command line...

Finally, you can use hulk sketches with a machine learning classifier to bin microbiome samples (see BANNER). More info on this coming soon...

Installation

Check out the releases to download a binary. Alternatively, install using Bioconda or compile the software from source.

Bioconda

conda install hulk

Note: if using Conda, make sure you have added the Bioconda channel first.

Source

HULK is written in Go (v1.10). To compile from source you will first need the Go toolchain. Once you have it, try something like this:

# Clone this repository
git clone https://github.com/will-rowe/hulk.git

# Go into the repository and get the package dependencies
cd hulk
go get -d -t -v ./...

# Run the unit tests
go test -v ./...

# Compile the program
go build ./

# Call the program
./hulk --help

Quick Start

HULK is called by typing hulk, followed by the subcommand you wish to run. There are three main subcommands: sketch, distance and smash. This quick start will show you how to get things running but it is recommended to follow the documentation.

# Create a hulk sketch
gunzip -c microbiome.fq.gz | hulk sketch -p 8 -o sampleA

# Get similarity measures between two hulk sketches
hulk distance -1 sampleA.sketch -2 sampleB.sketch

# Get a pairwise Jaccard Similarity matrix for a set of hulk sketches
hulk smash --jsMatrix -d ./dir-with-sketches-in/ -o my-jsMatrix

# Create a sketch matrix to train a Random Forest Classifier (see banner)
## smash all the sketches from one sample type (labeled 0)
hulk smash --bannerMatrix -o abx-treated -l 0
## smash all the sketches from another sample type (labeled 1), this time recursively
hulk smash --bannerMatrix --sketchDir ./no-abx-sketches --recursive -o no-abx -l 1
## join both samples into one matrix
cat abx-treated.banner-matrix.csv no-abx.banner-matrix.csv > training.csv

# Train a Random Forest Classifier (make sure you have banner)
conda install banner
banner train --matrix training.csv

# Predict!
hulk sketch -f mystery-sample.fastq --stream -p 8 | banner predict -m banner.rfc
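
The distance and smash steps above compare fixed-size sketches. The Go snippet below is a minimal sketch of one way a Jaccard-style similarity can be estimated from two equal-length sketches, MinHash-fashion: count the positions where both sketches hold the same element. The `JaccardEstimate` function and the sample values are hypothetical and for illustration only; HULK's sketch file format and its own distance implementation are not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// JaccardEstimate returns the fraction of positions at which two
// equal-length sketches hold the same element. This is a
// hypothetical helper, not part of HULK's API.
func JaccardEstimate(a, b []uint32) (float64, error) {
	if len(a) != len(b) {
		return 0, errors.New("sketches must have the same length")
	}
	if len(a) == 0 {
		return 0, errors.New("sketches must not be empty")
	}
	matches := 0
	for i := range a {
		if a[i] == b[i] {
			matches++
		}
	}
	return float64(matches) / float64(len(a)), nil
}

func main() {
	// Made-up sketch contents; real sketches would be much longer.
	sampleA := []uint32{12, 7, 7, 31, 4, 19, 2, 8}
	sampleB := []uint32{12, 7, 5, 31, 4, 22, 2, 9}

	js, err := JaccardEstimate(sampleA, sampleB)
	if err != nil {
		panic(err)
	}
	// Five of the eight positions match, so this prints 0.625.
	fmt.Printf("estimated Jaccard similarity: %.3f\n", js)
}
```

Because every sketch has the same fixed size, a comparison like this costs the same regardless of how many reads went into each sample.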

Further Information & Citing

Please see the documentation on Read the Docs for more extensive information. A tutorial will be forthcoming.

A paper describing HULK is published in Microbiome:

Rowe WPM et al. Streaming histogram sketching for rapid microbiome analytics. Microbiome. 2019.
