This code implements a deep LSTM neural network for basecalling from raw PacBio single-molecule real-time (SMRT) instrument "traces". In short, while determining the sequence of the DNA in an input sample, the Pacific Biosciences sequencer emits [number] parallel signals, each a four-channel time series (one channel corresponding to each of A, T, C, and G) that must be "called" into a sequence string (e.g. "ATCTGAGTACCATGACATG..."). The single-pass error rate of the PacBio sequencer is currently around 13%. An improvement in the error rate would be of significant value to users of the platform, enabling substantial cost reductions and more powerful inquiry.
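To make the problem concrete, here is a toy sketch (made-up data and shapes, not the actual PacBio trace format) of what a four-channel trace looks like and why a naive per-frame argmax is not a sufficient basecaller:

```python
from __future__ import print_function
import numpy as np

# Hypothetical toy trace: T time steps x 4 channels (A, C, G, T).
# Real PacBio traces are stored in HDF5 and are far longer and noisier.
np.random.seed(0)
BASES = "ACGT"
true_seq = "ACGTTGCA"

# Simulate each base as a short noisy pulse on its channel.
frames = []
for b in true_seq:
    pulse = np.random.rand(np.random.randint(3, 7), 4) * 0.2  # background noise
    pulse[:, BASES.index(b)] += 1.0                           # signal on one channel
    frames.append(pulse)
trace = np.vstack(frames)                                      # shape (T, 4)

# Naive "basecaller": argmax per frame, then collapse repeated calls.
per_frame = [BASES[i] for i in trace.argmax(axis=1)]
collapsed = [c for i, c in enumerate(per_frame) if i == 0 or c != per_frame[i - 1]]
print("naive call:", "".join(collapsed))
print("truth:     ", true_seq)
```

Even on this clean synthetic data, collapsing repeated calls merges the homopolymer "TT" into a single T; handling run lengths, pulse widths, and noise is what motivates a learned (LSTM) model over the raw trace.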
On Mac OS X:
brew install homebrew/science/hdf5
On Linux:
sudo apt-get install libhdf5-dev
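The raw traces are read from HDF5 files. As a quick sanity check that the HDF5 library works, you can inspect a trace file from Python (this assumes h5py is installed in the virtualenv, e.g. via `pip install h5py`, and uses a hypothetical file name):

```python
from __future__ import print_function
import h5py

# Open an HDF5 file and list its groups/datasets with their shapes.
with h5py.File("example_trace.h5", "r") as f:
    def show(name, obj):
        print(name, getattr(obj, "shape", ""))
    f.visititems(show)
```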
virtualenv venv
source venv/bin/activate
(from the TensorFlow documentation)
Ubuntu/Linux 64-bit, CPU only, Python 2.7:
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.8.0rc0-cp27-none-linux_x86_64.whl
Ubuntu/Linux 64-bit, GPU enabled, Python 2.7. Requires CUDA toolkit 7.5 and CuDNN v4; for other versions, see "Install from sources" in the TensorFlow documentation.
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.8.0rc0-cp27-none-linux_x86_64.whl
Mac OS X, CPU only, Python 2.7:
$ sudo easy_install --upgrade six
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/mac/tensorflow-0.8.0rc0-py2-none-any.whl
We found it was a challenge to configure TensorFlow to leverage GPUs on AWS g2.4xlarge instances, but we have included a script describing how we did it.
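Once the GPU build is installed, a quick sanity check (a generic snippet using the TensorFlow 0.8-era session API and device-placement logging; it is not part of tracer) can confirm that TensorFlow actually sees the GPU:

```python
from __future__ import print_function
import tensorflow as tf

# Log which device (CPU vs. /gpu:0) each op is placed on.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]], name="a")
    b = tf.constant([[1.0, 0.0], [0.0, 1.0]], name="b")
    print(sess.run(tf.matmul(a, b)))  # device placement is logged to stderr
```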
From the root of the repo:
make
tracer-train --data_dir=$TRACERDIR/data --train_dir=$TRACERDIR/data/checkpoints --size=256 --num_layers=3 --in_vocab_size=20000
tracer-decode --model=[path to model file] --input=[path to input traces] --output=[path to which to write output]
tracer-eval --inputCalls=[path to called sequences from tracer-decode] --inputKey=[path to reference (true) sequences] --output=[path to which to write output]
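The metric of interest is, roughly, the per-read edit distance between each called sequence and its reference. As an illustration only (this is not the tracer-eval implementation; the function names here are made up), an error rate can be computed like this:

```python
from __future__ import print_function

def edit_distance(called, reference):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = range(len(reference) + 1)
    for i, c in enumerate(called, 1):
        cur = [i]
        for j, r in enumerate(reference, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (c != r)))  # substitution
        prev = cur
    return prev[-1]

def error_rate(called, reference):
    """Errors per reference base, e.g. 0.13 for a 13% single-pass error rate."""
    return float(edit_distance(called, reference)) / len(reference)

print(error_rate("TTTTA", "TCAGCCGAACGAAGTCGCGATG"))  # close to 1.0 for a bad decode
```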
As development progresses, we hope to see the quality of the decodings improve. Here are some of the current, rather terrible, decodings. Obviously there's a long way to go.
1 layer, 10 neurons per layer, 5min
decoded: ACAAAAA
correct: TCAGCCGAACGAAGTCGCGATGCAGCCCAGTGGGATGAAACGGTCGATCGGCTCTCTACGCTACTTGAGATTAAAAAGATTTGGTGTGAGGTTGCTCGGTTTAGGTCTAC
1 layer, 10 neurons per layer, 5min
decoded: TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
correct: AATCGGGGAGACCTGCGCTTGTCGGCGCTCGTACACGATTTTTCTTACGAGCATGTTATTCGACGCCAGACATGAAGATTTCGGGATCGCTCGAAGTCTATTCAAAGTGA
3 layers, 256 neurons per layer, ~2h
decoded: TTTTA
correct: TCAGCCGAACGAAGTCGCGATGCAGCCCAGTGGGATGAAACGGTCGATCGGCTCTCTACGCTACTTGAGATTAAAAAGATTTGGTGTGAGGTTGCTCGGTTTAGGTCTAC
With respect to the original goal of improving the basecall error rate beyond the current state-of-the-art single-pass error rate of 13%, this experiment is so far not a success.
tracer is released under the Apache License 2.0. See LICENSE. The majority of the code consists of modifications of the seq2seq example from the TensorFlow library, which is covered by its LICENSE. If you have any suggestions about how to more appropriately provide attribution on the individual source files, let me know. I'm unsure, for example, whether the original copyright notice should be retained on each file.