Version 1.3 release. This release includes updates to signal normalization procedure (including sequence-dependent iterative re-scaling) and an outlier-robust alternative model comparison method. These updates drastically increase the accuracy of Tombo modified base predictions. Many other computational optimizations and bug fixes. Added 5mC RNA model (fixes #50). Updated RNA model to 180mV model (fixes #63). Increased read filtering capabilities. Re-factored Tombo commands into command groups.
nanoporetech · May 23, 2018 · 24eac94 · 24eac94
1 parent dc247fa
commit 24eac94
Showing 41 changed files with 4,406 additions and 2,808 deletions.
diff --git a/README.rst b/README.rst
diff --git a/docs/_images/outlier_robust_llr.gif b/docs/_images/outlier_robust_llr.gif
diff --git a/docs/_images/per_read_stat_dist.png b/docs/_images/per_read_stat_dist.png
diff --git a/docs/_images/roc.png b/docs/_images/roc.png
diff --git a/docs/_images/stat_dist.png b/docs/_images/stat_dist.png
diff --git a/docs/conf.py b/docs/conf.py
@@ -42,7 +42,7 @@
 # General information about the project.
 __pkg_name__ = u'tombo'
 project = __pkg_name__.capitalize()
-copyright = u'2017, Oxford Nanopore Technologies'
+copyright = u'2017-18, Oxford Nanopore Technologies'
 
 # Generate API documentation:
 if subprocess.call(['sphinx-apidoc', '-o', './', "../{}".format(__pkg_name__)]) != 0:

diff --git a/docs/examples.rst b/docs/examples.rst
diff --git a/docs/filtering.rst b/docs/filtering.rst
@@ -2,30 +2,52 @@
 Read Filtering Commands
 ***********************
 
-Read filtering commands can be useful to extract the most out out of a set of reads for modified base detection. Read filtering commands effect only the Tombo index file, and so filters can be cleared or applied iteratively without re-running any re-squiggle analysis. Two filters are currently made available (``filter_stuck`` and ``filter_coverage``).
+Read filtering commands can help extract the most out of a set of reads for modified base detection. Read filtering commands affect only the Tombo index file, so filters can be cleared or applied iteratively without re-running the re-squiggle command. Five filters are currently available (``genome_locations``, ``raw_signal_matching``, ``q_score``, ``level_coverage`` and ``stuck``).
 
-----------------
-``filter_stuck``
-----------------
+---------------------------
+``filter genome_locations``
+---------------------------
 
-The ``filter_stuck`` command aims to remove reads where bases tend to apparently get stuck in the pore for longer durations of time. These reads can be indicative of poor quality reads and thus negatively effect modified base detection.
+The ``filter genome_locations`` command filters out reads falling outside of a specified set of ``--include-regions``. These regions can either be whole chromosomes/sequence records or sub-regions within sequence records.
 
-This filter is based on the number of observations per genomic base along a read. The filter can be set on any number of percentiles of obervations per base. Reasonable values depend strongly on the sample type (DNA or RNA). A reasonable filter for DNA reads would be to filter reads with 99th percentile > 200 obs/base or a maximum base with > 5k obs/base. This filter would be set with the ``--obs-per-base-filter 99:200 100:5000`` option. Larger values should be used for RNA reads.
+------------------------------
+``filter raw_signal_matching``
+------------------------------
+
+The ``filter raw_signal_matching`` command filters out reads with poor matching between raw observed signal and expected signal levels from the canonical base model. Specify a new threshold to apply with the ``--signal-matching-score`` option. These scores are the mean half z-score (absolute value of z-score) taken over all bases of a read. A reasonable range for this threshold is approximately 0.5 to 3. Reads with a larger fraction of modifications may require a larger value to process successfully.
 
--------------------
-``filter_coverage``
--------------------
+------------------
+``filter q_score``
+------------------
 
-The ``filter_coverage`` command aims to filter reads to achieve more even read depth across a genome. This may be useful particularly in canonical and particularly in alternative model estimation. This filter may also help make test statistics more comparable across the genome.
+The ``filter q_score`` command filters out reads with poor mean basecalling quality scores. This value can be indicative of low quality reads. Set this value with the ``--q-score`` option.
+
+-------------------------
+``filter level_coverage``
+-------------------------
+
+The ``filter level_coverage`` command aims to filter reads to achieve more even read depth across a genome/transcriptome. This may be useful in canonical and alternative model estimation. This filter may also help make test statistics more comparable across the genome.
 
 This filter is applied by randomly selecting reads weighted by the approximate coverage at the mapped location of each read. The number of reads removed from downstream processing is defined by the ``--percent-to-filter`` option.
 
 This filter is likely to be more useful for PCR'ed samples, where duplicate locations are more likely to accumulate and cause large spikes in coverage.
 
------------------
-``clear_filters``
------------------
+----------------
+``filter stuck``
+----------------
+
+The ``filter stuck`` command aims to remove reads where bases tend to get stuck in the pore for longer durations of time. These reads can be indicative of poor quality reads and thus negatively affect modified base detection.
+
+This filter is based on the number of observations per genomic base along a read. The filter can be set on any number of percentiles of observations per base. Reasonable values depend strongly on the sample type (DNA or RNA). A reasonable filter for DNA reads would be to filter reads with 99th percentile > 200 obs/base or a maximum base with > 5k obs/base. This filter would be set with the ``--obs-per-base-filter 99:200 100:5000`` option. Larger values should be used for RNA reads.
 
-The ``clear_filters`` simply removes any applied filters to this sample (failed reads from the re-squiggle command will still not be included). New filters can then be applied to this set of reads.
+------------------------
+``filter clear_filters``
+------------------------
+
+The ``filter clear_filters`` command removes any filters applied to this sample (including those applied during the ``resquiggle`` command, though reads that failed before signal to sequence assignment will not be included). New filters can then be applied to this set of reads.
 
 All Tombo sub-commands will respect the filtered reads when parsed for processing.
+
+.. hint::
+
+   Save a set of filters for later use by copying the Tombo index file: ``cp path/to/native/rna/.fast5s.RawGenomeCorrected_000.tombo.index save.native.tombo.index``. To re-set to a set of saved filters after applying further filters simply replace the index file: ``cp save.native.tombo.index path/to/native/rna/.fast5s.RawGenomeCorrected_000.tombo.index``.
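
The two numeric criteria described in the ``filtering.rst`` hunk above (the mean half z-score used by ``filter raw_signal_matching`` and the percentile thresholds passed to ``--obs-per-base-filter``) can be illustrated with a minimal sketch. This is not Tombo's implementation: the function names are hypothetical, the nearest-rank percentile is an assumption, and Tombo may compute both quantities differently.

```python
import math


def mean_half_z_score(observed, expected_means, expected_sds):
    """Mean absolute z-score over all bases of one read (hypothetical helper)."""
    abs_z = [abs(obs - mean) / sd
             for obs, mean, sd in zip(observed, expected_means, expected_sds)]
    return sum(abs_z) / len(abs_z)


def percentile(values, pctl):
    """Nearest-rank percentile for 0 < pctl <= 100 (an assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pctl * len(ordered) / 100.0))
    return ordered[rank - 1]


def parse_obs_filter(specs):
    """Parse 'percentile:threshold' pairs such as ['99:200', '100:5000']."""
    return [(float(p), float(t)) for p, t in (s.split(":") for s in specs)]


def is_stuck(obs_per_base, obs_filter):
    """True if any requested percentile of obs/base exceeds its threshold."""
    return any(percentile(obs_per_base, p) > t for p, t in obs_filter)


if __name__ == "__main__":
    # A read whose signal sits 1, 1 and 2 standard deviations from expectation.
    print(mean_half_z_score([1.0, -1.0, 2.0], [0.0] * 3, [1.0] * 3))
    # One base with 6000 observations trips the 100th percentile (max) threshold.
    dna_filter = parse_obs_filter(["99:200", "100:5000"])
    print(is_stuck([10] * 99 + [6000], dna_filter))
```

Under these assumptions, a read would be dropped by the signal matching filter when its score exceeds the chosen threshold, and by the stuck filter when any listed percentile exceeds its limit.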
diff --git a/docs/index.rst b/docs/index.rst
@@ -31,19 +31,63 @@ Basic tombo installation (python 2.7 and 3.4+ support)
 
 See :doc:`examples` for common workflows.
 
--------------
-Documentation
--------------
+===========
+Quick Start
+===========
 
-Run ``tombo -h`` to see all Tombo sub-commands and run ``tombo [sub-command] -h`` to see the options for any Tombo sub-command.
+Call 5mC and 6mA sites from raw nanopore read files. Then output a genome browser `wiggle format file <https://genome.ucsc.edu/goldenpath/help/wiggle.html>`_ for the 5mC calls and plot raw signal around the most significant 6mA sites.
 
-Detailed documentation for all Tombo algorithms and sub-commands can be found through the links here.
+::
+
+   # skip this step if FAST5 files already contain basecalls
+   tombo preprocess annotate_raw_with_fastqs --fast5-basedir path/to/fast5s/ \
+       --fastq-filenames basecalls1.fastq basecalls2.fastq \
+       --sequencing-summary-filenames seq_summary1.txt seq_summary2.txt \
+       --processes 4
+   
+   tombo resquiggle path/to/fast5s/ genome.fasta --processes 4
+   tombo detect_modifications alternative_model --fast5-basedirs path/to/fast5s/ \
+       --statistics-file-basename sample.alt_modified_base_detection \
+       --per-read-statistics-basename sample.alt_modified_base_detection \
+       --processes 4
+   
+   # produces sample.alt_modified_base_detection.5mC.dampened_fraction.[plus|minus].wig files
+   tombo text_output --statistics-filename sample.alt_modified_base_detection.5mC.tombo.stats \
+       --browser-file-basename sample.alt_modified_base_detection.5mC --file-types dampened_fraction
+   
+   # plot raw signal at most significant locations
+   tombo plot most_significant --fast5-basedirs path/to/fast5s/ \
+       --statistics-filename sample.alt_modified_base_detection.6mA.tombo.stats \
+       --plot-standard-model --plot-alternate-model 6mA \
+       --pdf-filename sample.most_significant_6mA_sites.pdf
+
+Detect any deviations from expected signal levels for canonical bases to investigate any type of modification.
+
+::
+
+   tombo resquiggle path/to/fast5s/ genome.fasta --processes 4
+   tombo detect_modifications de_novo --fast5-basedirs path/to/fast5s/ \
+       --statistics-file-basename sample.de_novo_modified_base_detection \
+       --per-read-statistics-basename sample.de_novo_modified_base_detection \
+       --processes 4
+   
+   # produces sample.de_novo_modified_base_detection.dampened_fraction.[plus|minus].wig files
+   tombo text_output --statistics-filename sample.de_novo_modified_base_detection.tombo.stats \
+       --browser-file-basename sample.de_novo_modified_base_detection --file-types dampened_fraction
+
+.. note::
+
+   All of these commands work for RNA data as well, but a transcriptome reference sequence must be provided for spliced transcripts.
+
+   Run ``tombo -h`` to see all Tombo command groups, run ``tombo [command-group] -h`` to see all commands within each group and run ``tombo [command-group] [command] -h`` for help with arguments to each Tombo command.
+
+   Detailed documentation for all Tombo algorithms and commands can be found through the links here.
 
 ------
 Naming
 ------
 
-Tombo Ahi is a Japanese name for albacore (which is also the Oxford Nanopore Technologies basecaller). So use albacore to identify canonical bases and then use Tombo to detect more exotic, non-canonical bases.
+Tombo Ahi is a Japanese name for albacore (the name of the Oxford Nanopore Technologies basecaller). So use albacore to identify canonical bases and then use Tombo to detect more exotic, non-canonical bases.
 
 --------
 Contents
@@ -58,8 +102,8 @@ Contents
    text_output
    plotting
    filtering
-   model_training
    rna
+   model_training
 
 -------------------------
 Full API reference (beta)