                .ooooo.   .ooooo.   .oooo.o ooo. .oo.  .oo.    .ooooo.  
               d88' `"Y8 d88' `88b d88(  "8 `888P"Y88bP"Y88b  d88' `88b 
               888       888   888 `"Y88b.   888   888   888  888   888 
               888   .o8 888   888 o.  )88b  888   888   888  888   888 
               `Y8bod8P' `Y8bod8P' 8""888P' o888o o888o o888o `Y8bod8P' 
                                                              ver 0.5.1

Cosmo

Version: 0.5.1

Cosmo is a fast, low-memory DNA assembler that uses a succinct de Bruijn graph.

Usage

After compiling, you can run Cosmo like so:

$ pack-edges <input_file> # this adds reverse complements and dummy edges, and packs them
$ cosmo-build <input_file>.packed # compresses and builds indices
$ cosmo-assemble <input_file>.packed.dbg # output: <input_file>.packed.dbg.fasta # NOT IMPLEMENTED YET

Here, input_file is the binary output of a DSK run. Each program has a --help option that describes its usage in more detail.
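The first step adds reverse complements to the input k-mers. As a rough illustration of what that means (this is a minimal Python sketch, not the code in pack-edges; the function names are hypothetical):

```python
# Illustrative only: these helpers are not part of Cosmo.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(kmer):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return kmer.translate(COMPLEMENT)[::-1]

def with_reverse_complements(kmers):
    """Return the sorted set of k-mers together with their reverse complements."""
    return sorted(set(kmers) | {reverse_complement(k) for k in kmers})

print(with_reverse_complements(["ACG", "GGA"]))
# ['ACG', 'CGT', 'GGA', 'TCC']
```

Adding both strands means a sequence can be found in the graph regardless of which strand a read came from.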

Colored de Bruijn graph usage:

$ cosmo-pack -c kmer_counts.ctx # read cortex binary file format of kmer counts, writes .colors file
$ pack-color  [-o <output_prefix>] [--] [--version] [-h] <input_file>  <num colors> # convert a "color file" (a sequence of 64 bit ints, one per edge) to an SDSL::rrr_vector
$ cosmo-color  [-b <color_mask2>] [-a <color_mask1>] [-o <output_prefix>] [--] [--version] [-h] <input_file>  <color_file> # reads rrr
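The "color file" above is described as a sequence of 64 bit ints, one per edge. A minimal Python sketch of writing and reading such a sequence follows. The byte order (little-endian) and the meaning of each bit (bit i set = edge present in color i) are assumptions made for illustration; check the pack-color source before relying on either.

```python
import struct

def encode_colors(masks):
    """Serialize one unsigned 64-bit integer per edge (little-endian is an assumption)."""
    return b"".join(struct.pack("<Q", m) for m in masks)

def decode_colors(data):
    """Inverse of encode_colors: recover the list of per-edge integers."""
    return [m for (m,) in struct.iter_unpack("<Q", data)]

# Hypothetical 3-edge graph with 2 colors.
# Assumed convention: bit i of an entry is set when the edge occurs in color i.
masks = [0b01, 0b10, 0b11]
blob = encode_colors(masks)
print(len(blob))            # 24 (3 edges * 8 bytes)
print(decode_colors(blob))  # [1, 2, 3]
```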

Practical example:

$ cd /s/oak/b/nobackup/muggli/src/CORTEX_release_v1.0.5.21/demo/example4_using_reference_genome_to_exclude_paralogs
$ ../../bin/cortex_var_31_c2 --kmer_size 17 --colour_list colours  --dump_binary both.ctx
$ cd ~/git/cosmo
$ ./cosmo-pack -c /s/oak/b/nobackup/muggli/src/CORTEX_release_v1.0.5.21/demo/example4_using_reference_genome_to_exclude_paralogs/both.ctx
$ ./pack-color both.ctx.colors 2
$ ./cosmo-color both.ctx.packed both.ctx.colors.rrr

Caveats

Here are some things that you don't want to let surprise you:

DSK Only

Currently Cosmo only supports DSK files with k <= 64 (that is, k-mer blocks of 128 bits or fewer). Support is planned for DSK files with larger k, and possibly for output from other k-mer counters.
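The 128 bit figure follows from storing each nucleotide in 2 bits: 64 symbols * 2 bits = 128 bits. A small Python sketch of that arithmetic (the A=0, C=1, G=2, T=3 code assignment is an assumption for illustration, not necessarily the encoding DSK or Cosmo uses):

```python
# Assumed 2-bit code per nucleotide; the real mapping may differ.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_kmer(kmer):
    """Pack a DNA string into an integer, 2 bits per symbol, first symbol most significant."""
    value = 0
    for base in kmer:
        value = (value << 2) | CODES[base]
    return value

def bits_needed(k):
    """Number of bits required to store a k-mer at 2 bits per symbol."""
    return 2 * k

print(pack_kmer("ACGT"))                  # 27 (binary 00 01 10 11)
print(bits_needed(64))                    # 128
print(pack_kmer("T" * 64).bit_length())   # 128
```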

Definition of "k-mer"

Note that since our graph is edge-based, k defines the length of our edges, so our nodes are only k-1 symbols long. If you want to construct a succinct de Bruijn graph where the nodes are k-mers, you will need to run DSK with k set to k+1. For example, the output of $ dsk <input_file> 27 will actually build a 26-dimensional de Bruijn graph.
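To make the edge/node relationship concrete, here is a tiny Python sketch (the helper name is hypothetical) that splits a k-mer edge into the two (k-1)-length nodes it connects:

```python
def edge_to_nodes(kmer):
    """Split a k-mer (an edge) into its source and target nodes, each k-1 symbols long."""
    return kmer[:-1], kmer[1:]

source, target = edge_to_nodes("ACGT")  # k = 4
print(source, target)                    # ACG CGT

# With k = 27, as in the dsk example, the nodes are 26 symbols long.
print(len(edge_to_nodes("A" * 27)[0]))   # 26
```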

Note: Both even and odd k values should work with this assembler due to our loop-immune traversal.

Furthermore, most de Bruijn graph based assemblers add an edge between every pair of nodes that overlap. Instead, we take the k-mers themselves as our edges (each joining two nodes of length k-1), so we only have edges that were directly represented in the read set. This makes more sense to us, as it reduces unnecessary branching. I may add support for the standard way in the future if anyone wants it (it would be similar to the dummy edge adding code).
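The difference between the two conventions can be seen on a toy input. The Python sketch below is only an illustration under simplifying assumptions (the function names are hypothetical, and the overlap-based variant is a simplified stand-in for what other assemblers do): it builds both edge sets over the same (k-1)-mer nodes and prints the edge that only the overlap rule produces.

```python
from itertools import product

def kmer_edges(kmers):
    """Edge-based convention: exactly one edge per observed k-mer."""
    return {(k[:-1], k[1:]) for k in kmers}

def overlap_edges(kmers):
    """Overlap convention: an edge wherever two nodes overlap by k-2 symbols."""
    nodes = {k[:-1] for k in kmers} | {k[1:] for k in kmers}
    return {(u, v) for u, v in product(nodes, repeat=2) if u[1:] == v[:-1]}

kmers = ["ACG", "CGT", "TCG"]  # k = 3, so nodes are 2-mers
extra = overlap_edges(kmers) - kmer_edges(kmers)
print(sorted(extra))  # [('GT', 'TC')]
```

The extra edge spells GTC, a k-mer that never appeared in the input, which is the kind of branch the edge-based definition avoids.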

Graph Traversal

We currently only output the unitigs (paths between branching nodes).
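As a rough picture of what a unitig is, the following Python sketch walks a plain dictionary-based graph built from k-mer edges and spells out each maximal non-branching path. It is not how Cosmo traverses its succinct representation; the function name is hypothetical, and isolated cycles are deliberately ignored to keep the example short.

```python
def unitigs(kmers):
    """Spell the paths between branching (or terminal) nodes of an edge-based de Bruijn graph."""
    out_edges, in_degree = {}, {}
    for kmer in sorted(set(kmers)):
        out_edges.setdefault(kmer[:-1], []).append(kmer[1:])
        in_degree[kmer[1:]] = in_degree.get(kmer[1:], 0) + 1

    def is_internal(node):
        # Exactly one edge in and one edge out: the path passes straight through.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, [])) == 1

    result = []
    for start in sorted(set(out_edges) | set(in_degree)):
        if is_internal(start):
            continue  # unitigs begin only at branching or terminal nodes
        for nxt in out_edges.get(start, []):
            path = start + nxt[-1]
            while is_internal(nxt):
                nxt = out_edges[nxt][0]
                path += nxt[-1]
            result.append(path)
    return result

print(unitigs(["ACG", "CGT", "CGA", "GTT"]))
# ['ACG', 'CGA', 'CGTT']
```

Node CG branches (CGA and CGT both follow it), so the path from AC stops there and two new unitigs start from it.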

Compilation

There is an included Makefile - just type make to build it (assuming you have the dependencies listed below). To build with "Variable order mode", use the varord=1 flag.

Dependencies

  • A compiler that supports C++11,
  • Boost - ranges and range algorithms, zip iterator, tuple comparison, lots of good stuff,
  • SDSL-lite - low level succinct data structures (For now you will have to use my branch if you want to use variable order graphs: clone this and checkout the develop branch before compiling),
  • TClap - command line parsing,
  • DSK - k-mer counting (we need this for input),
  • Optionally (for developers): Python and NumPy - rebuilding the lookup tables,
  • STXXL - external merging (not actually required yet though)

Many of these are installable with a package manager (e.g. (apt-get | yum | brew ) install boost libstxxl tclap). However, you will have to download and build DSK and SDSL-lite manually.

Authors

Implemented by Alex Bowe. Original concept and prototype by Kunihiko Sadakane.

These people also proved incredibly helpful: Rayan Chikhi, Simon Puglisi, Travis Gagie, Christina Boucher, Simon Gog, Dominik Kempa.

Contributing

Your help is more than welcome! Please fork and send a pull request, or contact me directly :)

Why "Cosmo"?

Cosmos /ˈkɑz.moʊs/ (n) : "An ordered, harmonious whole."

If that doesn't suit an assembly program then I don't know what does. The last s was dropped because it was nicer to say. Furthermore, it is a reference to the Seinfeld character Cosmo Kramer (whose last name I'm often reminded of while working on this stuff).

License

This software is copyright (c) Alex Bowe 2014 (bowe dot alexander at gmail dot com). It is released under the GNU General Public License (GPL) version 3.