=======================================================
Module for finding Solid (Unique) Kmers from genomic reads-data.
- Either Mac OS X or Linux are currently supported.
- gcc 4.8+ or clang 3.4+
- cmake 3.8+
- KMC3
- Can easily be installed using conda as follows:
conda install -c bioconda kmc
- KMC should be in the path (
$PATH
) for Suk to work.
CmakeLists is provided in the project root folder.
Run the following commands to build a library suk.a
in build/lib
:
git clone --recursive https://github.com/Ritu-Kundu/suk suk
cd suk
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Doptimise_for_native=ON ..
make
Run the following commands to build a binary (executable) suk
in build/bin
:
git clone --recursive https://github.com/Ritu-Kundu/suk suk
cd suk
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release -Dsuk_build_executable=ON -Doptimise_for_native=ON ..
make
Notes:
- If
--recursive
was omitted fromgit clone
, please rungit submodule init
andgit submodule update
before compiling. - If target machine is different from the one on which Hypo is being compiled, exclude the flag
-Doptimise_for_native=ON
.
suk <args>
Mandatory args:
-i, --input <str> Input file name containing reads (in fasta/fastq format; can be compressed). A list of files containing file names in each line can be passed with @ prefix.
-k, --kmer-len <int> The length of the kmer (must be positive).
Optional args:
-o, --output-prefix <str> Prefix of the output filename. [DEAFULT: SUK_k<x> where x is the value of k]
-t, --threads <int> Number of threads. [DEFAULT: 1]
-d, --dump-txt Dumps (in addition to the bit-vector) the unique kmers into textfile (<prefix>.txt where prefix is the value of output-prefix). [DEFAULT: False].
-e, --exclude-hp Excludes the kmers having homopolymer terminals (see README.md). [DEFAULT: False].
-c, --coverage <int> Expected coverage of the reads. KMC counter will be set to 4 x expected coverage. [DEFAULT: 50].
-m, --kmc-memory <int> Max memory usage of KMC (in GB). [DEFAULT: 12].
-h, --help Prints the usage.
- .bv file will be created that stores the bit-vector representing the unique kmers. If no
--output-prefix
(or-o
) is provided,SUK_k<x>.bv
will be created where<x>
is the provided value ofk
. - If text-dump has been activated (using
-d
or--dump-txt
), .txt file will also be created that contains the unique kmers (one in each line).
The tool can be run on the sample data (EColi K12) supplied in the sample
folder, as follows (assuming you are in build
folder):
./bin/suk -k 15 -i @../sample/illumina_files.txt
Here, there are two files for Illumina paired-end reads: SRR1770413_1.fastq and SRR1770413_2.fastq. The names of these two files are written in illumina_files.txt
. The tool will generate and store the bit-vector representing unique kmers of length 15 in the file SUK_k15.bv
in the folder from which the tool has been run.
To dump the kmers in text-format, run the following:
./bin/suk -d -k 15 -i @../sample/illumina_files.txt
This will generate an additional file SUK_k15.txt
in the folder from which the tool has been run.
You can use Suk in your project by linking suk
library (See Building Library (above)
).
An example usage of the library is as follows:
#include "suk/SolidKmers.hpp"
int main(int argc, char **argv) {
uint32_t threads = 1;
unsigned int k = 7;
uint32_t max_memory = 12;
uint32_t coverage = 50;
std::vector<std::string> read_filenames;
read_filenames.push_back("file1"); // first file containing reads
suk::SolidKmers sk(k);
sk.initialise(read_filenames,threads,max_memory, coverage, false);
// Store the bit-vector into a file
sk.store("suk.bv");
// Load the bit-vector from some previously stored file
// suk::SolidKmers sk2(k);
// sk2.load("pvs_stored_file.bv");
// Output the solid unique kmers in text format
sk.dump_txt("suk.txt");
/* Check if an encoded kmer is a solid unique kmer */
uint64_t encoded_kmer = 4u;
if (sk.is_solid(encoded_kmer)) {std::cout << "Kmer is present\n";}
/* Check if a kmer (in text format) is a solid unique kmer */
std::string kmer("AACGTTT"); // kmer of length 7 (=k)
if (sk.is_solid(kmer)) {std::cout << "Kmer is present\n";}
return 0;
}
If you are using CMake for compilation, we recommend adding suk
as a git submodule with the command git submodule add https://github.com/Ritu-Kundu/suk external/suk
. Afterwards, modify your top level CMakeLists.txt
file accordingly:
add_subdirectory(external/suk EXCLUDE_FROM_ALL)
target_link_libraries(your_exe suk)
Add #include "suk/SolidKmers.hpp"
to access the suk API in your source file.
A kmer is a substring of length equal to k. DNA alphabet has four letters/bases:"ACGT". Therefore, there are 4^k
number of distinct kmers possible. DNA alphabet can be represented by 2 bits per base: 00
, 01
, 10
, 11
. Thus each kmer can be represented in 2k
bits. If we encode each base of a kmer into bits, it represents a unique number. Consequently, we have one-to-one correspondence between kmers and numbers of 2k
bits. For example, let k=2.
k=2
# distinct kmers: 4^k = 4^2 = 16
distinct numbers of 2k bits: 2^(2k) = 2^4 = 16
kmers: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
Encoded kmers: 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 11111
Corres Number: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Thus, we can show the collection of kmers appearing in data as a bit-vector of length 2k
. Each index of the vector represents a kmer; if a kmer is present, we set the bit at the corresponding index; otherwise, reset.
The solid (unique) kmers are those that are expected to appear exactly once in the genome from which reads have sequenced. These can be determined with the help of frequency distribution (histogram) of kmers in the reads. More about the kmer-distribution of a typical genome can be read here.
A homopolymer is a sequence consisting of the same letter repeated multiple times (more than once), e.g. "AAA". Here, a kmer starts in a homopolymer if the first base is same as the second base. Similarly, a kmer ends in a homopolymer if the last base is same as the second-last base. If exclude-homopolymer filter is on (using --exclude-hp
or -e
), we ignore the kmers that start or end in homopolymers.
- The tool produces the frequency histogram (using KMC3).
- For only the kmers with frequency in the range [2,4*coverage]
- It then finds out the threshold frequencies ([lower,upper]) from the histogram so that the kmers with frequency appearing in the range between the thresholds (inclusive) are expected to be unique.
- It first finds out the threshold (error-threshold) such that the kmers with frequency below it are most probably the ones containing errors. Data with frequency below error-threshold is discarded from further consideration.
- Next, it finds out the global-maxima. The frequency at the global-maxima will be taken as the mean-coverage.
- Next, it computes the lower cut-off by scanning on the left of maxima in a look-forward sliding window fashion: Lower cut-off will be a frequency where the number of kmers is lower than most of those with frequency in an adjacent window (of size 5).
- Finally, the upper threshold is determined in the same fashion as that of the lower-threshold but on the right-side of the global maxima.
- There may be some cases where the upper threshold could not be determined in this fashion (because of no uphill curve after maxma); Plan B for such cases is to find the first platue; it works as follows: For every point take average of change (%) from the current value in a window of next 5 values (considering only decreased values). Run moving average (considering a window of size 5) on this range and take the frequency corresponding to the first minimum moving average.
- Lastly, it sets the bits corresponding to all kmers (computed by ntHits) that have frequency <=upper-threshold.
- Before setting, it checks whether the kmer has HomoPolymer ends (the first and second bases are the same or the last and the second last bases are the same). It does not include kmer which has Homopolymer ends, if exclude-homopolymer filter is on (using
--exclude-hp
or-e
).
- Before setting, it checks whether the kmer has HomoPolymer ends (the first and second bases are the same or the last and the second last bases are the same). It does not include kmer which has Homopolymer ends, if exclude-homopolymer filter is on (using
The tool
- Assumes the data to be consisting of DNA alphabet:
A
,C
,G
,T
,N
; (Case insensitive) - Ignores kmers containing
N
. - Kmer-computation is in canonical form (lexicographically smaller amongst the kmer and its reverse-complement). But in the bit-vector, both the solid (unique) kmer and its reverse-complement are set.