By Abolfazl Hashemi, Banghua Zhu, and Haris Vikalo.
- Introduction
- Citation
- MATLAB implementation
- Python implementation
- Input file format
- Output file format
AltHap is a haplotype assembly method for diploid, polyploid, and polyallelic polyploid organisms. The method formulates the haplotype assembly problem as a sparse tensor decomposition problem and exploits the special structure of the factors. AltHap is implemented in both Python and MATLAB.
Note
The current implementations assume that the input file contains a single connected component. A preprocessing step to handle disconnected components will be added in the near future.
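Until that preprocessing step exists, it may be useful to check an input yourself before running AltHap. The sketch below is not part of AltHap. It assumes that two SNP positions belong to the same component when some read covers both of them, and that each read is given as a collection of the SNP positions it covers. The function name is hypothetical.

```python
def is_single_component(read_positions):
    """Return True if all covered SNP positions are linked through shared reads.

    read_positions: iterable of collections, one per read, each holding the
    SNP positions covered by that read (an assumed input representation).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for positions in read_positions:
        positions = list(positions)
        for p in positions:
            find(p)  # register the position even if the read covers only one SNP
        for p in positions[1:]:
            union(positions[0], p)  # a read links every SNP position it covers

    roots = {find(p) for p in list(parent)}
    return len(roots) <= 1
```

If the function returns False, the fragments span more than one component and would need to be split into separate input files by hand.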
If you use AltHap in your research, please cite:
@article{hashemi2017sparse,
title={Sparse Tensor Decomposition For Haplotype Assembly Of Diploids And Polyploids},
author={Hashemi, Abolfazl and Zhu, Banghua and Vikalo, Haris},
journal={bioRxiv},
pages={130930},
year={2017},
publisher={Cold Spring Harbor Labs Journals}
}
- Usage in terminal:
matlab -nojvm -r "AltHap(ploidy,'input_file.txt','output_file.txt');exit"
The first two inputs are mandatory. If output_file is not given, the results will be saved in AltHap-output.txt.
Please refer to the source code for a detailed explanation of the other options, which are mainly related to the stopping criteria.
Please refer to the end of this file for a detailed explanation of the input and output file formats.
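When AltHap is called from a script, the terminal command above can be assembled programmatically. The following is a minimal sketch, assuming that matlab is on the PATH and that AltHap.m is visible to MATLAB. The helper name build_althap_command is hypothetical and not part of AltHap. It only uses the flags and the call syntax shown above.

```python
def build_althap_command(ploidy, input_file, output_file=None):
    """Build the argument list for the MATLAB call shown in the usage line.

    output_file is optional, mirroring the MATLAB interface: when omitted,
    AltHap writes to its default output file.
    """
    args = [str(int(ploidy)), "'{}'".format(input_file)]
    if output_file is not None:
        args.append("'{}'".format(output_file))
    call = "AltHap({});exit".format(",".join(args))
    return ["matlab", "-nojvm", "-r", call]


if __name__ == "__main__":
    # Prints the command instead of running it.
    print(" ".join(build_althap_command(3, "sim0.txt")))
```

The returned list can be passed to subprocess.run(cmd, check=True) to launch MATLAB.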
- Run a simple example:
In this example, sim0.txt is the input fragment file, ploidy = 3, and the results are saved in AltHap-output.txt.
There are 3 ways to run this example:
- In terminal write:
matlab -nojvm -r "AltHap(3,'sim0.txt');exit"
- In terminal write:
matlab -nojvm -nodesktop < ./simple_example.m
- Run the AltHapSimple.sh file in terminal. This file then runs the simple_example.m file.
The input file format is as follows:
Number of reads
Number of columns
Number_of_contiguous_segments Read_identifier Position_of_first_SNP_segment Continuous_bases_in_read Position_of_next_SNP_segment Continuous_bases_in_read ..... Quality_scores (in fastq format)
- Example for Biallelic:
5568
22801
2 chr22_SPA9_8733 2 0 5 0 ==
1 chr22_SPH2_1940 3 100 C==
- Example for Polyallelic:
4500
1000
2 chr3_1 1 13103 37 1132 IIIIIIIII
2 chr3_2 1 1310321 46 110302 IIIIIIIIIIIII
2 chr3_3 1 3220001 44 3001 IIIIIIIIIII
2 chr3_4 1 322000 40 30023 IIIIIIIIIII
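The input format described above can be read with a short parser. The sketch below is illustrative and is not taken from the AltHap source. It assumes that fields are separated by whitespace, that each allele is a single digit, and that SNP positions are used exactly as written in the file. All function names are hypothetical.

```python
def parse_fragment_line(line):
    """Split one read line into (read_id, segments, quality).

    segments is a list of (start_position, allele_string) pairs.
    """
    tokens = line.split()
    n_segments = int(tokens[0])
    read_id = tokens[1]
    end = 2 + 2 * n_segments
    if len(tokens) < end:
        raise ValueError("too few fields for read {}".format(read_id))
    segments = []
    for i in range(n_segments):
        start = int(tokens[2 + 2 * i])
        alleles = tokens[3 + 2 * i]
        segments.append((start, alleles))
    # The quality string is the last field; tolerate its absence.
    quality = tokens[end] if len(tokens) > end else ""
    return read_id, segments, quality


def read_alleles(segments):
    """Expand segments into a {snp_position: allele} dictionary."""
    calls = {}
    for start, alleles in segments:
        for offset, allele in enumerate(alleles):
            calls[start + offset] = int(allele)
    return calls


def parse_fragment_text(text):
    """Parse a whole fragment file given as a string."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_reads = int(lines[0])    # first line: number of reads
    n_columns = int(lines[1])  # second line: number of columns
    reads = [parse_fragment_line(ln) for ln in lines[2:]]
    return n_reads, n_columns, reads
```

For the first biallelic line above, parse_fragment_line yields two segments starting at positions 2 and 5, each carrying the single allele 0, with quality string "==".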
The output file contains the MEC score, the CPU time, and the recovered haplotypes. The phased haplotypes are printed one SNP position per row, with the alleles in the following order: first haplotype second haplotype third haplotype ...
- Example for Biallelic:
MEC: 353
CPU Time: 38.363051
Recovered Haplotype:
0 1 1
1 0 1
0 0 1
1 1 1
0 1 1
1 0 0
1 0 1
1 1 1
- Example for Polyallelic:
MEC: 51
CPU Time: 28.912201
Recovered Haplotype:
1 1 1
1 0 0
3 3 1
2 2 0
1 3 1
3 0 0
0 0 2
3 2 0
Both examples above have ploidy 3. For ploidy K, each row contains K entries, one for each of the K phased haplotypes.
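To work with the results in Python, the output file can be loaded into one list per haplotype. This is a minimal sketch, assuming the layout shown in the examples: a MEC line, a CPU Time line, a "Recovered Haplotype:" line, and then one row per SNP position with one column per haplotype. The function name parse_althap_output is hypothetical.

```python
def parse_althap_output(text):
    """Return (mec, cpu_time, haplotypes) from the text of an output file.

    haplotypes is a list with one entry per haplotype; each entry lists
    that haplotype's alleles in row order.
    """
    mec = None
    cpu_time = None
    rows = []
    in_matrix = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("MEC:"):
            mec = float(line.split(":", 1)[1])
        elif line.startswith("CPU Time:"):
            cpu_time = float(line.split(":", 1)[1])
        elif line.startswith("Recovered Haplotype"):
            in_matrix = True
        elif in_matrix:
            rows.append([int(tok) for tok in line.split()])
    # Rows are SNP positions; transpose so each list is one haplotype.
    haplotypes = [list(col) for col in zip(*rows)]
    return mec, cpu_time, haplotypes
```

Typical use is to read the file and pass its contents, for example parse_althap_output(open("AltHap-output.txt").read()). The number of returned haplotypes then equals the ploidy.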