diff --git a/doc/paper.org b/doc/paper.org index b721b00..1cf06e4 100644 --- a/doc/paper.org +++ b/doc/paper.org @@ -15,19 +15,20 @@ Email: zilong.dk@gamil.com.}} methods to manipulate it in different programming languages. Static languages like C++ has the superiority of performance, and modern C++ standards strive to provide growing standard library for - developing program quickly and easily. In this work. we present - vcfpp, a C++ API of htslib, providing a rapid and intuitive - interface to manipulate VCF/BCF files safely. The simplicity and - portability makes vcfpp working with scripting language like python - and R easily. An example demonstrates how minimal vcfpp code can - manipulate variant records rapidly. Additionally, another example - demonstrates how vcfpp can be useful to Statisticians analyzing - genetic variants in R. In the Benchmarking, a simple R script using - vcfpp API shows faster speed than script using cyvcf2 when streaming - variant analysis. And in a two-step setting, a R script using vcfpp - has faster speed of both loading VCF data and processing genotypes - than the R scripts using vcfR and data.table. Finally, some useful - command line tools using vcfpp are available publicly. + developing program quickly and easily. In this work, I present + vcfpp, a C++ API of htslib in a single header file, providing an + intuitive interface to manipulate VCF/BCF files rapidly and + safely. An example demonstrates how minimal vcfpp code can + manipulate variant records rapidly. Additionally, the simplicity and + portability makes vcfpp working with scripting language like R + easily. The second example demonstrates how vcfpp can be useful to + Statisticians analyzing genomics variations in R. In the + Benchmarking, a simple R script using vcfpp API shows faster speed + than script using cyvcf2 when streaming variant analysis. And in a + two-step setting, a R script using vcfpp has faster speed of both + loading VCF contents and processing genotypes than the R scripts + using vcfR and data.table. Finally, some useful command line tools + using vcfpp are available. \end{titlepage} #+end_export @@ -43,24 +44,23 @@ efficiently. The C API of htslib provide the full functionalities to manipulate VCF and BCF format for both compressed and uncompressed files. As the C API is hard to use for in-proficient programmers, API of other languages are invented. Existing popular libraries -include vcfR[cite:@brian2017] for R, cyvcf2 for python -[cite:@pedersen2017a], hts-nim [cite:@pedersen2018] for nim and -vcflib for C++ [cite:@garrison2022]. All are useful for -corresponding language community and have both advantages and -disadvantages. In particular, vcflib is both an API and a large -collection of command line tools, with the main disadvantage being -not supporting BCF file. It is noteworthy that many methods written -in C++ designed for large sample size can not even input the BCF -file which is tailed for big data like Syllbable-PBWT -[cite:@wang2023]. The motivation of vcfpp is to offer full -functionalities as htslib and provide simple and safe API in a -single header file that can be easily integrated for writing scripts -quickly in C++, python and R. +include vcfR[cite:@brian2017] for R, cyvcf2 [cite:@pedersen2017a] +for Python, hts-nim [cite:@pedersen2018] for Nim and vcflib +[cite:@garrison2022] for C++. All are useful for corresponding +community, having both advantages and disadvantages. In particular, +vcflib is both an API and a large collection of command line tools, +with the main disadvantage being not supporting BCF files. It is +noteworthy that many methods written in C++ designed for large +sample size can not even input the BCF file which is tailed for big +data like Syllbable-PBWT [cite:@wang2023]. The motivation of vcfpp +is to offer full functionalities as htslib and provide simple and +safe API in a single header file that can be easily integrated for +writing scripts quickly in C++ and R. * Features vcfpp is implemented as a single header file for being easily -intergrated and compiled. There are four core class for manipulating +integrated and compiled. There are four core class for manipulating uncompressed and compressed VCF/BCF as summarized in Table [[tb:class]]. #+caption: vcfpp capabilities and implemented C++ class @@ -76,12 +76,12 @@ uncompressed and compressed VCF/BCF as summarized in Table [[tb:class]]. * Usage -In an effort to demonstrate the power and performance of vcfpp, the -following sections highlight typical VCF analyses and illustrate -commonly used features in vcfpp. Other examples and further details -of the vcfpp can be found at https://github.com/Zilong-Li/vcfpp. +To demonstrate the power and performance of vcfpp, the +following sections illustrate commonly used features of vcfpp and +highlight the portability of vcfpp in R. Other examples and further +details of vcfpp can be found at https://github.com/Zilong-Li/vcfpp. -** Python-like API +** C++ API In this example, we count the number of heterozygous sites for each sample in all records. The following code first includes a @@ -90,16 +90,16 @@ constrained to 3 samples and "chr21" region (line 2), and creates a variant record object associated with the header information in the BCF (line 3). Then it defines several types of objects to collect the results we want (line 4-6). Taking advantage of -generic templates in C++, type of field like genotype can be very -flexible. Then it starts iterating over the BCF file and process -each variant record in the loop (line 7). We ignore variant of -other types (INDEL, SV), with missing genotypes and with QUAL -value smaller than 9, while we can also retrieve more information -of the variant and do filtering (line 9-10). Finally, we count the -heterozygous sites for each diploid sample (line 11-13). The core -is just 13 lines. - -#+caption: Counting heterozygous genotypes per sample on chr21 +generic templates in C++, the type of field like genotype can be +very flexible. Then it starts iterating over the BCF file and +processes each variant record in the loop (line 7). We ignore +variants of other types (INDEL, SV), or with missing genotypes, or +with QUAL value smaller than 9, while we can also retrieve more +information of the variant and do filtering (line 9-10). Finally, +we count the heterozygous sites for each diploid sample (line +11-13). The core is just 13 lines. + +#+caption: Counting the number of heterozygous genotypes for 3 samples on chr21 #+attr_latex: :options captionpos=t #+begin_src C++ -n #include @@ -125,14 +125,16 @@ while(vcf.getNextVariant(var)){ While vcfpp is very simple for writing a C++ program, a single C++ header file can be easily integrated into popular script languages like Python and R etc. The process of finding biological insights -from variants information always involves statistical model, and +from variant information always involves statistical models, and Statisticians prefer analyzing and visualizing data in R. In this -example, we demonstrate how vcfpp can work with R seamlessly by -using Rcpp [cite:@eddelbuettel2011]. In the "vcfpp-r.cpp" file, we -write a simple C++ function to get a list of genotypes and its -quality with only changing the output objects to Rcpp::List. Then -we can easily call the /readbcf/ function to proceed further -statistical tests in R. +example, I demonstrate how vcfpp can work with R seamlessly by +using Rcpp [cite:@eddelbuettel2011]. With the basic knowledge of +C++ and Rcpp, we can write such a simple C++ function to get a +list of genotypes and its quality by only changing the output +objects to Rcpp::List. Then we can call the /readbcf/ function to +proceed further statistical tests in R. This exhibits the +advantage of vcfpp for developing R packages with VCF/BCF as +input. #+caption: vcfpp-r.cpp #+attr_latex: :options captionpos=t @@ -146,7 +148,7 @@ List readbcf(string vcffile) { vcfpp::BcfReader vcf(vcffile); vcfpp::BcfRecord var(vcf.header); vector> GT, GQ; - vector gt, gq; + vector gt, gq; // store a vector of genotype while (vcf.getNextVariant(var)) { var.getGenotypes(gt); var.getFORMAT("GQ",gq); // can get all other tags @@ -167,25 +169,27 @@ out <- readbcf("bcf.gz") * Benchmarking -In addition to simplicity and portability, we show how fast and +In addition to simplicity and portability, I show how fast and efficient scripts using vcfpp can be. In the benchmarking, we -performed the same task as in cyvcf2 paper. As shown in Table -[[tb:counthets]], the "test-cyvcf2.py" is $1.3\times$ slower than the -"test-vcfpp-1.R", and both using little RAM in one-step strategy. As -R packages usually load data into tables first and perform analysis -later, we also made a function to load whole data in R for such -two-step comparison although it is not preferable for this -task. Notably, the "test-vcfpp-2.R" script is only $1.9\times$ -slower compared to $70\times$, $85\times$ slower for "test-vcfR.R" -and "test-fread.R". This is because genotypes made by both vcfR and -fread are characters, which is very slow to parse in R although we -use a faster string library. However, using vcfpp can get integer -vector directly in R for computation. Moreover, regarding the best -practice of data analysis in R, we usually want to inspect part -of the data table first to make further decisions. And vcfpp has the -full functionalities of htslib that is supporting reading BCF, -selecting samples and regions. A quick lookup can be achieved by -passing a region parameter. +perform the same task as the cyvcf2 paper, that is counting +heterozygous genotypes per sample using code in Listing 1. As shown +in Table [[tb:counthets]], the "test-cyvcf2.py" is $1.3\times$ slower +than the "test-vcfpp-1.R", and both use little RAM in a streaming +strategy. As R packages usually load data into tables first and +perform analysis later, I also write a function to load whole VCF +content in R for such two-step comparison although it is not +preferable for this task. Notably, the "test-vcfpp-2.R" script is +only $1.9\times$ slower compared to $70\times$, $85\times$ slower +for "test-vcfR.R" and "test-fread.R". This is because genotypes made +by both vcfR and data.table::fread are characters, which are very +slow to parse in R despite a faster string library was +used. However, using vcfpp can get integer vectors directly in R for +computation. Moreover, regarding the best practice of data analysis +in R, we usually want to inspect part of the data table first to +make further decisions. And vcfpp has the full functionalities of +htslib that is supporting reading BCF, selecting samples and +regions. A quick lookup of VCF content can be achieved by passing a +region parameter. #+caption: Performance of counting heterozygous genotypes per sample in the 1000 Genome Project for chromosome 21. (^) used by /sourceCpp/ function. (*) used by loading data in two-step strategy. #+name: tb:counthets @@ -203,8 +207,8 @@ passing a region parameter. * Discussion -We have developed vcfpp, a fast and flexible C++ API for scripting -high-perfomance genetic variant analyses. Its simplicity and +I have developed vcfpp, a fast and flexible C++ API for scripting +high-performance genetic variant analyses. Its simplicity and portability can be very useful for both developing packages and writing daily used scripts. Many existing packages written in C++ using customized VCF parser can be replaced with vcfpp to offer more @@ -219,11 +223,13 @@ https://github.com/Zilong-Li/vcfpp/tree/main/tools. The latest release of vcfpp.h can be downloaded from https://github.com/Zilong-Li/vcfpp/releases. Scripts for Benchmarking can be found here -https://github.com/Zilong-Li/vcfpp/tree/main/scripts. +https://github.com/Zilong-Li/vcfpp/tree/main/scripts. More helpful +example functions can be found here +https://github.com/Zilong-Li/vcfpp/tree/main/tools. * Acknowledgments -We would like to thank Anders Albrechtsen and Robert W Davies for +I would like to thank Anders Albrechtsen and Robert W Davies for helpful comments. Also, this project is largely inspired by open source projects SeqLib and cyvcf2. diff --git a/readme.org b/readme.org index 08e8b7e..0e5e480 100644 --- a/readme.org +++ b/readme.org @@ -19,23 +19,29 @@ Features: * Table of Contents :TOC:QUOTE: #+BEGIN_QUOTE - [[#installation][Installation]] -- [[#examples][Examples]] +- [[#usage][Usage]] - [[#reading-vcf][Reading VCF]] - [[#writing-vcf][Writing VCF]] -- [[#tools][Tools]] + - [[#variants-operation][Variants Operation]] + - [[#header-operation][Header Operation]] +- [[#rcpp-examples][Rcpp examples]] +- [[#command-line-tools][Command Line Tools]] #+END_QUOTE * Installation -- download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]] and include it in your program. -- make sure you have https://github.com/samtools/htslib installed on your system and the it is in your environment. -* Examples - -You can copy paste the example code into a =example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts -lz -lm -lbz2 -llzma -lcurl= +1. install [[https://github.com/samtools/htslib][htslib]] on your system +2. download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]] +3. put =vcfpp.h= in a folder say =/my/writable/path/= or the same folder as your cpp source file or the system path + +* Usage ** Reading VCF -In this example, we count the number of heterozygous sites for each sample in all records. The core is *just 10 lines*. +In this example, we count the number of heterozygous genotypes for each +sample in all records. You can copy paste the example code into a +=example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts=. +You can replace =-I.= with =-I/my/writable/path/= if you put =vcfpp.h= there. #+begin_src C++ #include "vcfpp.h" @@ -44,13 +50,12 @@ using namespace vcfpp; int main(int argc, char* argv[]) { // read data from 1000 genomes server - BcfReader vcf("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"); + BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"); BcfRecord var(vcf.header); // construct a variant record - vector gt; // genotype can be of bool, char or int type + vector gt; // genotype can be bool, char or int type vector hetsum(vcf.nsamples, 0); while (vcf.getNextVariant(var)) { var.getGenotypes(gt); - // analyze SNP variant with no genotype missingness if (!var.isSNP() || !var.isNoneMissing()) continue; assert(var.ploidy()==2); // make sure it is diploidy for(int i = 0; i < gt.size() / 2; i++) @@ -63,33 +68,78 @@ int main(int argc, char* argv[]) ** Writing VCF -In this example, we read vcf and keep variants with allele freqency AF > 0.05. Meanwhile, we output a new BCF file. +There many ways in vcfpp for writing the VCF/BCF file. + +*** Use an empty template + +Here we constuct an initial BCF with header using VCF4.3 +specification. Next we add meta data in the header and write out +variant record given a string. #+begin_src C++ -#include "vcfpp.h" -using namespace std; -using namespace vcfpp; -int main(int argc, char* argv[]) { - // read data from 1000 genomes server - BcfReader vcf("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"); - BcfRecord var(vcf.header); // construct a variant record - BcfWriter bw("out.bcf", vcf.header); // construct a vcf writer using same vcf header - vector gt; // genotype can be of bool, char or int type - while (vcf.getNextVariant(var)) { - if (!var.isSNP()) continue; // skip other type of variants - var.getGenotypes(gt); - int ac = 0; - for(auto g : gt) ac += g; - if((double)ac / gt.size() > 0.05) continue; - bw.writeRecord(var); - } - return 0; -} +BcfWriter bw("out.bcf.gz", "VCF4.3"); +bw.header.addFORMAT("GT", "1", "String", "Genotype"); +bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)"); +bw.header.addContig("chr20"); // add chromosome +for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s); // add 3 samples +bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0"); #+end_src -** Genotypes +*** Use another VCF as template +In this example, we first read VCF file +=test/test-vcf-read.vcf.gz=. Secondly, we construct an empty variant record +and update the record with the input VCF. Thirdly, we construct a +BcfWriter object using the meta data in the header of the input VCF, +writing out the header and the modified variant record. -* Tools +#+begin_src C++ +BcfReader br("test/test-vcf-read.vcf.gz"); +BcfRecord var(br.header); +br.getNextVariant(var); +BcfWriter bw("out.vcf.gz", br.header); +bw.writeHeader(); +var.setPOS(100001); // update the POS of the variant +bw.writeRecord(var); +#+end_src + +** Variants Operation + +All variants related API can be found [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_record][BcfRecord]]. The commonly used are listed below. + +#+begin_src C++ +BcfReader vcf("bcf.gz"); // construct a vcf reader +BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header +vcf.getNextVariant(var) // get next variant +vector gt; // genotype can be bool, char or int type +var.getGenotypes(gt); // get genotypes for current variant +var.isNoneMissing(); // check if there is missing value after getting genotypes +vector gq; // genotype quality usually is of int type +var.getFORMAT("GQ",gq); // get a vector of genotypes quality +vector pl; // Phred-scaled genotype likelihoods usually is of int type +var.getFORMAT("PL",pl); // get a vector of Phred-scaled genotype likelihoods +float hwe, af; +var.getINFO("HWE",hwe) // get HWE value from INFO +var.getINFO("AF", af) // get AF (allele frequency) value from INFO +int mq; +var.getINFO("MQ",mq) // get MQ (Average mapping quality) value from INFO +vector dp4; // Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases +var.getINFO("DP4", dp4) // get a vector of dp4 value from INFO +var.isSNP(); // check if variant is SNP +var.isSV(); // check if variant is SV +var.isIndel(); // check if variant is indel +var.isMultiAllelic(); // check if variant is indel +#+end_src + +** Header Operation + +All variants related API can be found in [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_header][BcfHeader]]. + +* Rcpp examples + +Here are Rcpp examples + +* Command Line Tools Find more useful algorithms/tools with VCF/BCF input in [[tools]] + diff --git a/scripts/bench.sh b/scripts/bench.sh index f0de18d..c8669ff 100644 --- a/scripts/bench.sh +++ b/scripts/bench.sh @@ -20,8 +20,6 @@ vcffile="1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vc # do the benchmarking in same conda env $gtime -vvv Rscript test-fread.R $vcffile &> test-fread.llog.1 & -$gtime -vvv Rscript test-vcfR.R $vcffile 1 &> test-vcfR.llog.1 & -$gtime -vvv Rscript test-vcfR.R $vcffile 2 &> test-vcfR.llog.2 & $gtime -vvv Rscript test-vcfpp-1.R $vcffile &> test-vcfpp.llog.1 & $gtime -vvv Rscript test-vcfpp-2.R $vcffile &> test-vcfpp.llog.2 & $gtime -vvv python test-cyvcf2.py $vcffile &> test-cyvcf2.llog & diff --git a/vcfpp.h b/vcfpp.h index 9b7e2d5..6b34d30 100644 --- a/vcfpp.h +++ b/vcfpp.h @@ -2,7 +2,7 @@ * @file https://github.com/Zilong-Li/vcfpp/vcfpp.h * @author Zilong Li * @email zilong.dk@gmail.com - * @version v0.1.7 + * @version v0.1.8 * @breif a single C++ file for manipulating VCF * Copyright (C) 2022-2023.The use of this code is governed by the LICENSE file. ******************************************************************************/