update doc

Zilong-Li · Aug 30, 2023 · 42c4256 · 42c4256
1 parent b0d6b53
commit 42c4256
Show file tree

Hide file tree

Showing 4 changed files with 163 additions and 108 deletions.
diff --git a/doc/paper.org b/doc/paper.org
@@ -15,19 +15,20 @@ Email: [email protected].}}
   methods to manipulate it in different programming languages. Static
   languages like C++ has the superiority of performance, and modern
   C++ standards strive to provide growing standard library for
-  developing program quickly and easily. In this work. we present
-  vcfpp, a C++ API of htslib, providing a rapid and intuitive
-  interface to manipulate VCF/BCF files safely. The simplicity and
-  portability makes vcfpp working with scripting language like python
-  and R easily. An example demonstrates how minimal vcfpp code can
-  manipulate variant records rapidly. Additionally, another example
-  demonstrates how vcfpp can be useful to Statisticians analyzing
-  genetic variants in R. In the Benchmarking, a simple R script using
-  vcfpp API shows faster speed than script using cyvcf2 when streaming
-  variant analysis. And in a two-step setting, a R script using vcfpp
-  has faster speed of both loading VCF data and processing genotypes
-  than the R scripts using vcfR and data.table. Finally, some useful
-  command line tools using vcfpp are available publicly.
+  developing program quickly and easily. In this work, I present
+  vcfpp, a C++ API of htslib in a single header file, providing an
+  intuitive interface to manipulate VCF/BCF files rapidly and
+  safely. An example demonstrates how minimal vcfpp code can
+  manipulate variant records rapidly. Additionally, the simplicity and
+  portability makes vcfpp working with scripting language like R
+  easily. The second example demonstrates how vcfpp can be useful to
+  Statisticians analyzing genomics variations in R. In the
+  Benchmarking, a simple R script using vcfpp API shows faster speed
+  than script using cyvcf2 when streaming variant analysis. And in a
+  two-step setting, a R script using vcfpp has faster speed of both
+  loading VCF contents and processing genotypes than the R scripts
+  using vcfR and data.table. Finally, some useful command line tools
+  using vcfpp are available.
 \end{titlepage}
 #+end_export
 
@@ -43,24 +44,23 @@ efficiently. The C API of htslib provide the full functionalities to
 manipulate VCF and BCF format for both compressed and uncompressed
 files. As the C API is hard to use for in-proficient programmers,
 API of other languages are invented. Existing popular libraries
-include vcfR[cite:@brian2017] for R, cyvcf2 for python
-[cite:@pedersen2017a], hts-nim [cite:@pedersen2018] for nim and
-vcflib for C++ [cite:@garrison2022]. All are useful for
-corresponding language community and have both advantages and
-disadvantages. In particular, vcflib is both an API and a large
-collection of command line tools, with the main disadvantage being
-not supporting BCF file. It is noteworthy that many methods written
-in C++ designed for large sample size can not even input the BCF
-file which is tailed for big data like Syllbable-PBWT
-[cite:@wang2023]. The motivation of vcfpp is to offer full
-functionalities as htslib and provide simple and safe API in a
-single header file that can be easily integrated for writing scripts
-quickly in C++, python and R.
+include vcfR[cite:@brian2017] for R, cyvcf2 [cite:@pedersen2017a]
+for Python, hts-nim [cite:@pedersen2018] for Nim and vcflib
+[cite:@garrison2022] for C++. All are useful for corresponding
+community, having both advantages and disadvantages. In particular,
+vcflib is both an API and a large collection of command line tools,
+with the main disadvantage being not supporting BCF files. It is
+noteworthy that many methods written in C++ designed for large
+sample size can not even input the BCF file which is tailed for big
+data like Syllbable-PBWT [cite:@wang2023]. The motivation of vcfpp
+is to offer full functionalities as htslib and provide simple and
+safe API in a single header file that can be easily integrated for
+writing scripts quickly in C++ and R.
 
 * Features 
 
 vcfpp is implemented as a single header file for being easily
-intergrated and compiled. There are four core class for manipulating
+integrated and compiled. There are four core class for manipulating
 uncompressed and compressed VCF/BCF as summarized in Table [[tb:class]].
 
 #+caption: vcfpp capabilities and implemented C++ class
@@ -76,12 +76,12 @@ uncompressed and compressed VCF/BCF as summarized in Table [[tb:class]].
 
 * Usage
 
-In an effort to demonstrate the power and performance of vcfpp, the
-following sections highlight typical VCF analyses and illustrate
-commonly used features in vcfpp. Other examples and further details
-of the vcfpp can be found at https://github.com/Zilong-Li/vcfpp.
+To demonstrate the power and performance of vcfpp, the
+following sections illustrate commonly used features of vcfpp and
+highlight the portability of vcfpp in R. Other examples and further
+details of vcfpp can be found at https://github.com/Zilong-Li/vcfpp.
 
-** Python-like API
+** C++ API
 
 In this example, we count the number of heterozygous sites for
 each sample in all records. The following code first includes a
@@ -90,16 +90,16 @@ constrained to 3 samples and "chr21" region (line 2), and creates
 a variant record object associated with the header information in
 the BCF (line 3). Then it defines several types of objects to
 collect the results we want (line 4-6). Taking advantage of
-generic templates in C++, type of field like genotype can be very
-flexible. Then it starts iterating over the BCF file and process
-each variant record in the loop (line 7). We ignore variant of
-other types (INDEL, SV), with missing genotypes and with QUAL
-value smaller than 9, while we can also retrieve more information
-of the variant and do filtering (line 9-10). Finally, we count the
-heterozygous sites for each diploid sample (line 11-13). The core
-is just 13 lines.
-
-#+caption: Counting heterozygous genotypes per sample on chr21
+generic templates in C++, the type of field like genotype can be
+very flexible. Then it starts iterating over the BCF file and
+processes each variant record in the loop (line 7). We ignore
+variants of other types (INDEL, SV), or with missing genotypes, or
+with QUAL value smaller than 9, while we can also retrieve more
+information of the variant and do filtering (line 9-10). Finally,
+we count the heterozygous sites for each diploid sample (line
+11-13). The core is just 13 lines.
+
+#+caption: Counting the number of heterozygous genotypes for 3 samples on chr21
 #+attr_latex: :options captionpos=t
 #+begin_src C++ -n
 #include <vcfpp.h>
@@ -125,14 +125,16 @@ while(vcf.getNextVariant(var)){
 While vcfpp is very simple for writing a C++ program, a single C++
 header file can be easily integrated into popular script languages
 like Python and R etc. The process of finding biological insights
-from variants information always involves statistical model, and
+from variant information always involves statistical models, and
 Statisticians prefer analyzing and visualizing data in R. In this
-example, we demonstrate how vcfpp can work with R seamlessly by
-using Rcpp [cite:@eddelbuettel2011]. In the "vcfpp-r.cpp" file, we
-write a simple C++ function to get a list of genotypes and its
-quality with only changing the output objects to Rcpp::List. Then
-we can easily call the /readbcf/ function to proceed further
-statistical tests in R.
+example, I demonstrate how vcfpp can work with R seamlessly by
+using Rcpp [cite:@eddelbuettel2011]. With the basic knowledge of
+C++ and Rcpp, we can write such a simple C++ function to get a
+list of genotypes and its quality by only changing the output
+objects to Rcpp::List. Then we can call the /readbcf/ function to
+proceed further statistical tests in R. This exhibits the
+advantage of vcfpp for developing R packages with VCF/BCF as
+input.
 
 #+caption: vcfpp-r.cpp
 #+attr_latex: :options captionpos=t
@@ -146,7 +148,7 @@ List readbcf(string vcffile) {
     vcfpp::BcfReader vcf(vcffile);
     vcfpp::BcfRecord var(vcf.header);
     vector<vector<int>> GT, GQ;
-    vector<int> gt, gq;
+    vector<int> gt, gq; // store a vector of genotype
     while (vcf.getNextVariant(var)) {
         var.getGenotypes(gt);
         var.getFORMAT("GQ",gq); // can get all other tags
@@ -167,25 +169,27 @@ out <- readbcf("bcf.gz")
 
 * Benchmarking
 
-In addition to simplicity and portability, we show how fast and
+In addition to simplicity and portability, I show how fast and
 efficient scripts using vcfpp can be. In the benchmarking, we
-performed the same task as in cyvcf2 paper. As shown in Table
-[[tb:counthets]], the "test-cyvcf2.py" is $1.3\times$ slower than the
-"test-vcfpp-1.R", and both using little RAM in one-step strategy. As
-R packages usually load data into tables first and perform analysis
-later, we also made a function to load whole data in R for such
-two-step comparison although it is not preferable for this
-task. Notably, the "test-vcfpp-2.R" script is only $1.9\times$
-slower compared to $70\times$, $85\times$ slower for "test-vcfR.R"
-and "test-fread.R". This is because genotypes made by both vcfR and
-fread are characters, which is very slow to parse in R although we
-use a faster string library. However, using vcfpp can get integer
-vector directly in R for computation. Moreover, regarding the best
-practice of data analysis in R, we usually want to inspect part
-of the data table first to make further decisions. And vcfpp has the
-full functionalities of htslib that is supporting reading BCF,
-selecting samples and regions. A quick lookup can be achieved by
-passing a region parameter.
+perform the same task as the cyvcf2 paper, that is counting
+heterozygous genotypes per sample using code in Listing 1. As shown
+in Table [[tb:counthets]], the "test-cyvcf2.py" is $1.3\times$ slower
+than the "test-vcfpp-1.R", and both use little RAM in a streaming
+strategy. As R packages usually load data into tables first and
+perform analysis later, I also write a function to load whole VCF
+content in R for such two-step comparison although it is not
+preferable for this task. Notably, the "test-vcfpp-2.R" script is
+only $1.9\times$ slower compared to $70\times$, $85\times$ slower
+for "test-vcfR.R" and "test-fread.R". This is because genotypes made
+by both vcfR and data.table::fread are characters, which are very
+slow to parse in R despite a faster string library was
+used. However, using vcfpp can get integer vectors directly in R for
+computation. Moreover, regarding the best practice of data analysis
+in R, we usually want to inspect part of the data table first to
+make further decisions. And vcfpp has the full functionalities of
+htslib that is supporting reading BCF, selecting samples and
+regions. A quick lookup of VCF content can be achieved by passing a
+region parameter.
 
 #+caption: Performance of counting heterozygous genotypes per sample in the 1000 Genome Project for chromosome 21. (^) used by /sourceCpp/ function. (*) used by loading data in two-step strategy.
 #+name: tb:counthets
@@ -203,8 +207,8 @@ passing a region parameter.
 
 * Discussion
 
-We have developed vcfpp, a fast and flexible C++ API for scripting
-high-perfomance genetic variant analyses. Its simplicity and
+I have developed vcfpp, a fast and flexible C++ API for scripting
+high-performance genetic variant analyses. Its simplicity and
 portability can be very useful for both developing packages and
 writing daily used scripts. Many existing packages written in C++
 using customized VCF parser can be replaced with vcfpp to offer more
@@ -219,11 +223,13 @@ https://github.com/Zilong-Li/vcfpp/tree/main/tools.
 The latest release of vcfpp.h can be downloaded from
 https://github.com/Zilong-Li/vcfpp/releases. Scripts for
 Benchmarking can be found here
-https://github.com/Zilong-Li/vcfpp/tree/main/scripts.
+https://github.com/Zilong-Li/vcfpp/tree/main/scripts. More helpful
+example functions can be found here
+https://github.com/Zilong-Li/vcfpp/tree/main/tools.
 
 * Acknowledgments
 
-We would like to thank Anders Albrechtsen and Robert W Davies for
+I would like to thank Anders Albrechtsen and Robert W Davies for
 helpful comments. Also, this project is largely inspired by open
 source projects SeqLib and cyvcf2.
 

diff --git a/readme.org b/readme.org
@@ -19,23 +19,30 @@ Features:
 * Table of Contents :TOC:QUOTE:
 #+BEGIN_QUOTE
 - [[#installation][Installation]]
-- [[#examples][Examples]]
+- [[#usage][Usage]]
   - [[#reading-vcf][Reading VCF]]
   - [[#writing-vcf][Writing VCF]]
-- [[#tools][Tools]]
+  - [[#variants-operation][Variants Operation]]
+  - [[#header-operation][Header Operation]]
+- [[#rcpp-examples][Rcpp examples]]
+- [[#command-line-tools][Command Line Tools]]
 #+END_QUOTE
 
 * Installation
-- download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]] and include it in your program.
-- make sure you have https://github.com/samtools/htslib installed on your system and the it is in your environment.
 
-* Examples
-
-You can copy paste the example code into a =example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts -lz -lm -lbz2 -llzma -lcurl=
+1. install [[https://github.com/samtools/htslib][htslib]] on your system
+2. download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]]
+3. put =vcfpp.h= in a folder say =/my/writable/path/= or the same folder as your cpp source file or the system path
+
+* Usage
 
 ** Reading VCF
 
-In this example, we count the number of heterozygous sites for each sample in all records. The core is *just 10 lines*.
+In this example, we count the number of heterozygous sites for each
+sample in all records. You can copy paste the example code into a
+=example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3
+-Wall -I. -lhts=. You can replace =-I.= with =-I/my/writable/path/= if you
+put =vcfpp.h= there.
 
 #+begin_src C++
 #include "vcfpp.h"
@@ -44,13 +51,12 @@ using namespace vcfpp;
 int main(int argc, char* argv[])
 {
     // read data from 1000 genomes server
-    BcfReader vcf("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
+    BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
     BcfRecord var(vcf.header); // construct a variant record
-    vector<char> gt; // genotype can be of bool, char or int type
+    vector<char> gt; // genotype can be bool, char or int type
     vector<int> hetsum(vcf.nsamples, 0);
     while (vcf.getNextVariant(var)) {
         var.getGenotypes(gt);
-        // analyze SNP variant with no genotype missingness
         if (!var.isSNP() || !var.isNoneMissing()) continue; 
         assert(var.ploidy()==2); // make sure it is diploidy
         for(int i = 0; i < gt.size() / 2; i++) 
@@ -63,33 +69,78 @@ int main(int argc, char* argv[])
 
 ** Writing VCF
 
-In this example, we read vcf and keep variants with allele freqency AF > 0.05. Meanwhile, we output a new BCF file.
+There many ways in vcfpp for writing the VCF/BCF file.
+
+*** Use an empty template
+
+Here we constuct an initial BCF with header using VCF4.3
+specification. Next we add meta data in the header and write out
+variant record given a string.
 
 #+begin_src C++
-#include "vcfpp.h"
-using namespace std;
-using namespace vcfpp;
-int main(int argc, char* argv[]) {
-    // read data from 1000 genomes server
-    BcfReader vcf("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
-    BcfRecord var(vcf.header); // construct a variant record
-    BcfWriter bw("out.bcf", vcf.header); // construct a vcf writer using same vcf header
-    vector<char> gt; // genotype can be of bool, char or int type
-    while (vcf.getNextVariant(var)) {
-        if (!var.isSNP()) continue; // skip other type of variants
-        var.getGenotypes(gt);
-        int ac = 0;
-        for(auto g : gt) ac += g;
-        if((double)ac / gt.size() > 0.05) continue;
-        bw.writeRecord(var);
-    }
-    return 0;
-}
+BcfWriter bw("out.bcf", "VCF4.3");
+bw.header.addFORMAT("GT", "1", "String", "Genotype");
+bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)");
+bw.header.addContig("chr20");
+for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s);
+bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0");
 #+end_src
 
-** Genotypes
+*** Use another VCF as template
 
+In this example, we first read VCF file
+=test/test-vcf-read.vcf.gz=. Next we construct an empty variant record
+and update the record with the input VCF. Thirdly, we construct a
+BcfWriter object using the meta data in the header of the input VCF,
+writing out the header and the modified variant.
 
-* Tools
+#+begin_src C++
+BcfReader br("test/test-vcf-read.vcf.gz");
+BcfRecord var(br.header);
+br.getNextVariant(var);
+BcfWriter bw("test.vcf", br.header);
+bw.writeHeader();
+var.setPOS(1);
+bw.writeRecord(var);
+#+end_src
+
+** Variants Operation
+
+All variants related API can be found [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_record][BcfRecord]]. The commonly used are listed below.
+
+#+begin_src C++
+BcfReader vcf("bcf.gz"); // construct a vcf reader
+BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header
+vcf.getNextVariant(var) // get next variant
+vector<char> gt; // genotype can be bool, char or int type
+var.getGenotypes(gt); // get genotypes for current variant
+var.isNoneMissing(); // check if there is missing value after getting genotypes
+vector<int> gq; // genotype quality usually is of int type
+var.getFORMAT("GQ",gq); // get a vector of genotypes quality 
+vector<int> pl; // Phred-scaled genotype likelihoods usually is of int type
+var.getFORMAT("PL",pl); // get a vector of Phred-scaled genotype likelihoods
+float hwe, af;
+var.getINFO("HWE",hwe) // get HWE value from INFO
+var.getINFO("AF", af) // get AF (allele frequency) value from INFO
+int mq;
+var.getINFO("MQ",mq) // get MQ (Average mapping quality) value from INFO
+vector<int> dp4; // Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases
+var.getINFO("DP4", dp4) // get a vector of dp4 value from INFO
+var.isSNP(); // check if variant is SNP
+var.isSV(); // check if variant is SV
+var.isIndel(); // check if variant is indel
+var.isMultiAllelic(); // check if variant is indel
+#+end_src
+
+** Header Operation
+
+All variants related API can be found in [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_header][BcfHeader]].
+
+* Rcpp examples
+
+Here are Rcpp examples
+
+* Command Line Tools
 
 Find more useful algorithms/tools with VCF/BCF input in [[tools]]
+
diff --git a/scripts/bench.sh b/scripts/bench.sh
@@ -20,8 +20,6 @@ vcffile="1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vc
 # do the benchmarking in same conda env
 
 $gtime -vvv Rscript test-fread.R $vcffile &> test-fread.llog.1 &
-$gtime -vvv Rscript test-vcfR.R $vcffile 1 &> test-vcfR.llog.1 &
-$gtime -vvv Rscript test-vcfR.R $vcffile 2 &> test-vcfR.llog.2 &
 $gtime -vvv Rscript test-vcfpp-1.R $vcffile &> test-vcfpp.llog.1 &
 $gtime -vvv Rscript test-vcfpp-2.R $vcffile &> test-vcfpp.llog.2 &
 $gtime -vvv python test-cyvcf2.py $vcffile &> test-cyvcf2.llog &

diff --git a/vcfpp.h b/vcfpp.h
@@ -2,7 +2,7 @@
  * @file        https://github.com/Zilong-Li/vcfpp/vcfpp.h
  * @author      Zilong Li
  * @email       [email protected]
- * @version     v0.1.7
+ * @version     v0.1.8
  * @breif       a single C++ file for manipulating VCF
  * Copyright (C) 2022-2023.The use of this code is governed by the LICENSE file.
  ******************************************************************************/