Skip to content

Commit

Permalink
update doc
Browse files Browse the repository at this point in the history
  • Loading branch information
Zilong-Li committed Aug 30, 2023
1 parent b0d6b53 commit 42c4256
Show file tree
Hide file tree
Showing 4 changed files with 163 additions and 108 deletions.
150 changes: 78 additions & 72 deletions doc/paper.org
Original file line number Diff line number Diff line change
Expand Up @@ -15,19 +15,20 @@ Email: [email protected].}}
methods to manipulate it in different programming languages. Static
languages like C++ has the superiority of performance, and modern
C++ standards strive to provide growing standard library for
developing program quickly and easily. In this work. we present
vcfpp, a C++ API of htslib, providing a rapid and intuitive
interface to manipulate VCF/BCF files safely. The simplicity and
portability makes vcfpp working with scripting language like python
and R easily. An example demonstrates how minimal vcfpp code can
manipulate variant records rapidly. Additionally, another example
demonstrates how vcfpp can be useful to Statisticians analyzing
genetic variants in R. In the Benchmarking, a simple R script using
vcfpp API shows faster speed than script using cyvcf2 when streaming
variant analysis. And in a two-step setting, a R script using vcfpp
has faster speed of both loading VCF data and processing genotypes
than the R scripts using vcfR and data.table. Finally, some useful
command line tools using vcfpp are available publicly.
developing program quickly and easily. In this work, I present
vcfpp, a C++ API of htslib in a single header file, providing an
intuitive interface to manipulate VCF/BCF files rapidly and
safely. An example demonstrates how minimal vcfpp code can
manipulate variant records rapidly. Additionally, the simplicity and
portability makes vcfpp working with scripting language like R
easily. The second example demonstrates how vcfpp can be useful to
Statisticians analyzing genomics variations in R. In the
Benchmarking, a simple R script using vcfpp API shows faster speed
than script using cyvcf2 when streaming variant analysis. And in a
two-step setting, a R script using vcfpp has faster speed of both
loading VCF contents and processing genotypes than the R scripts
using vcfR and data.table. Finally, some useful command line tools
using vcfpp are available.
\end{titlepage}
#+end_export

Expand All @@ -43,24 +44,23 @@ efficiently. The C API of htslib provide the full functionalities to
manipulate VCF and BCF format for both compressed and uncompressed
files. As the C API is hard to use for in-proficient programmers,
API of other languages are invented. Existing popular libraries
include vcfR[cite:@brian2017] for R, cyvcf2 for python
[cite:@pedersen2017a], hts-nim [cite:@pedersen2018] for nim and
vcflib for C++ [cite:@garrison2022]. All are useful for
corresponding language community and have both advantages and
disadvantages. In particular, vcflib is both an API and a large
collection of command line tools, with the main disadvantage being
not supporting BCF file. It is noteworthy that many methods written
in C++ designed for large sample size can not even input the BCF
file which is tailed for big data like Syllbable-PBWT
[cite:@wang2023]. The motivation of vcfpp is to offer full
functionalities as htslib and provide simple and safe API in a
single header file that can be easily integrated for writing scripts
quickly in C++, python and R.
include vcfR[cite:@brian2017] for R, cyvcf2 [cite:@pedersen2017a]
for Python, hts-nim [cite:@pedersen2018] for Nim and vcflib
[cite:@garrison2022] for C++. All are useful for corresponding
community, having both advantages and disadvantages. In particular,
vcflib is both an API and a large collection of command line tools,
with the main disadvantage being not supporting BCF files. It is
noteworthy that many methods written in C++ designed for large
sample size can not even input the BCF file which is tailed for big
data like Syllbable-PBWT [cite:@wang2023]. The motivation of vcfpp
is to offer full functionalities as htslib and provide simple and
safe API in a single header file that can be easily integrated for
writing scripts quickly in C++ and R.

* Features

vcfpp is implemented as a single header file for being easily
intergrated and compiled. There are four core class for manipulating
integrated and compiled. There are four core class for manipulating
uncompressed and compressed VCF/BCF as summarized in Table [[tb:class]].

#+caption: vcfpp capabilities and implemented C++ class
Expand All @@ -76,12 +76,12 @@ uncompressed and compressed VCF/BCF as summarized in Table [[tb:class]].

* Usage

In an effort to demonstrate the power and performance of vcfpp, the
following sections highlight typical VCF analyses and illustrate
commonly used features in vcfpp. Other examples and further details
of the vcfpp can be found at https://github.com/Zilong-Li/vcfpp.
To demonstrate the power and performance of vcfpp, the
following sections illustrate commonly used features of vcfpp and
highlight the portability of vcfpp in R. Other examples and further
details of vcfpp can be found at https://github.com/Zilong-Li/vcfpp.

** Python-like API
** C++ API

In this example, we count the number of heterozygous sites for
each sample in all records. The following code first includes a
Expand All @@ -90,16 +90,16 @@ constrained to 3 samples and "chr21" region (line 2), and creates
a variant record object associated with the header information in
the BCF (line 3). Then it defines several types of objects to
collect the results we want (line 4-6). Taking advantage of
generic templates in C++, type of field like genotype can be very
flexible. Then it starts iterating over the BCF file and process
each variant record in the loop (line 7). We ignore variant of
other types (INDEL, SV), with missing genotypes and with QUAL
value smaller than 9, while we can also retrieve more information
of the variant and do filtering (line 9-10). Finally, we count the
heterozygous sites for each diploid sample (line 11-13). The core
is just 13 lines.

#+caption: Counting heterozygous genotypes per sample on chr21
generic templates in C++, the type of field like genotype can be
very flexible. Then it starts iterating over the BCF file and
processes each variant record in the loop (line 7). We ignore
variants of other types (INDEL, SV), or with missing genotypes, or
with QUAL value smaller than 9, while we can also retrieve more
information of the variant and do filtering (line 9-10). Finally,
we count the heterozygous sites for each diploid sample (line
11-13). The core is just 13 lines.

#+caption: Counting the number of heterozygous genotypes for 3 samples on chr21
#+attr_latex: :options captionpos=t
#+begin_src C++ -n
#include <vcfpp.h>
Expand All @@ -125,14 +125,16 @@ while(vcf.getNextVariant(var)){
While vcfpp is very simple for writing a C++ program, a single C++
header file can be easily integrated into popular script languages
like Python and R etc. The process of finding biological insights
from variants information always involves statistical model, and
from variant information always involves statistical models, and
Statisticians prefer analyzing and visualizing data in R. In this
example, we demonstrate how vcfpp can work with R seamlessly by
using Rcpp [cite:@eddelbuettel2011]. In the "vcfpp-r.cpp" file, we
write a simple C++ function to get a list of genotypes and its
quality with only changing the output objects to Rcpp::List. Then
we can easily call the /readbcf/ function to proceed further
statistical tests in R.
example, I demonstrate how vcfpp can work with R seamlessly by
using Rcpp [cite:@eddelbuettel2011]. With the basic knowledge of
C++ and Rcpp, we can write such a simple C++ function to get a
list of genotypes and its quality by only changing the output
objects to Rcpp::List. Then we can call the /readbcf/ function to
proceed further statistical tests in R. This exhibits the
advantage of vcfpp for developing R packages with VCF/BCF as
input.

#+caption: vcfpp-r.cpp
#+attr_latex: :options captionpos=t
Expand All @@ -146,7 +148,7 @@ List readbcf(string vcffile) {
vcfpp::BcfReader vcf(vcffile);
vcfpp::BcfRecord var(vcf.header);
vector<vector<int>> GT, GQ;
vector<int> gt, gq;
vector<int> gt, gq; // store a vector of genotype
while (vcf.getNextVariant(var)) {
var.getGenotypes(gt);
var.getFORMAT("GQ",gq); // can get all other tags
Expand All @@ -167,25 +169,27 @@ out <- readbcf("bcf.gz")

* Benchmarking

In addition to simplicity and portability, we show how fast and
In addition to simplicity and portability, I show how fast and
efficient scripts using vcfpp can be. In the benchmarking, we
performed the same task as in cyvcf2 paper. As shown in Table
[[tb:counthets]], the "test-cyvcf2.py" is $1.3\times$ slower than the
"test-vcfpp-1.R", and both using little RAM in one-step strategy. As
R packages usually load data into tables first and perform analysis
later, we also made a function to load whole data in R for such
two-step comparison although it is not preferable for this
task. Notably, the "test-vcfpp-2.R" script is only $1.9\times$
slower compared to $70\times$, $85\times$ slower for "test-vcfR.R"
and "test-fread.R". This is because genotypes made by both vcfR and
fread are characters, which is very slow to parse in R although we
use a faster string library. However, using vcfpp can get integer
vector directly in R for computation. Moreover, regarding the best
practice of data analysis in R, we usually want to inspect part
of the data table first to make further decisions. And vcfpp has the
full functionalities of htslib that is supporting reading BCF,
selecting samples and regions. A quick lookup can be achieved by
passing a region parameter.
perform the same task as the cyvcf2 paper, that is counting
heterozygous genotypes per sample using code in Listing 1. As shown
in Table [[tb:counthets]], the "test-cyvcf2.py" is $1.3\times$ slower
than the "test-vcfpp-1.R", and both use little RAM in a streaming
strategy. As R packages usually load data into tables first and
perform analysis later, I also write a function to load whole VCF
content in R for such two-step comparison although it is not
preferable for this task. Notably, the "test-vcfpp-2.R" script is
only $1.9\times$ slower compared to $70\times$, $85\times$ slower
for "test-vcfR.R" and "test-fread.R". This is because genotypes made
by both vcfR and data.table::fread are characters, which are very
slow to parse in R despite a faster string library was
used. However, using vcfpp can get integer vectors directly in R for
computation. Moreover, regarding the best practice of data analysis
in R, we usually want to inspect part of the data table first to
make further decisions. And vcfpp has the full functionalities of
htslib that is supporting reading BCF, selecting samples and
regions. A quick lookup of VCF content can be achieved by passing a
region parameter.

#+caption: Performance of counting heterozygous genotypes per sample in the 1000 Genome Project for chromosome 21. (^) used by /sourceCpp/ function. (*) used by loading data in two-step strategy.
#+name: tb:counthets
Expand All @@ -203,8 +207,8 @@ passing a region parameter.

* Discussion

We have developed vcfpp, a fast and flexible C++ API for scripting
high-perfomance genetic variant analyses. Its simplicity and
I have developed vcfpp, a fast and flexible C++ API for scripting
high-performance genetic variant analyses. Its simplicity and
portability can be very useful for both developing packages and
writing daily used scripts. Many existing packages written in C++
using customized VCF parser can be replaced with vcfpp to offer more
Expand All @@ -219,11 +223,13 @@ https://github.com/Zilong-Li/vcfpp/tree/main/tools.
The latest release of vcfpp.h can be downloaded from
https://github.com/Zilong-Li/vcfpp/releases. Scripts for
Benchmarking can be found here
https://github.com/Zilong-Li/vcfpp/tree/main/scripts.
https://github.com/Zilong-Li/vcfpp/tree/main/scripts. More helpful
example functions can be found here
https://github.com/Zilong-Li/vcfpp/tree/main/tools.

* Acknowledgments

We would like to thank Anders Albrechtsen and Robert W Davies for
I would like to thank Anders Albrechtsen and Robert W Davies for
helpful comments. Also, this project is largely inspired by open
source projects SeqLib and cyvcf2.

Expand Down
117 changes: 84 additions & 33 deletions readme.org
Original file line number Diff line number Diff line change
Expand Up @@ -19,23 +19,30 @@ Features:
* Table of Contents :TOC:QUOTE:
#+BEGIN_QUOTE
- [[#installation][Installation]]
- [[#examples][Examples]]
- [[#usage][Usage]]
- [[#reading-vcf][Reading VCF]]
- [[#writing-vcf][Writing VCF]]
- [[#tools][Tools]]
- [[#variants-operation][Variants Operation]]
- [[#header-operation][Header Operation]]
- [[#rcpp-examples][Rcpp examples]]
- [[#command-line-tools][Command Line Tools]]
#+END_QUOTE

* Installation
- download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]] and include it in your program.
- make sure you have https://github.com/samtools/htslib installed on your system and the it is in your environment.

* Examples

You can copy paste the example code into a =example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts -lz -lm -lbz2 -llzma -lcurl=
1. install [[https://github.com/samtools/htslib][htslib]] on your system
2. download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]]
3. put =vcfpp.h= in a folder say =/my/writable/path/= or the same folder as your cpp source file or the system path

* Usage

** Reading VCF

In this example, we count the number of heterozygous sites for each sample in all records. The core is *just 10 lines*.
In this example, we count the number of heterozygous sites for each
sample in all records. You can copy paste the example code into a
=example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3
-Wall -I. -lhts=. You can replace =-I.= with =-I/my/writable/path/= if you
put =vcfpp.h= there.

#+begin_src C++
#include "vcfpp.h"
Expand All @@ -44,13 +51,12 @@ using namespace vcfpp;
int main(int argc, char* argv[])
{
// read data from 1000 genomes server
BcfReader vcf("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
BcfRecord var(vcf.header); // construct a variant record
vector<char> gt; // genotype can be of bool, char or int type
vector<char> gt; // genotype can be bool, char or int type
vector<int> hetsum(vcf.nsamples, 0);
while (vcf.getNextVariant(var)) {
var.getGenotypes(gt);
// analyze SNP variant with no genotype missingness
if (!var.isSNP() || !var.isNoneMissing()) continue;
assert(var.ploidy()==2); // make sure it is diploidy
for(int i = 0; i < gt.size() / 2; i++)
Expand All @@ -63,33 +69,78 @@ int main(int argc, char* argv[])

** Writing VCF

In this example, we read vcf and keep variants with allele freqency AF > 0.05. Meanwhile, we output a new BCF file.
There many ways in vcfpp for writing the VCF/BCF file.

*** Use an empty template

Here we constuct an initial BCF with header using VCF4.3
specification. Next we add meta data in the header and write out
variant record given a string.

#+begin_src C++
#include "vcfpp.h"
using namespace std;
using namespace vcfpp;
int main(int argc, char* argv[]) {
// read data from 1000 genomes server
BcfReader vcf("http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr22.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
BcfRecord var(vcf.header); // construct a variant record
BcfWriter bw("out.bcf", vcf.header); // construct a vcf writer using same vcf header
vector<char> gt; // genotype can be of bool, char or int type
while (vcf.getNextVariant(var)) {
if (!var.isSNP()) continue; // skip other type of variants
var.getGenotypes(gt);
int ac = 0;
for(auto g : gt) ac += g;
if((double)ac / gt.size() > 0.05) continue;
bw.writeRecord(var);
}
return 0;
}
BcfWriter bw("out.bcf", "VCF4.3");
bw.header.addFORMAT("GT", "1", "String", "Genotype");
bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)");
bw.header.addContig("chr20");
for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s);
bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0");
#+end_src

** Genotypes
*** Use another VCF as template

In this example, we first read VCF file
=test/test-vcf-read.vcf.gz=. Next we construct an empty variant record
and update the record with the input VCF. Thirdly, we construct a
BcfWriter object using the meta data in the header of the input VCF,
writing out the header and the modified variant.

* Tools
#+begin_src C++
BcfReader br("test/test-vcf-read.vcf.gz");
BcfRecord var(br.header);
br.getNextVariant(var);
BcfWriter bw("test.vcf", br.header);
bw.writeHeader();
var.setPOS(1);
bw.writeRecord(var);
#+end_src

** Variants Operation

All variants related API can be found [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_record][BcfRecord]]. The commonly used are listed below.

#+begin_src C++
BcfReader vcf("bcf.gz"); // construct a vcf reader
BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header
vcf.getNextVariant(var) // get next variant
vector<char> gt; // genotype can be bool, char or int type
var.getGenotypes(gt); // get genotypes for current variant
var.isNoneMissing(); // check if there is missing value after getting genotypes
vector<int> gq; // genotype quality usually is of int type
var.getFORMAT("GQ",gq); // get a vector of genotypes quality
vector<int> pl; // Phred-scaled genotype likelihoods usually is of int type
var.getFORMAT("PL",pl); // get a vector of Phred-scaled genotype likelihoods
float hwe, af;
var.getINFO("HWE",hwe) // get HWE value from INFO
var.getINFO("AF", af) // get AF (allele frequency) value from INFO
int mq;
var.getINFO("MQ",mq) // get MQ (Average mapping quality) value from INFO
vector<int> dp4; // Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases
var.getINFO("DP4", dp4) // get a vector of dp4 value from INFO
var.isSNP(); // check if variant is SNP
var.isSV(); // check if variant is SV
var.isIndel(); // check if variant is indel
var.isMultiAllelic(); // check if variant is indel
#+end_src

** Header Operation

All variants related API can be found in [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_header][BcfHeader]].

* Rcpp examples

Here are Rcpp examples

* Command Line Tools

Find more useful algorithms/tools with VCF/BCF input in [[tools]]

2 changes: 0 additions & 2 deletions scripts/bench.sh
Original file line number Diff line number Diff line change
Expand Up @@ -20,8 +20,6 @@ vcffile="1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vc
# do the benchmarking in same conda env

$gtime -vvv Rscript test-fread.R $vcffile &> test-fread.llog.1 &
$gtime -vvv Rscript test-vcfR.R $vcffile 1 &> test-vcfR.llog.1 &
$gtime -vvv Rscript test-vcfR.R $vcffile 2 &> test-vcfR.llog.2 &
$gtime -vvv Rscript test-vcfpp-1.R $vcffile &> test-vcfpp.llog.1 &
$gtime -vvv Rscript test-vcfpp-2.R $vcffile &> test-vcfpp.llog.2 &
$gtime -vvv python test-cyvcf2.py $vcffile &> test-cyvcf2.llog &
Expand Down
2 changes: 1 addition & 1 deletion vcfpp.h
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
* @file https://github.com/Zilong-Li/vcfpp/vcfpp.h
* @author Zilong Li
* @email [email protected]
* @version v0.1.7
* @version v0.1.8
* @breif a single C++ file for manipulating VCF
* Copyright (C) 2022-2023.The use of this code is governed by the LICENSE file.
******************************************************************************/
Expand Down

0 comments on commit 42c4256

Please sign in to comment.