Skip to content

Commit

Permalink
update script
Browse files Browse the repository at this point in the history
  • Loading branch information
Zilong-Li committed Sep 4, 2023
1 parent e67c602 commit bb71527
Show file tree
Hide file tree
Showing 5 changed files with 37 additions and 24 deletions.
28 changes: 15 additions & 13 deletions doc/paper.org
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
\thanks{Zilong Li: Section for Computational and RNA Biology, Department of Biology, University of Copenhagen. \\
Email: [email protected].}}
\begin{titlepage} \maketitle
\begin{abstract}
Given the complexity and popularity of the VCF/BCF format as well as
ever-growing data size, there is always a need for fast and flexible
methods to manipulate it in different programming languages. Static
Expand All @@ -29,6 +30,7 @@ Email: [email protected].}}
loading VCF contents and processing genotypes than the R scripts
using vcfR and data.table. Finally, some useful command line tools
using vcfpp are available.
\end{abstract}
\end{titlepage}
#+end_export

Expand Down Expand Up @@ -57,7 +59,7 @@ is to offer full functionalities as htslib and provide simple and
safe API in a single header file that can be easily integrated for
writing scripts quickly in C++ and R.

* Features
* Methods

vcfpp is implemented as a single header file for being easily
integrated and compiled. There are four core class for manipulating
Expand All @@ -74,7 +76,7 @@ uncompressed and compressed VCF/BCF as summarized in Table [[tb:class]].
| VCF/BCF header and operations | BcfHeader |
|---------------------------------+-----------|

* Usage
* Results

To demonstrate the power and performance of vcfpp, the
following sections illustrate commonly used features of vcfpp and
Expand Down Expand Up @@ -167,7 +169,7 @@ out <- readbcf("bcf.gz")
## next perform statistical modeling
#+end_src

* Benchmarking
** Benchmarking

In addition to simplicity and portability, I show how fast and
efficient scripts using vcfpp can be. In the benchmarking, we
Expand All @@ -194,16 +196,16 @@ achieved by passing a region parameter.
#+caption: Performance of counting heterozygous genotypes per sample in the 1000 Genome Project for chromosome 21. (^) used by /sourceCpp/ function. (*) used by loading data in two-step strategy.
#+name: tb:counthets
#+attr_latex: :align lllllll :placement [H]
|-------------------+------------+-------+------------+-----------+----------------|
| API | Time (s) | Ratio | RAM (Gb) | Strategy | Script |
|-------------------+------------+-------+------------+-----------+----------------|
| vcfpp::BcfReader | 118 | 1.0 | 0.006 | streaming | test-vcfpp.cpp |
| vcfpp::BcfReader | 119+5^ | 1.0 | 0.07+0.28^ | streaming | test-vcfpp-1.R |
| cyvcf2::VCF | 159 | 1.3 | 0.04 | streaming | test-cyvcf2.py |
| vcfpp::BcfReader | 168*+65 | 1.9 | 86 | two-step | test-vcfpp-2.R |
| vcfR::read.vcfR | 604*+7992 | 70 | 74 | two-step | test-vcfR.R |
| data.table::fread | 272*+10275 | 85 | 77 | two-step | test-fread.R |
|-------------------+------------+-------+------------+-----------+----------------|
|-------------------+------------+-------+------------+-----------|
| API | Time (s) | Ratio | RAM (Gb) | Strategy |
|-------------------+------------+-------+------------+-----------|
| vcfpp::BcfReader | 118 | 1.0 | 0.006 | streaming |
| vcfpp::BcfReader | 119+5^ | 1.0 | 0.07+0.28^ | streaming |
| cyvcf2::VCF | 159 | 1.3 | 0.04 | streaming |
| vcfppR::tableGT | 164*+196 | 1.8 | 73 | two-step |
| vcfR::read.vcfR | 651*+1382 | 9.9 | 105 | two-step |
| data.table::fread | 272*+10275 | 85 | 77 | two-step |
|-------------------+------------+-------+------------+-----------|

* Discussion

Expand Down
1 change: 1 addition & 0 deletions scripts/bench.sh
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ vcffile="1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vc
# do the benchmarking in same conda env

$gtime -vvv Rscript test-fread.R $vcffile &> test-fread.llog.1 &
$gtime -vvv Rscript test-vcfR.R $vcffile 2 &> test-vcfR.llog.2 &
$gtime -vvv Rscript test-vcfpp-1.R $vcffile &> test-vcfpp.llog.1 &
$gtime -vvv Rscript test-vcfpp-2.R $vcffile &> test-vcfpp.llog.2 &
$gtime -vvv python test-cyvcf2.py $vcffile &> test-cyvcf2.llog &
Expand Down
12 changes: 2 additions & 10 deletions scripts/test-vcfR.R
Original file line number Diff line number Diff line change
Expand Up @@ -8,15 +8,7 @@ run <- as.integer(args[2])

system.time(vcf <- read.vcfR(vcffile))

calc_hets1 <- function(gt) {
hets <- apply(gt, 2, function(a) {
o <- sapply(str_split(a, fixed("|")), as.numeric)
sum(colSums(o)==1)
})
hets
}

if(run == 2) {
gt <- extract.gt(vcf[is.biallelic(vcf),], element = 'GT')
system.time(hets<-calc_hets1(gt))
gt <- extract.gt(vcf[is.biallelic(vcf),], element = 'GT', as.numeric = TRUE)
system.time(hets <- apply(gt, 2, function(g) sum(g==1)))
}
18 changes: 18 additions & 0 deletions scripts/test-vcfppR.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
## devtools::install_github("Zilong-Li/vcfppR")

library(vcfppR)

run <- 1
args <- commandArgs(trailingOnly = TRUE)
vcffile <- args[1]
run <- as.integer(args[2])

system.time(vcf <- tableGT(vcffile, "chr21"))

if(run == 2) {
res <- sapply(vcf[["gt"]], function(a) {
n=length(a)
abs(a[seq(1,n,2)]-a[seq(2,n,2)])
})
hets<-rowSums(res)
}
2 changes: 1 addition & 1 deletion tools/Makefile
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ HTSLIB = /usr/local/lib
VCFPP = ../
CXX = g++
CXXFLAGS = -std=c++11 -Wall -O3
INC = -I$(HTSINC)
INC = -I$(HTSINC) -I$(VCFPP)
LDFLAGS = -L$(HTSLIB) -Wl,-rpath,$(HTSLIB)
LIBS = -lhts -lz -lm -lbz2 -llzma -lcurl -lpthread
# OBJS = $(patsubst %.cpp, %.o, $(wildcard *.cpp))
Expand Down

0 comments on commit bb71527

Please sign in to comment.