-
Notifications
You must be signed in to change notification settings - Fork 2
/
README.Rmd
151 lines (116 loc) · 5.56 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
warning = TRUE,
comment = "#>",
dpi = 100,
fig.dim = c(6, 4),
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# vcfppR: rapid manipulation of the VCF/BCF file
<!-- badges: start -->
<a href="https://github.com/Zilong-Li/random/blob/main/vcfppR.png"><img src="https://raw.githubusercontent.com/Zilong-Li/random/main/vcfppR.png" width="120" align="right" /></a>
![R-CMD-check](https://github.com/Zilong-Li/vcfppR/actions/workflows/check-release.yaml/badge.svg)
[![CRAN status](https://www.r-pkg.org/badges/version/vcfppR)](https://CRAN.R-project.org/package=vcfppR)
![https://github.com/Zilong-Li/vcfppR/releases/latest](https://img.shields.io/github/v/release/Zilong-Li/vcfppR.svg)
[![codecov](https://codecov.io/github/Zilong-Li/vcfppR/graph/badge.svg?token=QE1UFVRH98)](https://app.codecov.io/github/Zilong-Li/vcfppR)
[![Downloads](https://cranlogs.r-pkg.org/badges/vcfppR?color=blue)](https://CRAN.R-project.org/package=vcfppR)
<!-- badges: end -->
The vcfppR package implements various powerful functions for fast genomics analyses with VCF/BCF files using the C++ API of [vcfpp.h](https://github.com/Zilong-Li/vcfpp).
- Load/save content of VCF/BCF into R objects with highly customizable options
- Visualize and chracterize variants
- Compare two VCF/BCF files and report various statistics
- Streaming read/write VCF/BCF files with fine control of everything
- [Paper](https://doi.org/10.1093/bioinformatics/btae049) shows vcfppR is 20x faster than `vcfR`. Also, much faster than `cyvcf2`
## Installation
``` r
## install.package("vcfppR") ## from CRAN
remotes::install_github("Zilong-Li/vcfppR") ## from latest github
```
## Cite the work
If you find it useful, please cite the [paper](https://doi.org/10.1093/bioinformatics/btae049)
```r
library(vcfppR)
citation("vcfppR")
```
## URL as filename
All functions in vcfppR support URL as filename of VCF/BCF files.
```{r data}
phasedvcf <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"
rawvcf <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_raw_GT_with_annot/20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr21.recalibrated_variants.vcf.gz"
svfile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20210124.SV_Illumina_Integration/1KGP_3202.gatksv_svtools_novelins.freeze_V3.wAF.vcf.gz"
popfile <- "https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/20130606_g1k_3202_samples_ped_population.txt"
```
## `vcftable`: read VCF as tabular data
`vcftable` gives you fine control over what you want to extract from VCF/BCF files.
**Read SNP variants with GT format and two samples**
```{r snp}
library(vcfppR)
res <- vcftable(phasedvcf, "chr21:1-5100000", samples = "HG00673,NA10840", vartype = "snps")
str(res)
```
**Read INDEL variants with DP format and QUAL>100**
```{r dp}
res <- vcftable(rawvcf, "chr21:1-5100000", vartype = "indels", format = "DP", qual=100)
vcfplot(res, which.sample = 10, ylim=c(0,60), col = 3, pch=19)
```
## `vcfcomp`: compare two VCF files and report concordance
Want to investigate the concordance between two VCF files? `vcfcomp` is the utility function you need! For example, in benchmarkings, we intend to calculate the genotype correlation between the test and the truth.
```{r r2}
res <- vcfcomp(test = rawvcf, truth = phasedvcf,
stats = "r2", region = "chr21:1-5100000",
formats = c("GT","GT"), setid = TRUE)
par(mar=c(5,5,2,2), cex.lab = 2)
vcfplot(res, col = 2,cex = 2, lwd = 3, type = "b")
```
```{r pse}
res <- vcfcomp(test = rawvcf, truth = phasedvcf,
stats = "pse", setid = TRUE,
region = "chr21:5000000-5500000",
samples = "HG00673,NA10840",
return_pse_sites = TRUE)
vcfplot(res, which=1:2, main = "Phasing switch error", ylab = "HG00673,NA10840")
```
Check out the [vignettes](https://zilong-li.github.io/vcfppR/articles/) for more!
## `vcfsummary`: variants characterization
Want to summarize variants discovered by genotype caller e.g. GATK? `vcfsummary` is the utility function you need!
**Small variants**
```{r summary_sm}
res <- vcfsummary(rawvcf,"chr21:10000000-10010000")
vcfplot(res, pop = popfile, col = 1:5, main = "Number of SNP & INDEL variants")
```
**Complex structure variants**
```{r summary_sv}
res <- vcfsummary(svfile, svtype = TRUE, region = "chr20")
vcfplot(res, main = "Structure Variant Counts", col = 1:7)
```
## R API of vcfpp.h
There are two classes i.e. ***vcfreader*** and ***vcfwriter*** offering the full
R-bindings of vcfpp.h. Check out the examples in the [tests](tests/testthat)
folder or refer to the manual, e.g. `?vcfppR::vcfreader`.
```{r test}
library(testthat)
svfile <- system.file("extdata", "sv.vcf.gz", package="vcfppR")
test_that("can change samples name and set genotypes for single sample", {
br <- vcfreader$new(svfile, "", "HG00096")
br$variant()
expect_identical(br$infoStr("SVTYPE"), "DUP")
expect_identical(br$genotypes(F), c(0L, 0L))
br$setGenotypes(c(1L,1L))
expect_identical(br$genotypes(F), c(1L, 1L))
outfile <- paste0(tempfile(), ".vcf.gz")
br$output(outfile)
br$updateSamples("ZZZZZ")
br$write()
br$close()
vcf <- vcftable(outfile)
expect_true(vcf$gt==2)
expect_true(vcf$samples=="ZZZZZ")
})
```