Haplotype-aware Hidden Markov Model for RNA (HaHMMR) is a method for detecting copy number variations (CNVs) from bulk RNA-seq data. Extending the haplotype-aware HMM in Numbat for single-cell RNA-seq, HaHMMR offers enhanced capabilities for detecting low-clonality CNVs from bulk data.
This repo contains HaHMMR as a standalone package and as a submodule of Numbat (v1.4.0+).
For details of the method, please check out our paper(s):
Install via CRAN:
install.packages("hahmmr")
Or install the latest GitHub version using devtools
:
devtools::install_github("https://github.com/kharchenkolab/hahmmr")
First, obtain expression counts and phased allele counts from the RNA-seq sample. The expression counts can be prepared using a transcript quantification tool such as Salmon. The phased allele counts can be prepared using the pileup_and_phase.R pipeline from Numbat. A Docker container is available for running this pipeline.
For example, within the Numbat Docker you can run pileup_and_phase
in bulk RNA-seq mode like this:
Rscript /numbat/inst/bin/pileup_and_phase.R \
--bulk \
--label {sample} \
--samples {sample} \
--bams /mnt/mydata/{sample}.bam \
--outdir /mnt/mydata/{sample} \
--gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \
--snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \
--paneldir /data/1000G_hg38 \
--ncores ncores
The integer expression counts (count_mat
) should be a one-column matrix where rownames are genes and colname is the sample name. The phased allele counts (df_allele
) should be a dataframe containing columns snp_id
, CHROM
, POS
, cM
(genetic distance in centimorgan), REF
, ALT
, AD
(ALT allele count), DP
(total allele count), GT
(phased genotype), gene
.
Here is an example using the RNA-seq samples from a Meningioma study.
library(dplyr)
library(hahmmr)
allele_counts = data.table::fread('http://pklab.med.harvard.edu/teng/data/hmm_example/MN-5_TUMOR_allele_counts.tsv.gz')
gene_counts = readRDS(url('http://pklab.med.harvard.edu/teng/data/hmm_example/MN_gene_counts.rds'))
Sample MN-1037 has a diploid genome so we can use it to create a reference expression profile.
ref_internal = gene_counts[,'MN-1037_TUMOR',drop=F] %>% {./sum(.)}
head(ref_internal)
## MN-1037_TUMOR
## 7SK 0.000000e+00
## A1BG 1.107976e-06
## A1BG-AS1 5.003764e-07
## A1CF 3.574117e-08
## A2ML1 3.931529e-07
## A4GALT 9.314150e-05
We can now analyze it using HaHMMR.
sample = 'MN-5_TUMOR'
bulk = get_bulk(
count_mat = gene_counts[,sample,drop=F],
df_allele = allele_counts,
lambdas_ref = ref_internal,
gtf = gtf_hg38
) %>%
analyze_joint()
bulk %>% plot_psbulk(min_depth = 15)