I am testing the data portal and noticed a couple of issues regarding polymorphic genes. RNA-seq reads for these genes are processed with the same algorithms as for all other genes. This is unsuitable because of how many variants exist in the human population and how similar these genes are to each other. For example, in recent work we found that patients who did not express HLA-G according to laboratory experiments nevertheless had high counts for HLA-G by RNA-seq. Upon further investigation, we realised that the reads mapping to HLA-G had a mismatch score only 1 lower than their mismatch score against HLA-A in the reference genome. The alternative approach we implemented is:
1. Replace the sequence in hg38 where the HLA and KIR genes are located with N bases, so that reads cannot map there.
2. Use an RNA-seq aligner to map the reads to the modified reference genome sequence and output the unmapped reads to a separate FASTQ file.
3. Take the unmapped reads and the IMGT HLA database (which contains thousands of alleles for each gene) and use RSEM to determine where the reads should really go.
4. Use the reads mapped to the masked reference sequence to process all other genes (i.e. the non-polymorphic ones).
We found that this approach produced results that matched the biologists' experimental results and avoided reference sequence bias. That bias is usually not a problem for most genes in the genome, which are highly conserved and do not have paralogs the way HLA and KIR genes do.
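To make the masking step of the approach above concrete, here is a minimal sketch of replacing selected regions of a reference sequence with N. It is only an illustration, not the reporter's actual pipeline: the function name `mask_regions`, the toy sequences, and the coordinates are all made up, and the coordinates are assumed to be 0-based and half-open (as in BED files). Real HLA and KIR coordinates would need to come from a gene annotation for hg38.

```python
from typing import Dict, Iterable, Tuple

# (chromosome, start, end); assumed 0-based, half-open, as in BED files.
Region = Tuple[str, int, int]


def mask_regions(genome: Dict[str, str], regions: Iterable[Region]) -> Dict[str, str]:
    """Return a copy of `genome` with every listed region replaced by N.

    `genome` maps sequence names to sequence strings. The input is not modified.
    """
    # Work on mutable byte arrays so each region is overwritten in place.
    masked = {name: bytearray(seq, "ascii") for name, seq in genome.items()}
    for chrom, start, end in regions:
        seq = masked[chrom]
        # Clip the region to the bounds of the sequence.
        start = max(start, 0)
        end = min(end, len(seq))
        if start < end:
            seq[start:end] = b"N" * (end - start)
    return {name: seq.decode("ascii") for name, seq in masked.items()}


# Toy data only: these are NOT real hg38 sequences or HLA/KIR coordinates.
toy_genome = {"chr6": "ACGTACGTACGT", "chr19": "TTTTGGGGCCCC"}
toy_regions = [("chr6", 2, 6), ("chr19", 8, 20)]

masked_genome = mask_regions(toy_genome, toy_regions)
print(masked_genome["chr6"])
print(masked_genome["chr19"])
```

For a full genome, an existing tool that masks a FASTA file from a BED file (for example `bedtools maskfasta`) is likely a more practical choice than holding the sequences in memory like this.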
AC:
Determine if/when/why/how we want to handle polymorphic gene regions.
TODO: Kylee to better understand the use case for doing this at all.
Original report from Dario Strbenac on HCA Zendesk ([email protected]).
Issue is synchronized with this Jira Spike