HLA-based quality control of RNA-seq datasets
The tool extracts HLA types of classes I and II from all files in the folder containing raw RNA-seq data (paired- or single-end). The alleles are then cross-compared between the RNA-seq samples to identify samples that share a common source, based on their HLA types (4-digit resolution).
Dr. Irina Chelysheva, 2019-2023 (c)
Oxford Vaccine Group, Department of Paediatrics, University of Oxford
Contact
$ python RNA2HLA.py -f /raw_RNAseq_data_folder [-r /global_name_of_run] [-p <int>] [-3 <int>] [-c <float>] [-g <int>]
-f
is required for running RNA2HLA. The folder should contain raw RNA-seq samples (single-end, paired-end, or both), in compressed or uncompressed format.
Optional parameters:
-r
to be used as a prefix for all output files
-p
number of parallel search threads for bowtie (default: 6)
-3
trim bases from the low-quality end of each read
-c
confidence level for HLA-typing (default: 0.05)
-g
number of HLA genes to be included for typing (default: 5, may be increased to 6 by adding DQB1)
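The options above can also be assembled programmatically, for example when launching RNA2HLA from a pipeline script. The following is a minimal sketch, not part of RNA2HLA itself: the helper name build_rna2hla_command and the folder and run names in the example are hypothetical, and only the flags documented above are used.

```python
import shlex


def build_rna2hla_command(folder, run_name=None, threads=None, trim3=None,
                          confidence=None, genes=None):
    """Assemble the RNA2HLA command line (hypothetical helper)."""
    cmd = ["python", "RNA2HLA.py", "-f", folder]
    # Optional flags are appended only when a value is given,
    # so the tool's own defaults apply otherwise.
    optional = [("-r", run_name), ("-p", threads), ("-3", trim3),
                ("-c", confidence), ("-g", genes)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Folder and run name are placeholders for illustration only.
    cmd = build_rna2hla_command("/raw_RNAseq_data_folder",
                                run_name="/my_run", threads=8, genes=6)
    print(" ".join(shlex.quote(part) for part in cmd))
    # To actually launch the tool: subprocess.run(cmd, check=True)
```

Returning the command as a list rather than a single string avoids shell quoting problems when folder names contain spaces.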
- RNA2HLA is a Python script (available in two versions: for Python 2 and Python 3 (coming soon)).
- All the dependencies provided within the RNA2HLA repository (Python scripts single_end.py and paired_end.py, function scripts in R and Python, HLA class I and II databases) must be downloaded and placed in the same folder.
- Index files must be downloaded and placed in the subfolder /references.
- The easiest way to run RNA2HLA is to create a conda environment using the provided RNA2HLA_env.yml file:
$ conda env create -f RNA2HLA_env.yml
And activate it:
$ source activate RNA2HLA_env
or
$ conda activate RNA2HLA_env
(depending on the conda version)
Update from 2.04.2021: One user reported an error while trying to create an environment from the original yml file (this error does not appear in most cases). If you experience an error, please use the alternative environment file RNA2HLA_env_alt.yml instead.
Otherwise:
4a) bowtie must be reachable by the command bowtie
(developed with version 1.1.2)
4b) R must be installed.
4c) Packages: biopython (developed with 1.76), numpy (developed with 1.16.6; this version caused an error for one user, so 1.15 is preferable), pandas (developed with 0.24.2)
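When setting up the requirements manually, a quick check can confirm that they are visible from the current environment. The sketch below is an illustration and not part of RNA2HLA: the function names are hypothetical, and it assumes that R is reachable by the command R and that biopython is imported under its usual module name Bio. It avoids shutil.which so that it can run under both Python 2 and Python 3.

```python
import importlib
import os


def find_executable(name):
    """Return the full path of an executable found on PATH, or None."""
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def check_dependencies(executables=("bowtie", "R"),
                       modules=("Bio", "numpy", "pandas")):
    """Map each requirement to True (found) or False (missing)."""
    status = {}
    for exe in executables:
        status[exe] = find_executable(exe) is not None
    for name in modules:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    for requirement, found in sorted(check_dependencies().items()):
        print("%-8s %s" % (requirement, "ok" if found else "MISSING"))
```

The check only reports presence; it does not compare installed versions with those listed above.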
The final output is an overall comparison matrix in csv format, which cross-compares all RNA-seq samples in the given folder.
Individual outputs in txt format are produced for each RNA-seq sample in the folder (classes I and II are written in one file):
- .bowtielog.txt - file with statistics of HLA mapping;
- .ambiguity.txt - reports typing ambiguities (if more than one solution for an allele is possible based on the expression and HLA databases);
- .expression.txt - RPKM expression of HLA;
- .HLAgenotype4digits.txt - 4-digit HLA type.
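To illustrate what a comparison at 4-digit resolution means, the sketch below truncates allele names to their first two fields and counts the alleles two samples share. It is a simplified illustration under stated assumptions, not the scoring method used by RNA2HLA: the function names are hypothetical, the example genotypes are made up, and the genotypes are passed as in-memory dictionaries because the layout of the output files is not described here.

```python
from collections import Counter


def to_four_digit(allele):
    """Truncate an HLA allele name to two fields, e.g. A*02:01:01:01 -> A*02:01."""
    gene, sep, fields = allele.partition("*")
    if not sep:
        return allele  # not in gene*fields form; leave unchanged
    return gene + "*" + ":".join(fields.split(":")[:2])


def shared_allele_fraction(sample_a, sample_b):
    """Fraction of alleles matching at 4-digit resolution.

    Each sample is a dict mapping a gene name to a list of allele names.
    Only genes typed in both samples are compared.
    """
    matched = 0
    total = 0
    for gene in set(sample_a) & set(sample_b):
        a = Counter(to_four_digit(x) for x in sample_a[gene])
        b = Counter(to_four_digit(x) for x in sample_b[gene])
        matched += sum((a & b).values())  # multiset intersection
        total += max(sum(a.values()), sum(b.values()))
    return float(matched) / total if total else 0.0


if __name__ == "__main__":
    # Made-up genotypes for illustration only.
    s1 = {"A": ["A*02:01:01:01", "A*24:02"], "B": ["B*07:02", "B*44:03"]}
    s2 = {"A": ["A*02:01", "A*24:02:01"], "B": ["B*07:02", "B*08:01"]}
    print(shared_allele_fraction(s1, s2))
```

Using a multiset (Counter) rather than a set keeps homozygous genotypes, where the same allele appears twice, from being counted as a single allele.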
Update from 9.03.2023: v1.1 A heatmap can be created from the overall comparison matrix csv file using the R script heatmap_HLA_identity_comparison.R
When studying a particular population that is known in advance to have low HLA allele diversity, RNA2HLA should not be used for QC, but only as a convenient study-wide HLA-typing method. One can refer to the Allele Frequency Net Database and explore the HLA diversity of a particular population through its interactive map. Populations with fewer than 50 known alleles in total should be considered to have low diversity.
1.0: initial tool
Please cite the following publication if you use RNA2HLA in your research: Irina Chelysheva, Andrew J Pollard, Daniel O’Connor, RNA2HLA: HLA-based quality control of RNA-seq datasets, Briefings in Bioinformatics, 2021