WGSA is an annotation pipeline designed for human genome re-sequencing studies. It integrates various annotation resources and bioinformatics tools to provide comprehensive annotations for both single nucleotide variants (SNVs) and insertions/deletions (indels).
- annovar20200608: ANNOVAR installation directory
- configs: Configuration files for running WGSA
- htslib: HTSlib installation directory
- input: Input files (e.g., input/dbSNP/10.vcf)
- res: Result folder
- resources: Annotation resources (very large)
- scripts: Scaffold scripts and configuration file templates
- slurm: Automatically generated SLURM scripts
- snpeff: SnpEff installation directory
- tmp: Temporary files
- vep: VEP installation directory
- work: Work directory
- WGSA08.class: WGSA class file
Create the directory structure:
bash wgsa_095_pipeline/create_dir.sh wgsa_095
cd wgsa_095
Download WGSA.class and the annotation resources from the WGSA page.
Install ANNOVAR, VEP, and SnpEff. See the provided URL for instructions.
Create a new directory in the input folder and place all input VCF files in it.
Modify run_work.sh (e.g., run_work.dbSNP.sh).
- config.py: Generates specific configurations for each VCF file using a template (config.temp).
- sbatch.py: Creates SLURM batch files for each VCF file, based on the sbatch.temp template.
- sbatch.temp: A template for an HPC SLURM batch file, specifying resources and running the WGSA pipeline.
These scripts are integrated into the run_work.sh workflow and can be customized as needed.
config.py: This Python script reads a template file (config.temp) and takes five command-line arguments to fill the template with specific paths for the base, input, output, work, and temporary directories. The formatted template is printed to standard output.
sbatch.py: This Python script reads a template file (sbatch.temp) and takes five command-line arguments to fill the template with specific paths for the base directory, configuration file, configuration directory, SLURM output log, and SLURM error log. The formatted template is printed to standard output.
sbatch.temp: A template for an HPC SLURM batch file. It specifies the resources required for the job (e.g., number of tasks, time, memory) and includes placeholders for various paths and configurations. The generated script loads the Java Development Kit (JDK) module, runs the WGSA pipeline with specific parameters, and executes a bash script to redirect the output and error logs.
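The shared template-formatting pattern can be sketched as below, assuming str.format-style placeholders in config.temp; the placeholder names are illustrative, not the actual ones used by the scripts:

```python
# Sketch of config.py: fill config.temp with five directory paths and
# print the result to stdout (placeholder names are assumptions).
import sys

def main():
    base_dir, input_dir, output_dir, work_dir, tmp_dir = sys.argv[1:6]

    with open("config.temp") as fh:
        template = fh.read()

    # run_work.sh redirects this stdout into a per-VCF config file.
    print(template.format(
        base_dir=base_dir,
        input_dir=input_dir,
        output_dir=output_dir,
        work_dir=work_dir,
        tmp_dir=tmp_dir,
    ))

if __name__ == "__main__":
    main()
```

sbatch.py follows the same pattern with sbatch.temp and its own five paths (base directory, config file, config directory, SLURM output log, SLURM error log).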
The run_work.sh script utilizes these files as part of the workflow:
- It uses config.py to generate specific configurations for each VCF file.
- It uses sbatch.py to create SLURM batch files for each VCF file, based on the sbatch.temp template.
- It submits the SLURM batch files to the HPC cluster for processing.
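For orientation, here is a hedged Python rendering of that per-VCF loop; run_work.sh itself is a bash script, and every path, filename, and argument order below is an assumption to verify against the real script:

```python
# Illustrative rendering of the run_work.sh loop (all paths assumed).
import glob
import os
import subprocess

work_name = "HRC_2023"  # first argument to run_work.sh
base_dir = "/scratch2/username/annoq_data_builder/wgsa_095"  # second argument

for vcf in glob.glob(os.path.join(base_dir, "input", work_name, "*.vcf")):
    stem = os.path.splitext(os.path.basename(vcf))[0]
    config_path = os.path.join(base_dir, "configs", stem + ".config")
    slurm_path = os.path.join(base_dir, "slurm", stem + ".sbatch")

    # Per-VCF WGSA config from config.temp (base, input, output, work, tmp).
    with open(config_path, "w") as fh:
        subprocess.run(
            ["python3", "config.py", base_dir, vcf,
             os.path.join(base_dir, "res"),
             os.path.join(base_dir, "work"),
             os.path.join(base_dir, "tmp")],
            stdout=fh, check=True)

    # Per-VCF SLURM batch file from sbatch.temp, then submission.
    with open(slurm_path, "w") as fh:
        subprocess.run(
            ["python3", "sbatch.py", base_dir, config_path,
             os.path.join(base_dir, "configs"),
             slurm_path + ".out", slurm_path + ".err"],
            stdout=fh, check=True)
    subprocess.run(["sbatch", slurm_path], check=True)
```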
!important: Make sure you know what the script does and change the placeholder values before running.
bash run_work.sh [work_name] [base_wgsa_dir]
where work_name might be HRC_2023 and base_wgsa_dir is the absolute path to the WGSA base directory created above, e.g. /scratch2/username/annoq_data_builder/wgsa_095.
Results will be generated in the res folder. Temporary files will be placed in the tmp and work directories. SLURM scripts can be found in the slurm directory (see scripts/sbatch.temp).
The Java module in /java_wgsa_add can be used to add the PANTHER and Enhancer annotations. It requires the annotation file generated via the PANTHER API, which can be produced as follows:
- cd annoq-data-builder
- Set up the environment as follows:
  * python3 -m venv env
  * . env/bin/activate
  * pip3 install -r requirements.txt
- python3 tools/api_extractor/panther_gene_extractor.py --output panther_annot.json
- Copy panther_annot.json to the location specified in ./annoq-data-builder/java_wgsa_add/add_panther_enhancer/src/main/resources/add_panther_enhancer.properties, or modify that property to point to the file's location.
This part contains scripts for data preparation and processing, a crucial step before adding annotations. The scripts automate various tasks to clean and organize data for efficient annotation.
The tools/scripts/run_pre_work.sh script runs several Python scripts, each handling a specific part of data preparation. It streamlines the process, ensuring data is correctly formatted and optimized for subsequent AnnoQ tasks.
- Extracting Terms from Panther Data: This step extracts terms from the panther_data.json file to create a map of IDs and labels. This map is essential for future term label lookups on the site.
- Removing Labels from Panther Data: Labels are not required in the Elasticsearch index, so this step removes them from the panther_data.json file, resulting in a cleaner dataset for indexing.
- Creating an Interval Tree: An interval tree is generated for quick searches and efficient data retrieval, using the Homo_sapiens.GRCh38.pep.all.fa and UP000005640_9606.idmapping files.
- Creating an Enhancer Map: The final step creates an enhancer map using the label-free panther_data.json file. This map facilitates faster data annotation processes.
The panther_data.json file initially contains columns for both labels and IDs. For example:
{
"cols": [
"GO_molecular_function_complete_list",
"GO_molecular_function_complete_list_id",
// other columns...
],
"data": {
"HGNC:2602": [
"label1|label2|label3",
"GO:0008403|GO:0020037|GO:0030342",
// other data...
]
}
}
The script processes this file to remove label columns, leaving only the ID columns:
{
"cols": [
"GO_molecular_function_complete_list_id",
// other columns...
],
"data": {
"HGNC:2602": [
"GO:0008403|GO:0020037|GO:0030342",
// other data...
]
}
}
The extracted terms are then formatted as simple key-value pairs:
{
"GO:0010070": {
"id": "GO:0010070",
"label": "zygote asymmetric cell division"
},
// other terms...
}
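Under this layout, the term-extraction and label-removal steps can be sketched as follows; the assumption that each "_id" column immediately follows its label column is mine, taken from the example above:

```python
# Sketch of term extraction and label removal, assuming the
# panther_data.json layout shown above and that each "<name>_id"
# column immediately follows its "<name>" label column.
import json

with open("panther_data.json") as fh:
    panther = json.load(fh)

cols = panther["cols"]
id_idx = [i for i, c in enumerate(cols) if c.endswith("_id")]

# 1. Term map for label lookups on the site: pair up the pipe-separated
#    IDs with the pipe-separated labels in the preceding column.
terms = {}
for i in id_idx:
    for row in panther["data"].values():
        ids = (row[i] or "").split("|")
        labels = (row[i - 1] or "").split("|")
        for term_id, label in zip(ids, labels):
            if term_id:
                terms[term_id] = {"id": term_id, "label": label}

# 2. Label-free copy for the Elasticsearch index: keep only ID columns.
stripped = {
    "cols": [cols[i] for i in id_idx],
    "data": {gene: [row[i] for i in id_idx]
             for gene, row in panther["data"].items()},
}

with open("terms.json", "w") as fh:
    json.dump(terms, fh, indent=2)
with open("panther_data.noLabels.json", "w") as fh:
    json.dump(stripped, fh, indent=2)
```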
The enhancer map is an expanded file, formatted for quick reference and efficient processing:
[
{
"chrNum": "chr21",
// other enhancer data...
"data": {
"GO_molecular_function_complete_list_id": ["GO:0004620", /* more IDs */],
// other data...
}
},
// other entries...
]
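The expansion itself might look like the sketch below. The enhancers.json input and its targetGene field are hypothetical stand-ins; the actual enhancer source consumed by run_pre_work.sh is not shown in this document:

```python
# Hypothetical sketch: attach label-free PANTHER annotations to enhancer
# regions via a target-gene ID. "enhancers.json" and "targetGene" are
# invented names for illustration only.
import json

with open("panther_data.noLabels.json") as fh:
    panther = json.load(fh)
with open("enhancers.json") as fh:  # hypothetical input
    enhancers = json.load(fh)

cols = panther["cols"]
enhancer_map = []
for enh in enhancers:
    row = panther["data"].get(enh.get("targetGene"))  # e.g. "HGNC:2602"
    if row is None:
        continue
    entry = dict(enh)  # keep chrNum and the other enhancer fields
    # Split the pipe-separated ID strings into lists, matching the
    # expanded format shown above.
    entry["data"] = {col: (val or "").split("|")
                     for col, val in zip(cols, row)}
    enhancer_map.append(entry)

with open("enhancer_map.json", "w") as fh:
    json.dump(enhancer_map, fh)
```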
Usage: Run the script with the command below from the repository's root directory:
bash tools/scripts/run_pre_work.sh ./../annoq_data
Ensure the annoq_data directory is correctly located relative to the script's path.
pip install -r requirements.txt
cd tools
Take a look at scripts/run_decode_pickle.sh and add files accordingly outside the repo.
Then run
sh scripts/run_decode_pickle.sh
The coord_to_intervaltree.py script contains wrapper classes for IntervalTree and Interval objects: PantherIntervalTree and PantherInterval. You can quickly extract TSV- and JSON-formatted coordinates given a human peptide FASTA file (--pep_fasta) and a Reference Proteome human ID mapping file (--idmapping):
wget ftp://ftp.ensembl.org/pub/release-80/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Eukaryota/UP000005640/UP000005640_9606.idmapping.gz
gunzip Homo_sapiens.GRCh38.pep.all.fa.gz
gunzip UP000005640_9606.idmapping.gz
python3 tools/coord_to_intervaltree.py -p Homo_sapiens.GRCh38.pep.all.fa -i UP000005640_9606.idmapping > parsed_coords.tsv
python3 tools/coord_to_intervaltree.py -p Homo_sapiens.GRCh38.pep.all.fa -i UP000005640_9606.idmapping --json > parsed_coords.json
TSV:
ENSG00000228985 14 22449113 22449125 1
ENSG00000223997 14 22438547 22438554 1 HGNC:12254
ENSG00000282253 CHR_HSCHR7_2_CTG6 142847306 142847317 1 HGNC:12158
JSON:
[
[
"ENSG00000228985",
["14", 22449113, 22449125, "1"]
],
[
"ENSG00000223997",
["14", 22438547, 22438554, "1"],
"HGNC:12254"
],
[
"ENSG00000282253",
["CHR_HSCHR7_2_CTG6", 142847306, 142847317, "1"],
"HGNC:12158"
]
]
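As a usage sketch, the JSON output can be loaded into per-chromosome interval trees with the PyPI intervaltree package; the real PantherIntervalTree/PantherInterval wrappers may differ:

```python
# Load parsed_coords.json into per-chromosome interval trees for
# coordinate -> gene lookups (pip install intervaltree).
import json
from collections import defaultdict

from intervaltree import IntervalTree

with open("parsed_coords.json") as fh:
    records = json.load(fh)

trees = defaultdict(IntervalTree)
for record in records:
    ensembl_id = record[0]
    chrom, start, end, strand = record[1]
    hgnc_id = record[2] if len(record) > 2 else None
    # intervaltree intervals are half-open, hence end + 1 for
    # inclusive genomic coordinates.
    trees[chrom][start:end + 1] = (ensembl_id, hgnc_id, strand)

# Which genes overlap chr14:22438550?
for hit in trees["14"][22438550]:
    print(hit.begin, hit.end, hit.data)
```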
- Update annoq-site/metadata/annotation_tree.csv to reflect any metadata changes.
- Set up the environment as follows:
  python3 -m venv env
  . env/bin/activate
  pip3 install -r requirements.txt
python3 -m tools.annotation_tree_gen --input_csv /path/to/annoq-site/metadata/annotation_tree.csv --output_csv /do/not/use/annotation_tree_output.csv --output_json /path/to/annoq-api/data/anno_tree.json --mappings_json /path/to/annoq-database/metadata/annoq_mappings.json --api_mappings_json /path/to/annoq-api-v2/data/api_mapping_anno_tree.json
- Copy anno_tree.json into /annoq-api/data/anno_tree.json
- Copy anno_tree.json into /annoq-api-v2/data/anno_tree.json
- Copy annoq_mappings.json into annoq-database/metadata/annoq_mappings.json
DO NOT overwrite file annoq-site/metadata/annotation_tree.csv with /do/not/use/annotation_tree_output.csv since some fields may get lost
python3 /path/to/annoq-data-builder/tools/mappings_data_type_gen.py --input /path/to/annoq-site/metadata/annotation_tree.csv --output /annoq-database/data/doc_type.pkl --anno_tree /do/not/use/anno_tree.json -d ,
Copy doc_type.pkl into /annoq-database/data/doc_type.pkl. DO NOT overwrite /annoq-api/data/anno_tree.json with /do/not/use/anno_tree.json since some fields are not generated by this script.
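After copying, a quick sanity check can confirm the pickle loads; the structure shown below is an assumption based on the generator's name:

```python
# Optional sanity check: peek at the generated doc_type.pkl. Its exact
# structure is an assumption (presumably a field -> data-type mapping).
import pickle

with open("/annoq-database/data/doc_type.pkl", "rb") as fh:
    doc_type = pickle.load(fh)

print(type(doc_type))
if isinstance(doc_type, dict):
    for field, dtype in list(doc_type.items())[:5]:
        print(field, dtype)
```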