Installation

This repository just contains standalone scripts. Make sure to install requirements before running:

pip install -r requirements.txt

Expected format

HiGlass expects the gene annotations file to have the following format:

# 1: chr (chr1)
# 2: txStart (52301201) [9]
# 3: txEnd (52317145) [10]
# 4: geneName (ACVRL1)   [2]
# 5: citationCount (123) [16]
# 6: strand (+)  [8]
# 7: refseqId (NM_000020)
# 8: geneId (94) [1]
# 9: geneType (protein-coding)
# 10: geneDesc (activin A receptor type II-like 1)
# 11: cdsStart (52306258)
# 12: cdsEnd (52314677)
# 13: exonStarts (52301201,52306253,52306882,52307342,52307757,52308222,52309008,52309819,52312768,52314542,)
# 14: exonEnds (52301479,523063

This bed-like format then needs to be aggregated using clodius aggregate bedfile in order to limit the amount of data displayed at once and to enable searching by gene name.
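To make the column layout above concrete, here is a minimal sketch of a parser for one line of the bed-like file. It assumes the columns are tab-separated and appear in the order listed; the function name `parse_hgbed_line` and the sample values are hypothetical and are not part of this repository.

```python
from typing import Dict

# Column names in the order listed above (assumed tab-separated).
HGBED_COLUMNS = [
    "chr", "txStart", "txEnd", "geneName", "citationCount", "strand",
    "refseqId", "geneId", "geneType", "geneDesc", "cdsStart", "cdsEnd",
    "exonStarts", "exonEnds",
]
# Coordinate columns are converted to integers; everything else stays text.
INT_COLUMNS = {"txStart", "txEnd", "cdsStart", "cdsEnd"}
# Exon columns are comma-separated lists, possibly with a trailing comma.
LIST_COLUMNS = {"exonStarts", "exonEnds"}


def parse_hgbed_line(line: str) -> Dict[str, object]:
    """Parse one line of the bed-like format into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(HGBED_COLUMNS):
        raise ValueError(
            f"expected {len(HGBED_COLUMNS)} columns, got {len(fields)}"
        )
    record: Dict[str, object] = {}
    for name, value in zip(HGBED_COLUMNS, fields):
        if name in INT_COLUMNS:
            record[name] = int(value)
        elif name in LIST_COLUMNS:
            # Skip the empty string left behind by a trailing comma.
            record[name] = [int(v) for v in value.split(",") if v]
        else:
            record[name] = value
    return record


# Made-up example line, for illustration only.
example = "\t".join([
    "chr1", "100", "500", "GENE1", "7", "+", "NM_X", "1",
    "protein-coding", "made-up gene", "150", "450", "100,300,", "200,500,",
])
record = parse_hgbed_line(example)
print(record["geneName"], record["exonStarts"], record["exonEnds"])
```

A check like this can be useful for spotting malformed lines before running the aggregation step.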

Example 1: From UCSC GTF file

Download the UCSC gtfToGenePred binary from http://hgdownload.soe.ucsc.edu/admin/exe/
Get the GTF and chromsizes files for an assembly (the -NP . parameters ensure that a file isn't downloaded if it's already present) and convert to genepred format:

wget -NP . https://hgdownload.soe.ucsc.edu/goldenPath/danRer10/bigZips/genes/danRer10.refGene.gtf.gz
wget -NP . https://hgdownload.soe.ucsc.edu/goldenPath/danRer10/bigZips/danRer10.chrom.sizes
gtfToGenePred -genePredExt -geneNameAsName2 danRer10.refGene.gtf.gz danRer10.refGene.genepred

Convert to higlass-compatible format:


cat danRer10.refGene.genepred | python genepredext_to_hgbed.py | python exonU.py - > danRer10.refGene.hgbed
clodius aggregate bedfile --chromsizes-filename danRer10.chrom.sizes danRer10.refGene.hgbed

Use the aggregated output in either HiGlass or Resgen with filetype beddb and datatype gene-annotations.

Example 2: From NCBI GFF

Find the genome information page for sacCer3 at https://www.ncbi.nlm.nih.gov/assembly/GCF_000146045.2/.

Download the gff file by clicking on "Download Assembly" and selecting "Genomic GFF".

Convert to higlass-compatible format using these commands:

gzcat GCF_000146045.2_R64_genomic.gff.gz \
	| python scripts/gff_to_jsonl.py - \
	| python scripts/gjsonl_to_chromsizes.py - > sacCer3.chrom.sizes

gzcat GCF_000146045.2_R64_genomic.gff.gz \
	| python scripts/gff_to_jsonl.py - \
	| python scripts/gjsonl_to_hgbed.py - > sacCer3.hgbed

clodius aggregate bedfile sacCer3.hgbed \
	--delimiter $'\t' \
	--chromsizes-filename sacCer3.chrom.sizes

The sacCer3.chrom.sizes file just contains the names of the chromosomes and their sizes.
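As an illustration of what such a file holds, the sketch below reads a chromsizes file into a dictionary. It assumes the common two-column layout (name, then size, separated by whitespace); `read_chromsizes` and the sample names are hypothetical and not part of this repository.

```python
from typing import Dict, Iterable


def read_chromsizes(lines: Iterable[str]) -> Dict[str, int]:
    """Map chromosome name to size, assuming 'name<whitespace>size' per line."""
    sizes: Dict[str, int] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        name, size = line.split()[:2]
        sizes[name] = int(size)
    return sizes


# Made-up names and sizes, for illustration only.
sizes = read_chromsizes(["chrA\t1000", "chrB\t2500", ""])
print(sizes)
```

To check a real file, pass an open file handle instead of a list, for example `read_chromsizes(open("sacCer3.chrom.sizes"))`.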

View in higlass:

higlass-manage view sacCer3.hgbed.beddb --datatype gene-annotations

Note that this process omits all RNAs and takes the union of all exons in a gene to represent it as if it were just one transcript.
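The exon union mentioned above can be pictured as a standard interval merge. The following is a minimal sketch of that idea, not the code used by the scripts in this repository; `union_intervals` is a hypothetical name, and intervals that overlap or touch are assumed to be merged into one.

```python
from typing import List, Tuple

Interval = Tuple[int, int]


def union_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching (start, end) intervals into their union."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Exons from two made-up transcripts of the same gene.
transcript_a = [(100, 200), (400, 500)]
transcript_b = [(150, 300), (400, 450)]
print(union_intervals(transcript_a + transcript_b))
```

Collapsing transcripts this way gives each gene a single set of exon blocks, at the cost of hiding which exons belong to which transcript.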
