The majority of data sources are either provided in this repository or automatically downloaded via scripts. However, some need to be manually obtained and saved into the folder structure prior to executing the analysis pipeline.
To download most of the data sources, simply execute:
./download.sh
and (note: requires Python 3):
python3 download_psicquic.py
The following data sources need to be manually obtained, since they are not (yet) publicly available:
- Dana Farber CCSB HI-2012 PPI network (see below)
If you don't want to sign up to get this data, you can instead download one of the older Human Interactomes from their website and save it as HI_2012_PRE.tsv in the ppi/ folder. Note that the results of the analysis pipeline will differ for this network, since it uses different data. All other results should remain identical. The pipeline will not execute if this file is not available.
This data comes from the supplementary section of the paper "Tissue specificity and the human protein interaction network". It can be downloaded automatically via the download_data.sh script.
The Human Interactome 2012 is still in preliminary form, so you have to sign up at the CCSB to download the Human Interactome database here.
Download the HI_2012_PRE.tsv file and save it into the ppis/ folder.
This is the protein complex network from the paper "A Census of Human Soluble Protein Complexes" by Havugimana et al.
Download the network from the supplemental information Table S2 here.
The protein-protein interactions are in the Excel sheet "14K Denoised PPI". Copy the first two columns into a tab-separated text file with headers "Gene1" and "Gene2".
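The copy step above can also be scripted. The following is a minimal sketch, not part of the repository: the function name write_ppi_tsv is hypothetical, and it assumes the sheet's rows have already been read into Python (for example from a CSV export of the "14K Denoised PPI" sheet) and that the first row is the sheet's own header.

```python
import csv


def write_ppi_tsv(rows, out_path, has_header=True):
    """Write the first two columns of `rows` as a tab-separated file.

    `rows` is any iterable of sequences, e.g. rows read from the sheet.
    Assumption: if `has_header` is True, the first row is the sheet's own
    header and is replaced by "Gene1" / "Gene2".
    Returns the number of interaction rows written.
    """
    rows = iter(rows)
    if has_header:
        next(rows, None)  # drop the original header row
    written = 0
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["Gene1", "Gene2"])
        for row in rows:
            if len(row) < 2 or not row[0] or not row[1]:
                continue  # skip empty or incomplete rows
            writer.writerow([str(row[0]).strip(), str(row[1]).strip()])
            written += 1
    return written
```

Any further columns in the sheet are ignored, so the rows can be passed in unchanged.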
Right now this network is already part of the repository. This might complicate giving free access to the repo, as I am currently unsure about the licensing of the PPI network from Havugimana et al.
The protein interaction data can be loaded from string-db automatically with the download_data.sh shell script.
If you have already downloaded the string-db file, or need to import it manually for other reasons, do the following:
Go to the Download section of string-db and download the protein.links.vX.XX.txt.gz file. This file contains all protein-protein interactions using Ensembl IDs and a reliability score per interaction.
Direct download link (for version 9.05): here
Save this file into the data/download folder, and then run the download_data.sh script, which will pre-filter the string-db interactions for human interactions with score >= 0.7.
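For illustration, the pre-filtering step could look roughly like the sketch below. This is not the code in download_data.sh. It assumes the usual layout of the protein.links file: whitespace-separated columns protein1, protein2 and combined_score, identifiers prefixed with the NCBI taxonomy ID (9606 for human), and scores stored as integers from 0 to 1000, so that 0.7 corresponds to 700. The function names are hypothetical.

```python
import gzip

HUMAN_PREFIX = "9606."  # NCBI taxonomy ID for Homo sapiens, used as ID prefix
MIN_SCORE = 700         # assumption: scores are integers 0-1000, so 0.7 -> 700


def filter_human_links(lines, min_score=MIN_SCORE):
    """Yield (protein1, protein2, score) for high-confidence human interactions."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # skips the header line and malformed rows
        p1, p2, score = parts[0], parts[1], int(parts[2])
        if (p1.startswith(HUMAN_PREFIX) and p2.startswith(HUMAN_PREFIX)
                and score >= min_score):
            yield p1, p2, score


def filter_file(in_path, out_path):
    """Stream a gzipped links file and write the filtered rows tab-separated."""
    with gzip.open(in_path, "rt") as src, open(out_path, "w") as dst:
        for p1, p2, score in filter_human_links(src):
            dst.write(f"{p1}\t{p2}\t{score}\n")
```

Streaming line by line keeps memory use flat, which matters because the unfiltered file covers all organisms.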
Run the download_psicquic.py Python script in this folder to download the different PPI networks provided via PSICQUIC.
These are all downloaded and unpacked automatically via the download_data.sh script. The following just describes the data sources.
From the Human Protein Atlas version 11, the normal_tissue.csv and subcellular_location.csv files are used.
Manual download here
This is RNA micro-array data from Su Al et al. 2009. It can be downloaded from BioGPS here.
The files gnf1h-gcrma.zip and gnf1h-anntable.zip are needed: the first contains the actual expression data, while the second holds the annotation for the genes.
The RNAseq data from Illumina Body Map can be downloaded from EBI here.
The automatic download script applies no filters: there is no gene selection (in particular, the data is not restricted to protein-coding genes) and the expression cutoff is 0.
This data is also publicly available and can be downloaded from here. It is also downloaded automatically by the download.sh script.
Go to: ensembl.org
Choose "Ensembl Genes 71" (or current version) and table "Homo sapiens genes"
Include the following fields in the table:
- Ensembl Gene ID
- Ensembl Protein ID
- Associated Gene Name
- UniProt/SwissProt ID
- HGNC ID(s)
- EntrezGene ID
Export the table as CSV (choosing "Unique results only") and save it as mapping/mart_export.csv.
Currently BioMart version 71 is provided in the repository.
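As a rough illustration of how the exported table can be consumed, the sketch below builds a gene-to-protein lookup from the CSV. It is not taken from the pipeline. It assumes that the CSV header names match the field names listed above ("Ensembl Gene ID", "Ensembl Protein ID"); the function name load_gene_to_proteins is hypothetical.

```python
import csv
from collections import defaultdict


def load_gene_to_proteins(handle,
                          gene_col="Ensembl Gene ID",
                          protein_col="Ensembl Protein ID"):
    """Map each Ensembl gene ID to the set of its Ensembl protein IDs.

    `handle` is an open text file (or any iterable of CSV lines).
    Assumption: column headers equal the BioMart field names.
    """
    mapping = defaultdict(set)
    for row in csv.DictReader(handle):
        gene = (row.get(gene_col) or "").strip()
        protein = (row.get(protein_col) or "").strip()
        if gene and protein:  # rows without a protein ID are skipped
            mapping[gene].add(protein)
    return dict(mapping)
```

A typical call would be `with open("mapping/mart_export.csv", newline="") as fh: genes = load_gene_to_proteins(fh)`. The same pattern works for the other ID columns by passing different column names.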
Get data from: genenames.org
Go to Locus Group: "protein-coding gene" and click "Custom".
Choose only the columns:
- HGNC ID
- Approved Symbol
- Approved Name
- Status
- Entrez Gene ID
- Ensembl Gene ID (and from external sources)
- Entrez Gene ID (supplied by NCBI)
- UniProt ID (supplied by UniProt)
- Ensembl ID (supplied by Ensembl)
Make sure to deselect (exclude) the status: "Entry and Symbol Withdrawn"
Full URL to results: HGNC Mapping
This should return all 19060 genes.
Save this file as mapping/hgnc_downloads.txt.
A version of this file is provided in the repository (not guaranteed to be up to date).
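To check a fresh download against the gene count quoted above, a small sanity check along these lines may help. It is a sketch under assumptions: the export is taken to be tab-separated with a header row containing an "HGNC ID" column, and the function name count_hgnc_entries is hypothetical. A newer HGNC release will not match 19060 exactly.

```python
import csv

EXPECTED_GENES = 19060  # count quoted in the text; newer releases will differ


def count_hgnc_entries(handle, id_col="HGNC ID"):
    """Return (rows, unique_ids) for a tab-separated HGNC export.

    Assumption: the file has a header row with an `id_col` column.
    Rows with an empty ID are not counted.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    ids = [row[id_col].strip() for row in reader
           if (row.get(id_col) or "").strip()]
    return len(ids), len(set(ids))
```

If the two numbers differ, the export contains duplicate entries; if the row count is far from EXPECTED_GENES, the locus group or status filter was probably set differently.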