SPiP is a randomForest model running a cascade of bioinformatics tools. Briefly, SPiP uses the SPiCE tool for the consensus splice sites (donor and acceptor sites), MES for the polypyrimidine tract between -13 and -20, BPP for the branch point area between -18 and -44, a homemade score to detect cryptic/de novo splice site activation, and ΔtESRseq for exonic splicing regulatory elements up to 120 nt into the exon.
SPiP is available for Windows OS at https://sourceforge.net/projects/splicing-prediction-pipeline/
Repository contents:
- SPiPv2.1_main.r: the SPiP script
- testCrypt.txt: an example of input data in text format
- testVar.vcf: an example of input data in vcf format
- RefFiles: folder containing the reference files used by SPiP
To get SPiP from this repository, enter the following in a Linux console:
git clone https://github.com/raphaelleman/SPiP
cd ./SPiP
SPiP also requires 3 R libraries. Install them from the R console:
install.packages("foreach")
install.packages("doParallel")
install.packages("randomForest")
You also have to download from SourceForge the RData files containing the transcript sequences:
- hg19 assembly: transcriptome_hg19.RData
- hg38 assembly: transcriptome_hg38.RData
Put these files in /path/to/SPiP/RefFiles/, or set their location manually with the --transcriptome option.
NB: commands to regenerate these files are available in getGenomeSequenceFromBSgenome.r
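Before a first run, it can help to confirm that the transcriptome files are in the folder where SPiP looks for them. The following is a minimal shell sketch, not part of SPiP: the function name check_ref_files is hypothetical, and it only tests for the two file names listed above.

```shell
# Hypothetical helper (not shipped with SPiP): report which transcriptome
# files are present in a RefFiles folder.
check_ref_files() {
    ref_dir="$1"
    missing=0
    for assembly in hg19 hg38; do
        f="$ref_dir/transcriptome_${assembly}.RData"
        if [ -f "$f" ]; then
            echo "found: $f"
        else
            echo "missing: $f"
            missing=$((missing + 1))
        fi
    done
    # Exit status is 0 only when both files were found.
    [ "$missing" -eq 0 ]
}

# Example call (adapt the path to your installation):
# check_ref_files /path/to/SPiP/RefFiles
```

If a file is stored elsewhere, pass its location to SPiP with --transcriptome instead.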
To list the arguments of SPiP, run Rscript /path/to/SPiPv2.1_main.r --help
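For text input, SPiP expects a tab-delimited file in which the column named varID holds the variants as Transcript:mutation. The sketch below writes such a file from the shell. It assumes that a single varID column is sufficient; the helper write_spip_input and the file name myVariants.txt are hypothetical, and the variant is the one used as an example in this README.

```shell
# Hypothetical helper: write a minimal SPiP text input, i.e. a 'varID'
# header followed by one Transcript:mutation per line.
write_spip_input() {
    out="$1"
    shift
    printf 'varID\n' > "$out"
    for variant in "$@"; do
        printf '%s\n' "$variant" >> "$out"
    done
}

# Quote each variant: '>' would otherwise be read as a shell redirection.
write_spip_input ./myVariants.txt 'NM_007294:c.213-6T>G'
```

The resulting file can then be passed to SPiP with -I ./myVariants.txt.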
An example of SPiP run with test file testCrypt.txt:
cd /path/to/SPiP/
Rscript ./SPiPv2.1_main.r -I ./testCrypt.txt -O ./outputTest.txt
In this example, SPiP generates a text file "outputTest.txt" in which the predictions are saved. The layout of this output is:
Column names | Example | Description |
---|---|---|
varID | NM_007294:c.213-6T>G | The variant id (Transcript:mutation) |
Interpretation | Alter by SPiCE | The overall prediction |
InterConfident | 92.9 % +/- 2.1 % | The risk that the variant impacts splicing, estimated from a collection of variants with in vitro RNA studies and frequent variants |
chr | chr17 | Chromosome number |
strand | - | Strand of the junction ('+': forward; '-':reverse) |
varType | substitution | Type of variant |
ntChange | T>G | Nucleotides variation |
ExonInfo | Intron 4 (1499) | Number and size of Exon/Intron |
transcript | NM_007294 | Transcript (RefSeq) |
gene | BRCA1 | Gene symbol (RefSeq) |
gNomen | 41256979 | Genomic position of variant |
seqPhysio | ACGG...AGGA | (A, C, G, T)-sequence before the mutation |
seqMutated | ACGG...AGGA | (A, C, G, T)-sequence after the mutation |
NearestSS | acceptor | The nearest natural splice site to the variant |
distSS | -6 | Distance between the nearest splice site and the mutation |
RegType | IntronCons | The type of region in which the variant is located |
SPiCEproba | 1 | The SPiCE probability for variant in consensus splice site |
SPiCEinter_2thr | high | The SPiCE classes (high/medium/low) |
deltaMES | 0 | MES variation for variant in the polypyrimidine tract |
mutInPBarea | No | Whether the mutation is located in a branch point predicted by the BPP tool |
deltaESRscore | NA | ESR score variation for exonic variant |
posCryptMut | 41256978 | Genomic position of cryptic splice site after the mutation |
sstypeCryptMut | Acc | Splice type of cryptic splice site after the mutation |
probaCryptMut | 0.000710404942432828 | Score of cryptic splice site after the mutation |
classProbaCryptMut | No | Prediction of cryptic splice site after the mutation (Yes: used, No: Not used) |
nearestSStoCrypt | Acc | Splice type of the nearest natural splice site |
nearestPosSStoCrypt | 41256973 | Genomic position of the nearest natural splice site |
nearestDistSStoCrypt | -5 | Distance between the cryptic site and the natural site |
posCryptWT | 41256970 | Genomic position of cryptic splice site before the mutation |
probaCryptWT | 4.89918764104143e-07 | Score of cryptic splice site before the mutation |
classProbaCryptWT | No | Prediction of cryptic splice site before the mutation (Yes: used, No: Not used) |
posSSPhysio | 41256973 | Genomic position of the natural splice site of the same type as the mutated cryptic site |
probaSSPhysio | 0.00408919066993282 | Score of the natural splice site of the same type as the mutated cryptic site |
classProbaSSPhysio | Yes | Prediction of the natural splice site of the same type as the mutated cryptic site (Yes: used, No: Not used) |
probaSSPhysioMut | 1.74991364794327e-06 | Score of the natural splice site of the same type as the mutated cryptic site, after the mutation |
classProbaSSPhysioMut | No | Prediction of the natural splice site of the same type as the mutated cryptic site, after the mutation (Yes: used, No: Not used) |
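With this many columns, it is often convenient to pull out only a few of them by name. The sketch below does this with awk. It assumes the output file is tab-delimited with a header line carrying the column names listed above; the function name select_columns is hypothetical.

```shell
# Hypothetical helper: print selected columns of a tab-delimited file,
# looked up by header name rather than by position.
select_columns() {
    file="$1"
    cols="$2"   # comma-separated column names, e.g. "varID,Interpretation"
    awk -F '\t' -v cols="$cols" '
        BEGIN { OFS = "\t"; n = split(cols, want, ",") }
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) idx[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(want[j] in idx)) {
                    print "unknown column: " want[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            # Emit the requested columns in the requested order.
            line = $(idx[want[1]])
            for (j = 2; j <= n; j++) line = line OFS $(idx[want[j]])
            print line
        }
    ' "$file"
}

# Example call on the output of the run above:
# select_columns ./outputTest.txt "varID,Interpretation,InterConfident"
```

Looking columns up by name keeps the command valid even if the column order differs between SPiP versions.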
-I, --input /path/to/inputFile
- File listing the variants (.txt or .vcf). SPiP supports VCF version 4.1 or later (see example testVar.vcf). The txt file must be tab-delimited, and the column holding the mutations, in the format Transcript:mutation, must be named 'varID' (see example testCrypt.txt).
-O, --output /path/to/outputFile
- Path to the output file (.txt, in text format)
-g, --GenomeAssenbly hg19
- Genome assembly version (hg19 or hg38) [default= hg19]
-t, --threads N
- Number of threads used for the calculation [default= 1]
-l, --maxLines N
- Number of lines read at a time [default= 1000]
--verbose
- Show the run progress, i.e. display a progress bar
--geneList /path/to/geneList.txt
- Restrict the analysis to a list of genes (available only with VCF input)
--transcriptList /path/to/transcriptList.txt
- Restrict the analysis to a list of transcripts (available only with VCF input)
--transcriptome /path/to/transcriptome_hgXX.RData
- Location of the file transcriptome_hgXX.RData, if it is not in /path/to/SPiP/RefFiles/
--VCF
- Get the SPiP output in VCF format (v4.0)
# dynamic line modified in script : paste0("##SPiP output v",version)
# dynamic line modified in script : paste0("##SPiPCommand=",CMD)
## SPiP=altUsed|varID|Interpretation|InterConfident|SPiPscore|strand|gNomen|varType|ntChange|ExonInfo|exonSize|transcript|gene|NearestSS|DistSS|RegType|SPiCEproba|SPiCEinter_2thr|deltaMES|BP|mutInPBarea|deltaESRscore|posCryptMut|sstypeCryptMut|probaCryptMut|classProbaCryptMut|nearestSStoCrypt|nearestPosSStoCrypt|nearestDistSStoCrypt|posCryptWT|probaCryptWT|classProbaCryptWT|posSSPhysio|probaSSPhysio|classProbaSSPhysio|probaSSPhysioMut|classProbaSSPhysioMut
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 15765825 NM_007272:g.15765825:G>A G A . . SPiP=A|NTR|00.04 % [00.02 % ; 00.08%]|+|substitution|G>A|Intron 1 (1795)|NM_007272|CTRC|donor|825|DeepIntron|0|Outside SPiCE Interpretation|0|No|NA|15765816|Acc|0.00206159394907144|No|Don|15765000|816|15765816|0.00161527498798199|No|15766795|0.0775463330795674|Yes|0.0775463330795674|Yes
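In VCF output, all SPiP results for a variant are packed into one '|'-separated SPiP= annotation in the INFO column. The sketch below splits that annotation into numbered values. It assumes a tab-delimited data line with INFO in column 8 and SPiP= as the last key in INFO, since the value itself can contain ';' and spaces. The function name spip_fields is hypothetical, and the sample line is a shortened copy of the record above.

```shell
# Hypothetical helper: print the '|'-separated SPiP annotation values of
# one VCF data line, one numbered value per line.
# Assumption: SPiP= is the last key in the INFO column (column 8).
spip_fields() {
    printf '%s\n' "$1" | awk -F '\t' '
        {
            info = $8
            start = index(info, "SPiP=")
            if (start == 0) exit 1          # no SPiP annotation on this line
            payload = substr(info, start + 5)
            n = split(payload, values, "[|]")
            for (i = 1; i <= n; i++) printf "%d\t%s\n", i, values[i]
        }
    '
}

# Shortened copy of the example record (only the first five values are kept).
line=$(printf 'chr1\t15765825\tNM_007272:g.15765825:G>A\tG\tA\t.\t.\tSPiP=A|NTR|00.04 %% [00.02 %% ; 00.08%%]|+|substitution')
spip_fields "$line"
```

The sketch numbers the values rather than naming them. Before mapping positions to names, check the number of values against the ## SPiP= header line of your own file.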
- Raphael Leman - raphaelleman
- You can contact me at: [email protected] or [email protected]
Cite as: Leman, R., Parfait, B., Vidaud, D., Girodon, E., Pacot, L., Le Gac, G., Ka, C., Ferec, C., Fichou, Y., Quesnelle, C., Aucouturier, C., Muller, E., Vaur, D., Castera, L., Boulouard, F., Ricou, A., Tubeuf, H., Soukarieh, O., Gaildrat, P., Krieger, S. (2022). SPiP: Splicing Prediction Pipeline, a machine learning tool for massive detection of exonic and intronic variant effects on mRNA splicing. Human Mutation.
This project is licensed under the MIT License - see the LICENSE file for details