Skip to content

christopherreeder/germ

Repository files navigation

Overview

GERM is an approach to identifying conditional occupancy events for a protein from ChIA-PET data.

How To Run

Applying GERM to ChIA-PET data requires several steps. GERM is implemented to run on a cluster of machines through the Sun Grid Engine queuing system. It is assumed that ChIA-PET sequence data have been appropriately processed to remove chimeric read pairs and that the linker sequences have been removed from the remaining sequence data. The remaining non-chimeric genomic sequence read pairs should be aligned to the appropriate reference genome. Reads from each pair should be aligned independently because no assumptions should be made about the locations of the reads in a pair relative to each other. It is assumed that input files reflect all read pairs such that both reads in each pair align to a unique location in the reference genome. Input files are expected to be tab-delimited with a pair of genomic locations on each line corresponding to the aligned locations of a read pair. For example:

11:22793448:+ 13:56522051:- 2:75705251:+ 5:53998331:- 11:106929428:- 11:106929538:+ 15:99392393:- 15:99393022:+ 12:104434000:- 3:96247693:-

Running all of the following commands with germ.jar on the classpath should include all necessary dependencies. In order to estimate the read spread function (RSF), -+ read pairs must be extracted:

edu.mit.csail.cgs.reeder.germ.FilterSameChromInter --species "Mus musculus;mm9" --infile --outfile <-+ read pair file> --distcutoff 80 --self

qsub commands must be generated for a preparatory step that must be run prior to running blind deconvolution. This requires a file specifying genomic regions with potentially interesting patterns of enrichment. For example, regions containing actively transcribed genes. This must be a tab-delimited file in which the first column contains genomic regions. Another required file contains the likelihood that a -+ read pair was generated by self-ligation given the distance it spans. This file can be computed as described in the paper:

edu.mit.csail.cgs.reeder.germ.SubmitPrep --submitfile --genomefile "mm10_1.txt" --targetsize 50000 --h 500 --binsize 10 --readfile <-+ read pair file> --regionfile --selfliglikfile --storagesize <# lines in read pair file> --clenoutbasebase "marginal_test_target_clen" --foutbasebase "marginal_test_target_f" --coutbasebase "marginal_test_target_c" --outbase "marginal_test_target_" --wd

Blind deconvolution can now be run using the prepared files:

edu.mit.csail.cgs.reeder.parzen.SubmitBlindDeconvolve --submitfile --minnum 0 --maxnum 80 --numouterinters 10 --numinnerinters 20 --finbase "marginal_test_target_f" --cinbase "marginal_test_target_c" --clenbase "marginal_test_target_clen" --numc 1 --goutbase "marginal_test_target_gout" --foutbase "marginal_test_target_fout" --coutbase "marginal_test_target_cout" --outbase "marginal_test_target_bd_" --wd --postname "_0.txt"

The individual blur functions can now be combined to obtain a high confidence estimate of the RSF:

edu.mit.csail.cgs.reeder.parzen.CombineBlurFunctions --minfile 0 --maxfile 80 --trim 50 --gbase "marginal_test_target_gout" --outfile "marginal_test_target_goutmaster.txt"

We also need to estimate the marginal distribution:

edu.mit.csail.cgs.reeder.germ.ApproxDeconvolveFull --genomefile "mm9_1.txt" --h 500 --readfile <-+ read pair file> --selfliglikfile --chromfile "mm9_1.txt" --storagesize <# lines in read pair file> --band 10 --binsize 10 --foutfile "marginal.wig"

The resulting wig file needs to be converted to a bigwig file using the command line tools provided by UCSC.

In order to estimate conditional probabilities of occupancy, non-self-ligation read pairs need to be extracted:

FilterSameChromInter --species "Mus musculus;mm10" --infile --outfile --distcutoff 80

To obtain an estimated distribution of single end reads relative to occupied locations, marginalize the estimated RSF.

We now have the necessary components for estimating the conditional probability of occupancy given a set of TSSs or other points. The interthresh flags specify thresholds used to delimit regions to be considered as members of potential conditional occupancy events. In this example, a range of thresholds are considered from -17 to -13, with more positive thresholds being more stringent:

SubmitTSSSpec --submitfile --genomefile "mm9_1.txt" --margfile --outbase "tssspec" --numruns 10 --backval 1.016076539653688e-17 --numsamps <# lines in read pair file> --readdistfile --readfile --storagesize <# lines in read pair file> --perrun <# points from tss file to process per run> --wd --tssfile --jointsum --interthreshlow -17 --interthreshhigh -13

Concatenate the output files for each threshold to obtain files such as tssspec_17_all.txt. We must now compute the significance of conditional occupancy events:

DeterminePseudo --genomefile "mm10_1.txt" --binsize 10 --infile "tssspec_17_all.txt" --auxfile "" --margfile --outfile "tssspec_17_all_pvalsaux.txt" --specoutfile "tssspec_17_all_pvals.txt" --minmult 7 --maxmult 100 --multtries 1 --startinc 1 --mininc .001 --pcutoff 0.05 --fdrcutoff 0.1 --numsamps <# lines in non-self read pair file>

The resulting file tssspec_17_all_pvals.txt will contain conditional occupancy events with corrected p-values at the -17 threshold.

Contact reeder.c at gmail

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published