-
Notifications
You must be signed in to change notification settings - Fork 4
Compute spike in normalization factor from BAM alignment maps
The lab's ChIP-seq spike-in method allows you to get semi-quantitative measurements of target protein abundance. It can be summarized in the following main steps:
- Mix-in the SK288c spike-in strain cells with the experimental strain cells (typically SK1) before ChIP;
- Perform regular ChIP;
- Sequence spiked control and test sample ChIP libraries;
- Align the output reads to a hybrid genome between the experimental strain (typically SK1) and the SK288c spike-in strain (using the lab's pipeline: ChIPseq_Pipeline_hybrid_genome).
- Compute a normalizing constant by which to multiply the test sample data to allow semi-quantitative comparison to the reference sample.
The following is an illustration of how to perform the last step: computing the normalization factor.
We will be working with two example samples, a control and a test sample. Both strains were constructed on the SK1 background and the cells were mixed with SK288C spike-in cells (at a proportion of 80%:20%) before ChIP. We will start with the BAM alignment maps for the two strains, each with a ChIP and an Input sample.
$ ls
ctrl_chip.bam
ctrl_input.bam
test_chip.bam
test_input.bam
We need to compute the total number of reads mapped to each genome (SK1 and S288C) in each sample. We will use Samtools for that, but first we need to sort and index the BAM files. Let's use a for loop
to save some typing.
Note: If you are working with SAM files, I suggest converting to BAM format first:
samtools view -Sb your_file.sam > your_file.bam
Note 2: If you used the lab's ChIP-seq pipeline for hybrid genomes you already have sorted and indexed BAM files. You can safely skip this step and start at the Samtools
idxstats
step below.
for FILE in *.bam
do
NEW_FILE=${FILE%.*}_sorted.bam
samtools sort -o ${NEW_FILE} -O 'bam' -T 'temp' ${FILE}
samtools index ${NEW_FILE}
done
We can use Samtools' idxstats
to count the number of reads aligned to each chromosome. Here's the updated list of files.
$ ls
ctrl_chip.bam
ctrl_chip_sorted.bam
ctrl_chip_sorted.bam.bai
ctrl_input.bam
ctrl_input_sorted.bam
ctrl_input_sorted.bam.bai
test_chip.bam
test_chip_sorted.bam
test_chip_sorted.bam.bai
test_input.bam
test_input_sorted.bam
test_input_sorted.bam.bai
We can use another for loop
to run idxstats
on all files. We only need the chromosome name and the number of reads (columns one and tree of the output table), so here I am additionally using cut
to keep only the required columns.
for FILE in *_sorted.bam
do
NEW_FILE=counts_${FILE%_sorted.bam}.txt
samtools idxstats ${FILE} | cut -f 1,3 > ${NEW_FILE}
done
Here's the new list of files.
$ ls
counts_ctrl_chip.txt
counts_ctrl_input.txt
counts_test_chip.txt
counts_test_input.txt
ctrl_chip.bam
ctrl_chip_sorted.bam
ctrl_chip_sorted.bam.bai
ctrl_input.bam
ctrl_input_sorted.bam
ctrl_input_sorted.bam.bai
test_chip.bam
test_chip_sorted.bam
test_chip_sorted.bam.bai
test_input.bam
test_input_sorted.bam
test_input_sorted.bam.bai
Let's take a look at one of the generated count files.
$ less counts_ctrl_chip.txt
chrI_S288C 23634
chrII_S288C 43718
chrIII_S288C 25952
chrIV_S288C 79575
chrV_S288C 34144
chrVI_S288C 28486
chrVII_S288C 59480
chrVIII_S288C 36607
chrIX_S288C 26724
chrX_S288C 39765
chrXI_S288C 33028
chrXII_S288C 119324
chrXIII_S288C 47988
chrXIV_S288C 39898
chrXV_S288C 59733
chrXVI_S288C 55529
chrI_SK1 79193
chrII_SK1 179584
chrIII_SK1 81516
chrIV_SK1 325919
chrV_SK1 136771
chrVI_SK1 122649
chrVII_SK1 238245
chrVIII_SK1 141021
chrIX_SK1 100759
chrX_SK1 159691
chrXI_SK1 137738
chrXII_SK1 215239
chrXIII_SK1 199173
chrXIV_SK1 169165
chrXV_SK1 235665
chrXVI_SK1 221117
* 0
Now we can compute the spike-in normalization factor from the read counts. This consists on two main steps. First we get counts of reads aligning to each of the two genomes: S288C, the spike-in genome, and SK1, the test strain genome background. With that, we can compute the normalization factor to apply to the test sample, by normalizing each ChIP sample to its input, followed by normalizing each sample to the spike-in, followed, finally, by getting the ratio of the test sample to the reference or control sample. This will be done in R, using a function available in the package.
> hwglabr2::spikein_normalization_factor(ref_chip_counts='counts_ctrl_chip.txt',
ref_input_counts='counts_ctrl_input.txt',
test_chip_counts='counts_test_chip.txt',
test_input_counts='counts_test_input.txt')
[1] 0.4242515
This prints some information to the console and returns the normalization factor by which to multiply the test sample data, allowing semi-quantitative comparison to the control sample. In this example, the test sample contains about 42.4% of the amount of target protein detected in the control sample.