Skip to content

Compute spike in normalization factor from BAM alignment maps

Luís edited this page Jan 16, 2018 · 6 revisions

The lab's ChIP-seq spike-in method allows you to get semi-quantitative measurements of target protein abundance. It can be summarized in the following main steps:

  1. Mix-in the SK288c spike-in strain cells with the experimental strain cells (typically SK1) before ChIP;
  2. Perform regular ChIP;
  3. Sequence spiked control and test sample ChIP libraries;
  4. Align the output reads to a hybrid genome between the experimental strain (typically SK1) and the SK288c spike-in strain (using the lab's pipeline: ChIPseq_Pipeline_hybrid_genome).
  5. Compute a normalizing constant by which to multiply the test sample data to allow semi-quantitative comparison to the reference sample.

The following is an illustration of how to perform the last step: computing the normalization factor.

We will be working with two example samples, a control and a test sample. Both strains were constructed on the SK1 background and the cells were mixed with SK288C spike-in cells (at a proportion of 80%:20%) before ChIP. We will start with the BAM alignment maps for the two strains, each with a ChIP and an Input sample.

$ ls

ctrl_chip.bam
ctrl_input.bam
test_chip.bam
test_input.bam

We need to compute the total number of reads mapped to each genome (SK1 and S288C) in each sample. We will use Samtools for that, but first we need to sort and index the BAM files. Let's use a for loop to save some typing.

Note: If you are working with SAM files, I suggest converting to BAM format first: samtools view -Sb your_file.sam > your_file.bam

Note 2: If you used the lab's ChIP-seq pipeline for hybrid genomes you already have sorted and indexed BAM files. You can safely skip this step and start at the Samtools idxstats step below.

for FILE in *.bam
do
    NEW_FILE=${FILE%.*}_sorted.bam
    samtools sort -o ${NEW_FILE} -O 'bam' -T 'temp' ${FILE}
    samtools index ${NEW_FILE}
done

We can use Samtools' idxstats to count the number of reads aligned to each chromosome. Here's the updated list of files.

$ ls

ctrl_chip.bam
ctrl_chip_sorted.bam
ctrl_chip_sorted.bam.bai
ctrl_input.bam
ctrl_input_sorted.bam
ctrl_input_sorted.bam.bai
test_chip.bam
test_chip_sorted.bam
test_chip_sorted.bam.bai
test_input.bam
test_input_sorted.bam
test_input_sorted.bam.bai

We can use another for loop to run idxstats on all files. We only need the chromosome name and the number of reads (columns one and tree of the output table), so here I am additionally using cut to keep only the required columns.

for FILE in *_sorted.bam
do
    NEW_FILE=counts_${FILE%_sorted.bam}.txt
    samtools idxstats ${FILE} | cut -f 1,3 > ${NEW_FILE}
done

Here's the new list of files.

$ ls

counts_ctrl_chip.txt
counts_ctrl_input.txt
counts_test_chip.txt
counts_test_input.txt
ctrl_chip.bam
ctrl_chip_sorted.bam
ctrl_chip_sorted.bam.bai
ctrl_input.bam
ctrl_input_sorted.bam
ctrl_input_sorted.bam.bai
test_chip.bam
test_chip_sorted.bam
test_chip_sorted.bam.bai
test_input.bam
test_input_sorted.bam
test_input_sorted.bam.bai

Let's take a look at one of the generated count files.

$ less counts_ctrl_chip.txt

chrI_S288C      23634
chrII_S288C     43718
chrIII_S288C    25952
chrIV_S288C     79575
chrV_S288C      34144
chrVI_S288C     28486
chrVII_S288C    59480
chrVIII_S288C   36607
chrIX_S288C     26724
chrX_S288C      39765
chrXI_S288C     33028
chrXII_S288C    119324
chrXIII_S288C   47988
chrXIV_S288C    39898
chrXV_S288C     59733
chrXVI_S288C    55529
chrI_SK1        79193
chrII_SK1       179584
chrIII_SK1      81516
chrIV_SK1       325919
chrV_SK1        136771
chrVI_SK1       122649
chrVII_SK1      238245
chrVIII_SK1     141021
chrIX_SK1       100759
chrX_SK1        159691
chrXI_SK1       137738
chrXII_SK1      215239
chrXIII_SK1     199173
chrXIV_SK1      169165
chrXV_SK1       235665
chrXVI_SK1      221117
*       0

Now we can compute the spike-in normalization factor from the read counts. This consists on two main steps. First we get counts of reads aligning to each of the two genomes: S288C, the spike-in genome, and SK1, the test strain genome background. With that, we can compute the normalization factor to apply to the test sample, by normalizing each ChIP sample to its input, followed by normalizing each sample to the spike-in, followed, finally, by getting the ratio of the test sample to the reference or control sample. This will be done in R, using a function available in the package.

> hwglabr2::spikein_normalization_factor(ref_chip_counts='counts_ctrl_chip.txt',
                                         ref_input_counts='counts_ctrl_input.txt',
                                         test_chip_counts='counts_test_chip.txt',
                                         test_input_counts='counts_test_input.txt')

[1] 0.4242515

This prints some information to the console and returns the normalization factor by which to multiply the test sample data, allowing semi-quantitative comparison to the control sample. In this example, the test sample contains about 42.4% of the amount of target protein detected in the control sample.