calmaxtag subcommand #39

Open
wants to merge 8 commits into master
Conversation

JohnUrban

Hi Tao,

If you are interested, I added a new subcommand called "calmaxtag", which is essentially a stripped-down version of 'filterdup' that tells the user how many tags would be allowed per site if they chose the 'auto' option. For example, "macs2 calmaxtag -g genome_size [-n numTags | -i inputFile]" calculates the maximum number of duplicate tags to leave at a site based on "-g genome_size" and either "-n numTags" or "-i inputFile". It is a simple addition to the set of macs2 subcommands that gives direct access to the MACS2 calculation with no other strings attached, which is especially handy when the number of tags is already known in advance (the -n option). It has been useful to me when exploring which macs2 settings to use with a new data set, and for on-the-fly estimates of how many tags MACS2 would keep with the 'auto' option when talking to my advisor and/or collaborators. I have had it for a while and thought others might also find it useful.
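To make the calculation concrete, here is a rough Python sketch of the binomial cutoff behind the 'auto' option (scipy-based; the function name and the 1e-5 default reflect my own reading of the documented 'auto' behaviour, not the exact MACS2 internals):

```python
from scipy.stats import binom

def cal_max_dup_tags(genome_size, n_tags, p_cutoff=1e-5):
    """Maximum number of tags allowed to stay at one site: one less than
    the smallest per-site count whose probability under
    Binomial(n_tags, 1/genome_size) drops below p_cutoff.
    A sketch of the idea behind '--keep-dup auto', not MACS2's code."""
    per_site = 1.0 / genome_size
    k = 1
    # walk upward until seeing >= k tags at a single site becomes 'surprising'
    while binom.sf(k - 1, n_tags, per_site) > p_cutoff:
        k += 1
    return k - 1

# e.g. 20 million tags against a ~2.7 Gb mappable genome
print(cal_max_dup_tags(2.7e9, 20_000_000))
```

calmaxtag would simply print that number, using either a tag count given directly with -n or the number of tags counted from the file given with -i.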

best,
John Urban

p.s. I should note that I did not knowingly delete "options.ratio"; if I did, that was accidental and it should be restored.

@taoliu
Contributor

taoliu commented Jun 6, 2014

So basically, it will report the maximum number of duplicated tags allowed according to the binomial test. I am thinking that adding a 'dry-run' option to the 'filterdup' sub-command would also work. How about that? Basically, I think the current way of keeping duplicates is still very naive. I don't believe the maximum number of allowed duplicates should be the same across the whole genome, so I am aiming to update the duplicate-removal process in the future.

@JohnUrban
Author

Hi Tao,

Recently "spp" came up twice in my day-to-day work: once when I was looking into your refinepeak function, and once when I was looking into methods for predicting fragment length and began reading about its strand cross-correlation function. So today I decided to learn a little more about it. The Brown BioMed division has a quick tutorial on using it, so I started there: https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/spp-r-from-chip-seq.

Step 6 in the tutorial caught my attention.
It turns out the spp package deals with 'duplicate reads' in a way similar to what I was describing (it filters in local windows). The function is called "remove.local.tag.anomalies". Here is its description:
"In Solexa ChIP-seq experiments some anomalous positions contain extremely high number of tags at the exact coordinates. The function scans the chromosomes, determining local tag density based on a provided window.size, doing two types of corrections: 1. removing all tags from positions that exceed local density by eliminate.fold; 2. reducing the tag count at positions exceeding cap.fold to the maximal allowed count. The statistical significance of counts exceeding either of these two threshold densities is calculated based on Poisson model, with confidence interval determined by the z.threshold Z-score parameter."

I figured I would share that, though you were probably already aware of it. Thanks for sharing the paper from David Gifford's group. I will look into it.

Cleaning up the reads locally before peak calling is the best way to go: that way, peaks are called against the best estimate of how the signal should look at every spot in the genome, for both the treatment and the control. When I recently did read counts and local binomial calculations in 1kb bins, the global model would allow 2 reads to stay in any of the bins. The local modeling, however, showed that of the ~2.8 million 1kb bins, ~7.8% were allowed to keep only 1 read, 53.7% were allowed to keep 2, and 38.5% were allowed to keep between 3 and 13 reads. To me this means the global model only gets it right ~50% of the time (and that keeping just 1 read gets it right only ~7.8% of the time). Since we are dealing with large numbers, the global binomial model gets it wrong a lot: ~218,400 bins (7.8%) are not being capped stringently enough, and a little over 1 million bins (38.5%) are being treated too harshly.
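To show what I mean by the local calculation, here is a small sketch of the per-bin version of the same binomial cutoff (the bin size, cutoff, and example counts are made up for illustration; the per-bin counts themselves would come from a separate read-counting step):

```python
from scipy.stats import binom

BIN_SIZE = 1_000      # illustrative 1kb bins
P_CUTOFF = 1e-5       # same cutoff as the genome-wide calculation

def max_dup_in_bin(bin_count, bin_size=BIN_SIZE, p_cutoff=P_CUTOFF):
    """Largest per-position read count still plausible under
    Binomial(bin_count, 1/bin_size) at the p_cutoff level;
    this is the genome-wide test applied to a single bin."""
    k = 1
    while binom.sf(k - 1, int(bin_count), 1.0 / bin_size) > p_cutoff:
        k += 1
    return k - 1

# e.g. a quiet bin, a typical bin, and a heavily covered bin
print([max_dup_in_bin(c) for c in (15, 40, 2500)])
```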

As an extreme example, I recently did an analysis where I added a full 43kb copy of the rDNA repeat to hg19 (as its own "chromosome") before mapping reads. There are >400 copies of this rDNA repeat in a normal genome. By adding in just 1 copy, I am looking at an average profile over the single repeat I provided. Due to the very high copy number, there are so many reads mapped to the 43kb stretch that a local binomial calculation over the 43kb window allows ~27 reads to stay at any position while the global model would only allow 2 or 3. If I let the global model filter the reads it would treat this area too harshly and result in a uniform distribution over all the sites. That is an extreme example, but the problem also exists for amplifications in cancer genomes. The local binomial will avoid putting a ceiling on the signal at these regions and will allow the true size/shape of the signal to persist.

I like your idea of checking the local complexity too. It would be great to have modular versions of these functions in macs2, but I wonder if it would be quicker to filter the reads and call the peaks locally in a single pass. If so, that could be a good option to have in callpeak. If I had to choose, though, I would go with separate modules; the modularity of macs2 is one of its best features.

A final thought. What I have been doing with macs2 is using filterdup as a pre-processing step before callpeak, at which point I choose to keep all reads. What is important about this is that I process each replicate individually, so replicates are allowed to have reads that align to the same position regardless of what the statistics would say if they were all pooled first. This is especially important if I keep only one read per position (which is what I have been doing for most analyses): if I were to pool the reads before keeping only 1, it would erroneously remove 'duplicate reads' that actually originated in separate replicates. A local binomial calculation would probably be robust enough that this pre-processing would not be necessary. Nonetheless, I would probably still do the local binomial filtering as a pre-processing step to ensure that replicates are filtered independently of each other.

--John
