calmaxtag subcommand #39

Open
wants to merge 8 commits into master
Conversation

JohnUrban

Hi Tao,

If you are interested, I added a new subcommand called "calmaxtag", which is essentially a stripped-down version of 'filterdup' that tells the user how many tags would be allowed per site if they chose the 'auto' option. For example, "macs2 calmaxtag -g genome_size [-n numTags | -i inputFile]" calculates the maximum number of duplicate tags to leave at a site based on "-g genome_size" and either "-n numTags" or "-i inputFile". It is a simple addition to the set of macs2 subcommands that gives direct access to the MACS2 calculation with no other strings attached, which is especially handy when the number of tags is already known in advance (the -n option). It has been useful to me when exploring which macs2 settings to use with a new data set, and for on-the-fly estimates of how many tags MACS2 would keep with the 'auto' option when talking to my advisor and/or collaborators. I have had it for a while and thought others might also find it useful.
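To make the calculation concrete, here is a rough Python sketch of the binomial cutoff behind the 'auto' option (scipy-based; the function name and the 1e-5 default reflect my own reading of the documented 'auto' behaviour, not the exact MACS2 internals):

```python
from scipy.stats import binom

def cal_max_dup_tags(genome_size, n_tags, p_cutoff=1e-5):
    """Maximum number of tags allowed to stay at one site: one less than
    the smallest per-site count whose probability under
    Binomial(n_tags, 1/genome_size) drops below p_cutoff.
    A sketch of the idea behind '--keep-dup auto', not MACS2's code."""
    per_site = 1.0 / genome_size
    k = 1
    # walk upward until seeing >= k tags at a single site becomes 'surprising'
    while binom.sf(k - 1, n_tags, per_site) > p_cutoff:
        k += 1
    return k - 1

# e.g. 20 million tags against a ~2.7 Gb mappable genome
print(cal_max_dup_tags(2.7e9, 20_000_000))
```

calmaxtag would simply print that number, using either a tag count given directly with -n or the number of tags counted from the file given with -i.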

best,
John Urban

p.s. I should note that I did not knowingly delete "options.ratio"; if I did, that was accidental and it should be restored.

@taoliu
Contributor

taoliu commented Jun 6, 2014

So basically, it will report the maximum number of duplicated tags allowed according to the binomial test. I am thinking that adding a 'dry-run' option to the 'filterdup' sub-command would also work. How about that? Basically, I think the current way of keeping duplicates is still very naive. I don't believe the maximum number of allowed duplicates should be the same across the whole genome, so I am aiming to update the duplicate-removal process in the future.

@JohnUrban
Author

Hi Tao,

Recently "spp" came up twice in my day-to-day work: once when I was looking into your refinepeak function, and once when I was looking into methods for predicting fragment length and began reading about its strand cross-correlation function. So today I decided to learn a little more about it. The Brown BioMed division has a quick tutorial on using it, so I started there: https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/spp-r-from-chip-seq.

Step 6 in the tutorial caught my attention.
It turns out the spp package deals with 'duplicate reads' in a way similar to what I was describing (it filters in local windows). The function is called "remove.local.tag.anomalies". Here is its description:
"In Solexa ChIP-seq experiments some anomalous positions contain extremely high number of tags at the exact coordinates. The function scans the chromosomes, determining local tag density based on a provided window.size, doing two types of corrections: 1. removing all tags from positions that exceed local density by eliminate.fold; 2. reducing the tag count at positions exceeding cap.fold to the maximal allowed count. The statistical significance of counts exceeding either of these two threshold densities is calculated based on Poisson model, with confidence interval determined by the z.threshold Z-score parameter."

I figured I would share that, though you were probably already aware of it. Thanks for sharing the paper from David Gifford's group. I will look into it.

Cleaning up the reads locally before peak calling is the best way to go: that way, peaks are called against the best estimate of how the signal should look at every spot in the genome, for both the treatment and the control. When I recently did read counts and local binomial calculations in 1kb bins, the global model would allow 2 reads to stay in any of the bins. The local modeling, however, showed that of the ~2.8 million 1kb bins, ~7.8% were allowed to keep only 1 read, 53.7% were allowed to keep 2, and 38.5% were allowed to keep between 3 and 13 reads. To me this means the global model only gets it right ~50% of the time (and that keeping just 1 read gets it right only ~7.8% of the time). Since we are dealing with large numbers, the global binomial model gets it wrong a lot: ~218,400 bins (7.8%) are not being capped stringently enough, and a little over 1 million bins (38.5%) are being treated too harshly.
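To show what I mean by the local calculation, here is a small sketch of the per-bin version of the same binomial cutoff (the bin size, cutoff, and example counts are made up for illustration; the per-bin counts themselves would come from a separate read-counting step):

```python
from scipy.stats import binom

BIN_SIZE = 1_000      # illustrative 1kb bins
P_CUTOFF = 1e-5       # same cutoff as the genome-wide calculation

def max_dup_in_bin(bin_count, bin_size=BIN_SIZE, p_cutoff=P_CUTOFF):
    """Largest per-position read count still plausible under
    Binomial(bin_count, 1/bin_size) at the p_cutoff level;
    this is the genome-wide test applied to a single bin."""
    k = 1
    while binom.sf(k - 1, int(bin_count), 1.0 / bin_size) > p_cutoff:
        k += 1
    return k - 1

# e.g. a quiet bin, a typical bin, and a heavily covered bin
print([max_dup_in_bin(c) for c in (15, 40, 2500)])
```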

As an extreme example, I recently did an analysis where I added a full 43kb copy of the rDNA repeat to hg19 (as its own "chromosome") before mapping reads. There are >400 copies of this rDNA repeat in a normal genome. By adding in just 1 copy, I am looking at an average profile over the single repeat I provided. Due to the very high copy number, there are so many reads mapped to the 43kb stretch that a local binomial calculation over the 43kb window allows ~27 reads to stay at any position while the global model would only allow 2 or 3. If I let the global model filter the reads it would treat this area too harshly and result in a uniform distribution over all the sites. That is an extreme example, but the problem also exists for amplifications in cancer genomes. The local binomial will avoid putting a ceiling on the signal at these regions and will allow the true size/shape of the signal to persist.

I like your idea of checking the local complexity too. It would be great to have modular versions of these functions in macs2, but I wonder if it would be quicker to filter the reads and call the peaks locally in a single pass. If so, that could be a good option to have in callpeak. If I had to choose, though, I would go with separate modules; the modularity of macs2 is one of its best features.

A final thought. What I have been doing with macs2 is using filterdup as a pre-processing step before callpeak, at which point I choose to keep all reads. What is important about this is that I process each replicate individually, so replicates are allowed to have reads that align to the same position regardless of what the statistics would say if they were all pooled first. This is especially important if I keep only one read per position (which is what I have been doing for most analyses): if I were to pool the reads before keeping only 1, it would erroneously remove 'duplicate reads' that actually originated in separate replicates. A local binomial calculation would probably be robust enough that this pre-processing would not be necessary. Nonetheless, I would probably still do the local binomial filtering as a pre-processing step to ensure that replicates are filtered independently of each other.

--John
