Lesson 4 covers pipeline automation, so I am not taking notes on it.
1. Peak Calling Introduction
For ChIP-seq experiments, what we observe in the alignment files is a strand asymmetry, with read densities on the + and - strands centered around the binding site. The 5' ends of the selected fragments form groups on the positive and negative strands. The distribution of these groups is then assessed using statistical measures and compared against background (input or mock IP samples) to determine whether the site of enrichment is likely to be a real binding site.
Q: Why are only the 5' ends of the fragments taken into account?
ChIP-Seq analysis algorithms specialize in identifying one of two types of enrichment (or have specific methods for each): broad peaks or broad domains (e.g. histone modifications that cover entire gene bodies) and narrow peaks (e.g. transcription factor binding). Narrow peaks are easier to detect because we are looking for regions with higher amplitude, which are easier to distinguish from the background than broad or dispersed marks. There are also 'mixed' binding profiles, which can be hard for algorithms to discern. An example is Pol II, which binds at the promoter and across the length of the gene, resulting in mixed signals (narrow and broad).
2. MACS2 (Peak Calling Tool)
MACS2 (Model-based Analysis of ChIP-Seq) is a commonly used tool for identifying transcription factor binding sites. The MACS algorithm captures the influence of genome complexity to evaluate the significance of enriched ChIP regions. Although it was developed for the detection of transcription factor binding sites, it is also suited to larger regions.
MACS improves the spatial resolution of binding sites through combining the information of both sequencing tag position and orientation. MACS can be easily used either for the ChIP sample alone, or along with a control sample which increases specificity of the peak calls. The MACS workflow is depicted below.
* Removing redundancy
Reads with the same start position are considered duplicates. These duplicates can arise from experimental artifacts, but they can also contribute to genuine ChIP signal.
The bad kind of duplicates: If the initial starting material is low, this can lead to overamplification of the material before sequencing. Any biases in PCR will compound this problem and can lead to artificially enriched regions. Blacklisted (repeat) regions with ultra-high signal will also be high in duplicates. Masking these regions prior to analysis can help remove this problem.
The good kind of duplicates: Duplicates will also exist within highly efficient (or even inefficient) ChIP samples when they are deeply sequenced. Removing these duplicates can lead to saturation and therefore to an underestimation of the ChIP signal.
Take-home: Consider your enrichment efficiency and sequencing depth. But, because we cannot distinguish between the good and the bad, best practice is to remove duplicates prior to peak calling. Retain duplicates for differential binding analysis. Also, if you are expecting binding in repetitive regions keep duplicates and multiple mappers.
MACS provides different options for dealing with duplicates. The default is to keep a single read at each location. The auto option, which is commonly used, tells MACS to calculate the maximum number of tags allowed at the exact same location based on a binomial distribution, using 1e-5 as the p-value cutoff. An alternative is the all option, which keeps every tag. If an integer is specified, then at most that many tags will be kept at the same location. The chosen redundancy setting is applied consistently to both the ChIP and input samples.
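The duplicate-handling options above can be sketched in Python. This is a minimal illustration and not MACS2's actual code: the function names (auto_max_duplicates, cap_duplicates) and the tuple representation of a tag are hypothetical, and the auto rule is assumed to mean "the smallest per-position count whose binomial tail probability drops below the p-value cutoff", with every tag landing on a given position with probability 1/genome_size.

```python
import math
from collections import Counter


def binom_pmf(k, n, p):
    """Binomial probability of exactly k successes, computed in log space."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def auto_max_duplicates(total_tags, genome_size, pvalue=1e-5):
    """Assumed 'auto' rule: smallest k (at least 1) with P(X > k) < pvalue."""
    p = 1.0 / genome_size  # assumption: tags fall uniformly on the genome
    cumulative = 0.0
    for k in range(total_tags + 1):
        cumulative += binom_pmf(k, total_tags, p)
        if 1.0 - cumulative < pvalue:
            return max(k, 1)
    return total_tags


def cap_duplicates(tags, max_dup):
    """Keep at most max_dup tags per (chrom, start, strand) location."""
    seen = Counter()
    kept = []
    for tag in tags:
        if seen[tag] < max_dup:
            kept.append(tag)
            seen[tag] += 1
    return kept


if __name__ == "__main__":
    tags = [("chr1", 100, "+")] * 5 + [("chr1", 200, "-")]
    print(len(cap_duplicates(tags, 1)))  # default behaviour: one tag per location
    print(auto_max_duplicates(10_000_000, 2_700_000_000))
```

Passing a very large max_dup to cap_duplicates corresponds to the all option, since nothing is ever discarded.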
* Modeling the shift size
The tag density around a true binding site should show a bimodal enrichment pattern (or paired peaks). MACS takes advantage of this bimodal pattern to empirically model the shift size and so better locate the precise binding sites.
To find paired peaks for building the model, MACS first scans the whole dataset for highly significant enriched regions. This is done using only the ChIP sample. Given a sonication size (bandwidth) and a high-confidence fold-enrichment (mfold), MACS slides windows of twice the bandwidth across the genome to find regions with tags more than mfold enriched relative to a random genome-wide tag distribution (genome background).
MACS randomly samples 1,000 of these high-quality peaks, separates their positive and negative strand tags, and aligns them by the midpoint between their centers. The distance between the modes of the two peaks in the alignment is defined as 'd' and represents the estimated fragment length. MACS shifts all the tags by d/2 toward the 3' ends to the most likely protein-DNA interaction sites.
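As a rough sketch of the idea, the snippet below estimates d for a single paired peak and then shifts tags by d/2 toward their 3' ends. It is a simplification under stated assumptions: MACS aggregates about 1,000 peaks before measuring the distance between modes, whereas here the mode is taken directly from one peak's 5' positions, and the function names are hypothetical.

```python
from collections import Counter


def mode(positions):
    """Most frequent position in a list of 5' tag positions."""
    return Counter(positions).most_common(1)[0][0]


def estimate_d(plus_starts, minus_starts):
    """Distance between the minus-strand mode and the plus-strand mode."""
    return mode(minus_starts) - mode(plus_starts)


def shift_tag(position, strand, d):
    """Shift a tag by d/2 toward its 3' end.

    Plus-strand tags move to higher coordinates and minus-strand tags
    move to lower coordinates, so both converge on the binding site.
    """
    half = d // 2
    return position + half if strand == "+" else position - half


if __name__ == "__main__":
    plus = [100, 100, 100, 98, 102]
    minus = [300, 300, 300, 297, 303]
    d = estimate_d(plus, minus)
    print(d, shift_tag(100, "+", d), shift_tag(300, "-", d))
```

After shifting, tags from both strands pile up at the same coordinate, which is what makes the later window scan sharper than working with raw 5' positions.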
* Scaling libraries
For experiments in which sequencing depth differs between input and treatment samples, MACS linearly scales the total control tag count to be the same as the total ChIP tag count. The default behaviour is for the larger sample to be scaled down.
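The linear scaling can be written as a pair of multiplicative factors. The sketch below assumes the "scale the larger sample down" behaviour described above; the function name is hypothetical.

```python
def scaling_factors(chip_total, control_total):
    """Return (chip_scale, control_scale) so both libraries match the smaller one."""
    if chip_total >= control_total:
        # ChIP is the larger library, so ChIP counts are scaled down.
        return control_total / chip_total, 1.0
    # Control is the larger library, so control counts are scaled down.
    return 1.0, chip_total / control_total


if __name__ == "__main__":
    chip_scale, control_scale = scaling_factors(10_000_000, 20_000_000)
    print(chip_scale, control_scale)
```

Every control-derived count is multiplied by control_scale before it is compared with a ChIP count, so the two libraries behave as if they had equal depth.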
* Effective genome length
To calculate the background lambda (λ_BG) from the tag count, MACS2 requires the effective genome size, that is, the size of the genome that is mappable. Mappability is related to the uniqueness of the k-mers at a particular position of the genome. Low-complexity and repetitive regions have low uniqueness, which means low mappability. We therefore need to provide the effective genome length to correct for the loss of true signal in low-mappability regions.
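A minimal sketch of how the effective genome size enters the background estimate, assuming the background lambda is simply the number of tags expected in a window if tags were spread uniformly over the mappable genome. The function name is hypothetical, and the 2.7e9 value used in the example is the commonly quoted effective size for human.

```python
def background_lambda(total_tags, window_size, effective_genome_size):
    """Expected tags per window under a uniform genome-wide background."""
    return total_tags * window_size / effective_genome_size


if __name__ == "__main__":
    # 2.7 million tags, 200 bp window, human effective genome size (assumed 2.7e9)
    print(background_lambda(2_700_000, 200, 2.7e9))
```

Using the full genome length instead of the mappable length would make the denominator too large and the background estimate too small, which inflates the apparent enrichment.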
* Peak detection
After MACS shifts every tag by d/2, it slides a window of size 2d across the genome to find candidate peaks. The tag distribution along the genome can be modeled by a Poisson distribution. The Poisson is a one-parameter model, where the parameter λ is the expected number of reads in that window.
Instead of using a uniform λ estimated from the whole genome, MACS uses a dynamic parameter, λ_local, defined for each candidate peak. The lambda parameter is estimated from the control sample and is deduced by taking the maximum value across various window sizes:
λ_local = max(λ_BG, λ_1k, λ_5k, λ_10k)
Here λ_BG is the genome-wide background, and λ_1k, λ_5k and λ_10k are estimated from 1 kb, 5 kb and 10 kb windows centered on the candidate peak in the control sample.
In this way lambda captures the influence of local biases, and is robust against occasional low tag counts at small local regions. Possible sources for these biases include local chromatin structure, DNA amplification and sequencing bias, and genome copy number variation.
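The "maximum across window sizes" rule for lambda can be sketched as follows. This is an illustration under assumptions: the control tag counts for each surrounding window are taken as given, each count is rescaled to the width of the candidate peak so the values are comparable, and the function name and argument layout are hypothetical.

```python
def local_lambda(lambda_bg, control_counts, peak_width):
    """Maximum of the background lambda and the per-window control estimates.

    control_counts maps a window size in bp (e.g. 1000, 5000, 10000) to the
    number of control tags found in that window around the candidate peak.
    """
    candidates = [lambda_bg]
    for window_size, count in control_counts.items():
        # Rescale the window's tag count to the width of the candidate peak.
        candidates.append(count * peak_width / window_size)
    return max(candidates)


if __name__ == "__main__":
    counts = {1000: 10, 5000: 100, 10000: 100}
    print(local_lambda(2.0, counts, peak_width=200))
```

Taking the maximum is the conservative choice: a peak has to beat the highest plausible background level, so a locally noisy control region raises the bar while a locally empty one cannot lower it below the genome-wide rate.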
A region is considered to have significant tag enrichment if the p-value < 1e-5 (default). This is a Poisson distribution p-value based on the local lambda (λ_local).
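The significance test can be illustrated with a plain upper-tail Poisson probability, P(X >= k) for an observed tag count k and a given lambda. This is a sketch with hypothetical function names, using only the standard library. The tail is summed directly rather than computed as 1 minus the CDF, to avoid losing precision for very small p-values.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if k <= 0:
        return 1.0
    # First term: P(X = k), computed in log space for stability.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while True:
        total += term
        i += 1
        term *= lam / i  # P(X = i) from P(X = i - 1)
        if i > lam and term <= total * 1e-17:
            break
    return min(total, 1.0)


def is_enriched(observed_tags, lam, cutoff=1e-5):
    """True if the window's tag count is significant at the given cutoff."""
    return poisson_upper_tail(observed_tags, lam) < cutoff


if __name__ == "__main__":
    print(is_enriched(30, 4.0), is_enriched(5, 4.0))
```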
Overlapping enriched peaks are merged, and each tag position is extended 'd' bases (why not d/2?) from its center. The location in the peak with the highest fragment pileup, hereafter referred to as the summit, is predicted as the precise binding location. The ratio between the ChIP-seq tag count and the local lambda (λ_local) is reported as the fold enrichment.
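A small sketch of the summit and fold-enrichment calculations. It assumes the tags have already been extended into fragments, represented here as half-open (start, end) intervals, and all function names are hypothetical.

```python
def pileup(fragments, region_start, region_end):
    """Per-base fragment depth over [region_start, region_end)."""
    depth = [0] * (region_end - region_start)
    for start, end in fragments:
        for pos in range(max(start, region_start), min(end, region_end)):
            depth[pos - region_start] += 1
    return depth


def summit_position(depth, region_start):
    """Genomic coordinate of the first base with the highest pileup."""
    best = max(range(len(depth)), key=depth.__getitem__)
    return region_start + best


def fold_enrichment(chip_tag_count, lam):
    """Ratio of the observed ChIP tag count to the expected count (lambda)."""
    return chip_tag_count / lam


if __name__ == "__main__":
    frags = [(100, 200), (150, 250), (180, 280)]
    depth = pileup(frags, 100, 300)
    print(summit_position(depth, 100), fold_enrichment(20, 4.0))
```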
* Estimation of false discovery rate
Each peak is considered an independent test, so when we encounter thousands of significant peaks in a sample we have a multiple testing problem. In MACS v1.4, the FDR was determined empirically by exchanging the ChIP and control samples. In MACS2, however, p-values are corrected for multiple comparisons using the Benjamini-Hochberg correction.
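For reference, the standard Benjamini-Hochberg adjustment can be written in a few lines. This is a generic textbook implementation and not MACS2's internal code; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each p-value is multiplied by m/rank, where m is the number of peaks tested and rank is its position in the sorted list, and the running minimum keeps the adjusted values from decreasing as the raw p-values increase.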