GenomicRanges 根据重合比例合并基因区间数据

最近在做 ATAC-Seq 分析,得到实验组和对照组 peaksets (本文称为 peaks1 和 peaks2) 后想合并得到 consensus peaksets. 用 bedtools intersect 做可以设置重合比例参数,但它无法从 peaks1 和 peaks2 计算并集为新 peaksets 然后输出。方法二是自己写 python 代码,但我觉得 GenomicRanges 可以实现,没必要重复造轮子。

实现过程大致如下:先 findOverlaps 计算两个 peaksets 重合索引,然后 pintersect 得到 peaksets 重合区域,并计算每个重合区域长度,分别除以原 peak 区域长度得到重合比例。根据自定的阈值(案例是任意比例大于 0.5)选择保留的重合,最后用 punion 合并。

library(GenomicRanges)

overlap_index <- findOverlaps(peaks1, peaks2)
# 重合区域及长度(bp)
overlaps <- pintersect(peaks1[queryHits(overlap_index)], peaks2[subjectHits(overlap_index)])
overlap_width <- width(overlaps)
# 重合比例
overlap_percent1 <- overlap_width / width(peaks1[queryHits(overlap_index)])
overlap_percent2 <- overlap_width / width(peaks2[subjectHits(overlap_index)])

# 按阈值筛选
keep_overlaps <- (overlap_percent1 > 0.5) | (overlap_percent2 > 0.5)
keep_index <- overlap_index[keep_overlaps]
merged_peaks <- punion(peaks1[queryHits(keep_index)], peaks2[subjectHits(keep_index)])

查看合并结果

> merged_peaks[1:3]
GRanges object with 3 ranges and 0 metadata columns:
      seqnames      ranges strand
              
  [1]     chr1  9959-10557      *
  [2]     chr1 11142-11518      *
  [3]     chr1 13231-13986      *
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

选取一条在两个 peaks 里重合比例都不为 1 的记录做检查,发现正确合并了两个 peak 区域。

> c(peaks1[3], peaks2[3], merged_peaks[3])
GRanges object with 3 ranges and 6 metadata columns:
      seqnames      ranges strand |        name     score signalValue    pValue    qValue      peak
               |        
  [1]     chr1 13231-13959      * |      peak_2      1000    1910.682  10.12469   7.41840       290
  [2]     chr1 13314-13986      * |      peak_2      1000     971.596   8.91901   5.75845       211
  [3]     chr1 13231-13986      * |                NA          NA        NA        NA      
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

你可能感兴趣的:(GenomicRanges 根据重合比例合并基因区间数据)