GRanges

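IRanges represents the positions of sequences on a genome as integer intervals. GRanges builds on IRanges by adding chromosome (seqnames) and DNA strand information. Internally, a GRanges object stores its positions as an IRanges, and range operations on it are typically much faster than the equivalent operations on a data.frame.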

Basic operations

A single ranges object can hold multiple intervals. Operations can be applied to each interval individually or to the ranges object as a whole.

Creating an object

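Creating a GRanges requires seqnames, ranges, and strand; names are optional. These are the core fields of the object. Any additional named arguments become metadata columns, which are displayed to the right of the | separator. Besides the intervals, the object also carries chromosome information: the seqinfo line at the bottom of the output summarizes it, and the full seqinfo records seqnames (chromosome names), seqlengths (chromosome lengths), isCircular (whether a chromosome is circular), and genome (the genome name). Two helper classes appear in the constructor call: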

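  • Rle: compactly stores repeated values as a run-length encoding, recording each value and the number of times it repeats
  • IRanges: compactly stores positions, recording the start, end, and width of each interval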
> gr <- GRanges(
+     seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
+     ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
+     strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
+     score = 1:10,
+     GC = seq(1, 0, length.out = 10))
> gr
GRanges object with 10 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  a     chr1   101-111      - |         1                 1
  b     chr2   102-112      + |         2 0.888888888888889
  c     chr2   103-113      + |         3 0.777777777777778
  d     chr2   104-114      * |         4 0.666666666666667
  e     chr1   105-115      * |         5 0.555555555555556
  f     chr1   106-116      + |         6 0.444444444444444
  g     chr3   107-117      + |         7 0.333333333333333
  h     chr3   108-118      + |         8 0.222222222222222
  i     chr3   109-119      - |         9 0.111111111111111
  j     chr3   110-120      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
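The output above reports "no seqlengths" because no chromosome information was supplied. As a minimal sketch, the seqinfo fields can be filled in after construction. The chromosome lengths and the genome label below are made-up values for illustration, not real assembly data.

```r
library(GenomicRanges)

# A small GRanges on three chromosomes
gr <- GRanges(
    seqnames = c("chr1", "chr2", "chr3"),
    ranges = IRanges(start = c(101, 102, 107), width = 11),
    strand = c("-", "+", "+"))

# Made-up chromosome lengths and genome label (assumptions for illustration)
seqlengths(gr) <- c(chr1 = 1000, chr2 = 2000, chr3 = 1500)
isCircular(gr) <- c(chr1 = FALSE, chr2 = FALSE, chr3 = FALSE)
genome(gr) <- "toyGenome"

seqinfo(gr)    # full table: seqnames, seqlengths, isCircular, genome
seqlevels(gr)  # just the chromosome names
```

Once seqlengths are set, operations that produce ranges outside a chromosome will warn about out-of-bound ranges.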
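Accessing attributes
  • start() the start of each interval
  • end() the end of each interval
  • width() the width of each interval
# Get the width of every range (for example, to inspect the width distribution)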
> width(gr)
 [1] 11 11 11 11 11 11 11 11 11 11
  • length() the number of ranges in the object
> length(gr)
[1] 10
  • strand() the strand of each range
  • names() the names of the ranges
> names(gr)
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

# Names can also be assigned
> names(gr) <- 1:10
> names(gr)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
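  • seqinfo() chromosome information, including seqnames, seqlengths, isCircular, and genome
  • mcols() the metadata columns. Note that seqnames, ranges, and strand are core fields of a GRanges, not metadata columns, so mcols() does not return them; use seqnames(), ranges(), and strand() instead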
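To make the split between core fields and metadata columns concrete, here is a small sketch. The object and the GC values are invented for the example.

```r
library(GenomicRanges)

gr <- GRanges(
    seqnames = Rle(c("chr1", "chr2"), c(1, 2)),
    ranges = IRanges(101:103, end = 111:113, names = c("a", "b", "c")),
    strand = c("-", "+", "*"),
    score = 1:3)

# Core fields each have their own accessor
seqnames(gr)
ranges(gr)
strand(gr)

# Metadata columns live in mcols() and can be read or added with $
mcols(gr)
gr$score
gr$GC <- c(0.5, 0.4, 0.3)  # made-up values, adds a second metadata column

# granges() returns the core fields only, dropping the metadata columns
core_only <- granges(gr)
```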
Range operations

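IRanges and GRanges objects support many vector-like operations: they can be indexed by position, by name, or by a logical vector; the arithmetic operators + and - are defined on them (they widen or narrow each range at both ends); and different objects can be combined, split, or intersected. When several GRanges belong together, a GRangesList is useful for representing the grouping, for example the exons of each gene: each list element is a gene, and the exons of that gene are stored as a GRanges. The structure behaves like a list, so lapply works on it.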
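The exons-per-gene idea can be sketched as follows. The gene names and coordinates are hypothetical, and split() is just one convenient way to build a GRangesList from a GRanges with a grouping column.

```r
library(GenomicRanges)

# Hypothetical exons for two made-up genes
exons <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(100, 300, 500, 1000, 1200), width = 50),
    strand = c("+", "+", "+", "-", "-"),
    gene = c("geneA", "geneA", "geneA", "geneB", "geneB"))

# Group the exons by gene: one list element per gene
grl <- split(exons, exons$gene)

names(grl)         # the genes
elementNROWS(grl)  # number of exons in each gene

# List-style iteration: total exonic bases per gene
exon_bp <- sapply(grl, function(g) sum(width(g)))

# Many accessors are also vectorised over the list
exon_bp2 <- sum(width(grl))
```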

# Subset with a logical condition
> gr[gr$score < 5]
GRanges object with 4 ranges and 2 metadata columns:
    seqnames    ranges strand |     score                GC
       <Rle> <IRanges>  <Rle> | <integer>         <numeric>
  1     chr1   101-111      - |         1                 1
  2     chr2   102-112      + |         2 0.888888888888889
  3     chr2   103-113      + |         3 0.777777777777778
  4     chr2   104-114      * |         4 0.666666666666667
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
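  • shift moves the intervals by a given offset
# Shift every range: positive values move toward higher coordinates, negative values toward lower coordinates; strand is ignored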
> shift(gr, shift = 10)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   1     chr1   111-121      - |         1                 1
   2     chr2   112-122      + |         2 0.888888888888889
   3     chr2   113-123      + |         3 0.777777777777778
   4     chr2   114-124      * |         4 0.666666666666667
   5     chr1   115-125      * |         5 0.555555555555556
   6     chr1   116-126      + |         6 0.444444444444444
   7     chr3   117-127      + |         7 0.333333333333333
   8     chr3   118-128      + |         8 0.222222222222222
   9     chr3   119-129      - |         9 0.111111111111111
  10     chr3   120-130      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
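Because shift works on coordinates and ignores strand, a strand-aware move has to be built by hand. The sketch below is one possible way to do that by passing a per-range offset; the helper variable step is an illustrative name, not part of the package.

```r
library(GenomicRanges)

gr <- GRanges("chr1", IRanges(c(101, 201), width = 11), strand = c("+", "-"))

right <- shift(gr, 10)   # both ranges move to higher coordinates
left  <- shift(gr, -10)  # both ranges move to lower coordinates

# Move each range 10 bp upstream relative to its own strand:
# upstream is lower coordinates on "+", higher coordinates on "-"
step <- ifelse(as.character(strand(gr)) == "-", 10L, -10L)
upstream <- shift(gr, step)

start(right)
start(left)
start(upstream)
```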
  • restrict clips each range to a window given by a start and an end, keeping only the part that falls inside it
> restrict(gr, 105, 110)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   1     chr1   105-110      - |         1                 1
   2     chr2   105-110      + |         2 0.888888888888889
   3     chr2   105-110      + |         3 0.777777777777778
   4     chr2   105-110      * |         4 0.666666666666667
   5     chr1   105-110      * |         5 0.555555555555556
   6     chr1   106-110      + |         6 0.444444444444444
   7     chr3   107-110      + |         7 0.333333333333333
   8     chr3   108-110      + |         8 0.222222222222222
   9     chr3   109-110      - |         9 0.111111111111111
  10     chr3       110      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
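  • flank returns the flanking region of a given width next to each range; promoters is a related convenience function that returns a region spanning a given distance upstream and downstream of the start of each range
# Get the 10 bp upstream of each range (strand-aware; * is treated as +)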
> flank(gr, width = 10)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   1     chr1   112-121      - |         1                 1
   2     chr2    92-101      + |         2 0.888888888888889
   3     chr2    93-102      + |         3 0.777777777777778
   4     chr2    94-103      * |         4 0.666666666666667
   5     chr1    95-104      * |         5 0.555555555555556
   6     chr1    96-105      + |         6 0.444444444444444
   7     chr3    97-106      + |         7 0.333333333333333
   8     chr3    98-107      + |         8 0.222222222222222
   9     chr3   120-129      - |         9 0.111111111111111
  10     chr3   121-130      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

# Get the 10 bp downstream of each range by setting start = FALSE
> flank(gr, width = 10, start = FALSE)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   1     chr1    91-100      - |         1                 1
   2     chr2   113-122      + |         2 0.888888888888889
   3     chr2   114-123      + |         3 0.777777777777778
   4     chr2   115-124      * |         4 0.666666666666667
   5     chr1   116-125      * |         5 0.555555555555556
   6     chr1   117-126      + |         6 0.444444444444444
   7     chr3   118-127      + |         7 0.333333333333333
   8     chr3   119-128      + |         8 0.222222222222222
   9     chr3    99-108      - |         9 0.111111111111111
  10     chr3   100-109      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

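# Flanking regions can run past the ends of a chromosome; when seqlengths are known, clip them back into range (for example with trim())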
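A small sketch of the out-of-range problem and one way to handle it. The chromosome length of 1000 is a made-up value; trim() only has an effect when seqlengths are set on the object.

```r
library(GenomicRanges)

# Two ranges close to the ends of a chromosome of (made-up) length 1000
gr <- GRanges("chr1",
              IRanges(start = c(5, 985), width = 11),
              strand = c("+", "-"),
              seqlengths = c(chr1 = 1000))

# The upstream flanks extend past position 1 and past position 1000,
# so flank() warns about out-of-bound ranges
up <- suppressWarnings(flank(gr, width = 10))

# trim() clips the ranges back to the chromosome boundaries
up_trimmed <- trim(up)

# promoters() has the same issue: 20 bp upstream and 5 bp downstream of each start
prom <- trim(suppressWarnings(promoters(gr, upstream = 20, downstream = 5)))
```

Without seqlengths there is nothing to trim against, so ranges with negative or too-large coordinates would pass through silently.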
  • reduce merges overlapping and adjacent ranges, giving the union of the ranges
> reduce(gr)
GRanges object with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1   106-116      +
  [2]     chr1   101-111      -
  [3]     chr1   105-115      *
  [4]     chr2   102-113      +
  [5]     chr2   104-114      *
  [6]     chr3   107-118      +
  [7]     chr3   109-120      -
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
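  • disjoin splits the ranges at every start and end point into non-overlapping pieces that together cover exactly the same positions as the original ranges. It is useful when studying alternative splicing. It differs from gaps(), which returns the complement (the regions not covered by any range)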
> disjoin(gr)
GRanges object with 13 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1   106-116      +
   [2]     chr1   101-111      -
   [3]     chr1   105-115      *
   [4]     chr2       102      +
   [5]     chr2   103-112      +
   ...      ...       ...    ...
   [9]     chr3   108-117      +
  [10]     chr3       118      +
  [11]     chr3       109      -
  [12]     chr3   110-119      -
  [13]     chr3       120      -
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
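The three inter-range operations are easy to confuse, so here is a side-by-side sketch on a tiny made-up example with two overlapping ranges and one separate range.

```r
library(GenomicRanges)

gr <- GRanges("chr1", IRanges(start = c(1, 5, 20), end = c(10, 15, 25)))

# reduce: merge overlapping ranges into their union
red <- reduce(gr)

# disjoin: cut at every boundary, keeping the same covered positions
dis <- disjoin(gr)

# gaps: the uncovered positions between ranges (the complement)
gp <- gaps(gr)

red
dis
gp
```

Here reduce gives two ranges, disjoin gives four pieces that cover the same 21 positions, and gaps reports the uncovered stretch between positions 15 and 20.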
  • sort orders the ranges within a GRanges object
# Sort in genomic order: by chromosome first, then by strand, then by position
> sort(gr)
GRanges object with 10 ranges and 2 metadata columns:
     seqnames    ranges strand |     score                GC
        <Rle> <IRanges>  <Rle> | <integer>         <numeric>
   6     chr1   106-116      + |         6 0.444444444444444
   1     chr1   101-111      - |         1                 1
   5     chr1   105-115      * |         5 0.555555555555556
   2     chr2   102-112      + |         2 0.888888888888889
   3     chr2   103-113      + |         3 0.777777777777778
   4     chr2   104-114      * |         4 0.666666666666667
   7     chr3   107-117      + |         7 0.333333333333333
   8     chr3   108-118      + |         8 0.222222222222222
   9     chr3   109-119      - |         9 0.111111111111111
  10     chr3   110-120      - |        10                 0
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths
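Sorting by genomic order is the default, but other orderings are sometimes wanted. The following sketch, on a small made-up object, shows sorting by position regardless of strand and ordering by a metadata column with order().

```r
library(GenomicRanges)

gr <- GRanges(
    seqnames = c("chr1", "chr2", "chr1"),
    ranges = IRanges(start = c(100, 50, 200), width = 10),
    strand = c("-", "+", "+"),
    score = c(1, 3, 2))

# Default: chromosome, then strand (+ before -), then position
by_genome <- sort(gr)

# Ignore strand so ranges are ordered by chromosome and position only
by_position <- sort(gr, ignore.strand = TRUE)

# Order by a metadata column, highest score first
by_score <- gr[order(gr$score, decreasing = TRUE)]
```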
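  • findOverlaps finds the overlaps between two objects, for example to check which ranges fall within a set of regions of interest. Each row of the result says which range of the first object (the query) overlaps which range of the second object (the subject). %over% is similar, but directly returns a logical vector with one value per query range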
> gr6 <- GRanges(seqnames = "chr2",
+               ranges = IRanges(start = c(6,8,12,14,21,22,23),width = c(11,4,2,5,7,7,7)),
+               strand =  "*")
> gr7 <- GRanges(seqnames = "chr2",
+               ranges = IRanges(start = c(6,15),width = 10),
+               strand =  "*")
> gr6
GRanges object with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2      6-16      *
  [2]     chr2      8-11      *
  [3]     chr2     12-13      *
  [4]     chr2     14-18      *
  [5]     chr2     21-27      *
  [6]     chr2     22-28      *
  [7]     chr2     23-29      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gr7
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2      6-15      *
  [2]     chr2     15-24      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

> findOverlaps(gr6,gr7)
Hits object with 9 hits and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           1
  [2]         1           2
  [3]         2           1
  [4]         3           1
  [5]         4           1
  [6]         4           2
  [7]         5           2
  [8]         6           2
  [9]         7           2
  -------
  queryLength: 7 / subjectLength: 2

# Or use the logical result to subset the overlapping ranges directly
> gr6[gr6 %over% gr7]
GRanges object with 7 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2      6-16      *
  [2]     chr2      8-11      *
  [3]     chr2     12-13      *
  [4]     chr2     14-18      *
  [5]     chr2     21-27      *
  [6]     chr2     22-28      *
  [7]     chr2     23-29      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
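A few related calls are often used alongside findOverlaps. The sketch below reuses the gr6 and gr7 objects from above to pull the indices out of the Hits object, count overlaps per query, and require the query to lie entirely within a subject range.

```r
library(GenomicRanges)

gr6 <- GRanges(seqnames = "chr2",
               ranges = IRanges(start = c(6, 8, 12, 14, 21, 22, 23),
                                width = c(11, 4, 2, 5, 7, 7, 7)),
               strand = "*")
gr7 <- GRanges(seqnames = "chr2",
               ranges = IRanges(start = c(6, 15), width = 10),
               strand = "*")

hits <- findOverlaps(gr6, gr7)
queryHits(hits)    # indices into gr6
subjectHits(hits)  # indices into gr7

# Number of gr7 ranges that each gr6 range overlaps
n_hits <- countOverlaps(gr6, gr7)

# Same subset as gr6[gr6 %over% gr7]
sub <- subsetByOverlaps(gr6, gr7)

# Only keep pairs where the gr6 range lies completely inside a gr7 range
inside <- findOverlaps(gr6, gr7, type = "within")
```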
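  • tile splits each range into tiles, given either the number of tiles or the tile width
  • slidingWindows creates sliding windows, given the window width and the step between windows
  • tileGenome returns a set of genomic regions that form a partition of a given genome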
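The three window functions can be sketched on a toy example. The 100 bp range and the chromosome lengths passed to tileGenome are made-up values chosen so the arithmetic is easy to follow.

```r
library(GenomicRanges)

gr <- GRanges("chr1", IRanges(1, 100))

# tile: split each range into n tiles; the result is a GRangesList
# with one element per input range
tiles <- tile(gr, n = 4)

# tile with a width: tiles are at most this wide
tiles_w <- tile(gr, width = 30)

# slidingWindows: overlapping windows of width 20 that advance by 10
win <- slidingWindows(gr, width = 20, step = 10)

# tileGenome: partition whole chromosomes, given their lengths
bins <- tileGenome(c(chr1 = 100, chr2 = 50),
                   tilewidth = 25,
                   cut.last.tile.in.chrom = TRUE)
```

tile and slidingWindows return a list because each input range produces its own set of windows; tileGenome with cut.last.tile.in.chrom = TRUE returns a plain GRanges.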
