IRanges
用于解决序列在基因组上的位置问题,GRanges
在IRanges的基础上增加了染色体和DNA链的信息。在GRanges
内部可以使用IRanges
构建序列位置信息,速度比data.frame
快许多
基础操作
一个单独的ranges对象可以有多个intervals,我们可以对其中的每一个intervals进行操作也可以对ranges对象整体进行操作
创建对象
创建一个GRanges
需要指定names,seqnames,ranges,strand等信息,这些称作对象的元数据,另外还可以创建其他meta信息,在GRanges
中用|
分隔,在对象中不仅包含区间信息,还包含染色体信息,下面的seqinfo
展示了对象中的染色体信息,包括seqnames染色体名称,seqlengths染色体总长度,isCircular是否成环,genome基因组信息
-
Rle
快速记录冗余信息,包括种类和重复次数 -
IRanges
快速记录位置信息,包括起点,终点和长度
> gr <- GRanges(
+ seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),
+ ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),
+ strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),
+ score = 1:10,
+ GC = seq(1, 0, length=10))
> gr
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
|
a chr1 101-111 - | 1 1
b chr2 102-112 + | 2 0.888888888888889
c chr2 103-113 + | 3 0.777777777777778
d chr2 104-114 * | 4 0.666666666666667
e chr1 105-115 * | 5 0.555555555555556
f chr1 106-116 + | 6 0.444444444444444
g chr3 107-117 + | 7 0.333333333333333
h chr3 108-118 + | 8 0.222222222222222
i chr3 109-119 - | 9 0.111111111111111
j chr3 110-120 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
获取属性
-
start()
每个intervals的起点 -
end()
每个intervals的终点 -
width()
每个intervals的区间宽度
# 获取每一条序列的长度,并得到其分布
> width(gr)
[1] 11 11 11 11 11 11 11 11 11 11
-
length()
返回对象的长度
> length(gr)
[1] 10
-
strand()
获取链属性 -
names()
获取行名
> names(gr)
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
# 也可以进行赋值操作
> names(gr) <- 1:10
> names(gr)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
-
seqinfo()
获取染色体信息,包括seqnames
,seqlengths
,isCircular
,genome
-
mcols()
获取列metadata,注意GRanges
中seqnames,ranges,strand属于元数据,不能通过mcols获取
区间操作
IRanges
和GRanges
也可有一些类似向量的操作,使用向量,名字以及逻辑值进行索引,也可以进行算术加减,不同对象之间也可以进行合并,分隔,取交集等操作。如果有多个对象,我们通过创建一个GRangesList
是很有用的,例如用于表示分组信息(比如每个基因的外显子)。该列表的元素是基因,并且在每个元素中,外显子的范围被定义为GRanges。数据结构类似于list,可以使用lapply
操作
# 使用逻辑判断获取子集
> gr[gr$score < 5]
GRanges object with 4 ranges and 2 metadata columns:
seqnames ranges strand | score GC
|
1 chr1 101-111 - | 1 1
2 chr2 102-112 + | 2 0.888888888888889
3 chr2 103-113 + | 3 0.777777777777778
4 chr2 104-114 * | 4 0.666666666666667
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
-
shift
指定intervals进行平移
# 整体平移,正值指定向染色体上游,负值指定向染色体下游
> shift(gr, shift = 10)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
|
1 chr1 111-121 - | 1 1
2 chr2 112-122 + | 2 0.888888888888889
3 chr2 113-123 + | 3 0.777777777777778
4 chr2 114-124 * | 4 0.666666666666667
5 chr1 115-125 * | 5 0.555555555555556
6 chr1 116-126 + | 6 0.444444444444444
7 chr3 117-127 + | 7 0.333333333333333
8 chr3 118-128 + | 8 0.222222222222222
9 chr3 119-129 - | 9 0.111111111111111
10 chr3 120-130 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
-
restrict
范围截取,指定起点和终点,获取指定范围内的序列
> restrict(gr, 105, 110)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
|
1 chr1 105-110 - | 1 1
2 chr2 105-110 + | 2 0.888888888888889
3 chr2 105-110 + | 3 0.777777777777778
4 chr2 105-110 * | 4 0.666666666666667
5 chr1 105-110 * | 5 0.555555555555556
6 chr1 106-110 + | 6 0.444444444444444
7 chr3 107-110 + | 7 0.333333333333333
8 chr3 108-110 + | 8 0.222222222222222
9 chr3 109-110 - | 9 0.111111111111111
10 chr3 110 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
-
flank
获取指定长度上下游序列;promoters
是该功能的增强版,可以轻易获取指定区间上下游序列
# 获取序列上游10bp
> flank(gr, width = 10)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
|
1 chr1 112-121 - | 1 1
2 chr2 92-101 + | 2 0.888888888888889
3 chr2 93-102 + | 3 0.777777777777778
4 chr2 94-103 * | 4 0.666666666666667
5 chr1 95-104 * | 5 0.555555555555556
6 chr1 96-105 + | 6 0.444444444444444
7 chr3 97-106 + | 7 0.333333333333333
8 chr3 98-107 + | 8 0.222222222222222
9 chr3 120-129 - | 9 0.111111111111111
10 chr3 121-130 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
# 获取序列下游10bp,指定start=F
> flank(gr, width = 10, start = F)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
|
1 chr1 91-100 - | 1 1
2 chr2 113-122 + | 2 0.888888888888889
3 chr2 114-123 + | 3 0.777777777777778
4 chr2 115-124 * | 4 0.666666666666667
5 chr1 116-125 * | 5 0.555555555555556
6 chr1 117-126 + | 6 0.444444444444444
7 chr3 118-127 + | 7 0.333333333333333
8 chr3 119-128 + | 8 0.222222222222222
9 chr3 99-108 - | 9 0.111111111111111
10 chr3 100-109 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
# 获取上下游序列时需要注意不能超出chr的范围,需要指定范围
-
reduce
组装,获取序列的并集
> reduce(gr)
GRanges object with 7 ranges and 0 metadata columns:
seqnames ranges strand
[1] chr1 106-116 +
[2] chr1 101-111 -
[3] chr1 105-115 *
[4] chr2 102-113 +
[5] chr2 104-114 *
[6] chr3 107-118 +
[7] chr3 109-120 -
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
-
disjoin
拆分,去掉所有overlap的序列区域,获取所有序列的补集,在研究可变剪切中很有用,类似于gaps()
> disjoin(gr)
GRanges object with 13 ranges and 0 metadata columns:
seqnames ranges strand
[1] chr1 106-116 +
[2] chr1 101-111 -
[3] chr1 105-115 *
[4] chr2 102 +
[5] chr2 103-112 +
... ... ... ...
[9] chr3 108-117 +
[10] chr3 118 +
[11] chr3 109 -
[12] chr3 110-119 -
[13] chr3 120 -
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
-
sort
对GRanges
对象内部进行排序
# 按照基因组的顺序排序,先排染色体再排正负链
> sort(gr)
GRanges object with 10 ranges and 2 metadata columns:
seqnames ranges strand | score GC
|
6 chr1 106-116 + | 6 0.444444444444444
1 chr1 101-111 - | 1 1
5 chr1 105-115 * | 5 0.555555555555556
2 chr2 102-112 + | 2 0.888888888888889
3 chr2 103-113 + | 3 0.777777777777778
4 chr2 104-114 * | 4 0.666666666666667
7 chr3 107-117 + | 7 0.333333333333333
8 chr3 108-118 + | 8 0.222222222222222
9 chr3 109-119 - | 9 0.111111111111111
10 chr3 110-120 - | 10 0
-------
seqinfo: 3 sequences from an unspecified genome; no seqlengths
-
findOverlaps
获取两个对象之间重复的区域,指定序列在指定区域是否富集,返回结果表示第一个对象中的第几条序列与第二个对象中的第几条序列存在overlap,类似于%over%
,%over%
直接返回逻辑值
> gr6 <- GRanges(seqnames = "chr2",
+ ranges = IRanges(start = c(6,8,12,14,21,22,23),width = c(11,4,2,5,7,7,7)),
+ strand = "*")
> gr7 <- GRanges(seqnames = "chr2",
+ ranges = IRanges(start = c(6,15),width = 10),
+ strand = "*")
> gr6
GRanges object with 7 ranges and 0 metadata columns:
seqnames ranges strand
[1] chr2 6-16 *
[2] chr2 8-11 *
[3] chr2 12-13 *
[4] chr2 14-18 *
[5] chr2 21-27 *
[6] chr2 22-28 *
[7] chr2 23-29 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> gr7
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
[1] chr2 6-15 *
[2] chr2 15-24 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> findOverlaps(gr6,gr7)
Hits object with 9 hits and 0 metadata columns:
queryHits subjectHits
[1] 1 1
[2] 1 2
[3] 2 1
[4] 3 1
[5] 4 1
[6] 4 2
[7] 5 2
[8] 6 2
[9] 7 2
-------
queryLength: 7 / subjectLength: 2
# 或者根据逻辑判断直接获取overlap子集
> gr6[gr6 %over% gr7]
GRanges object with 7 ranges and 0 metadata columns:
seqnames ranges strand
[1] chr2 6-16 *
[2] chr2 8-11 *
[3] chr2 12-13 *
[4] chr2 14-18 *
[5] chr2 21-27 *
[6] chr2 22-28 *
[7] chr2 23-29 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
-
tile
创建窗口,可以指定窗口数量以及窗口宽度, -
slidingWindows
创建滑动窗口,指定窗口长度以及窗口移动的步长 -
tileGenome
返回一组基因组区域,这些区域构成特定基因组的分区