[300 Must-Learn Bioconductor Packages] IRanges

Contents:
A. Introduction
B. Usage: An Overview of the IRanges package
2.1 Normality
2.2 Lists of IRanges objects
2.3 Vector Extraction
2.4 Finding Overlapping Ranges
2.5 Counting Overlapping Ranges
2.6 Finding Neighboring Ranges
2.7 Transforming Ranges
2.8 Set Operations
C. Common questions on Biostars

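A. Introduction

The IRanges package provides the infrastructure for working with IRanges instances. The IRanges function is a constructor that can be used to create IRanges instances. IRanges is a foundational data type in Bioconductor: a set of integer ranges (also called interval ranges), each defined by a start position and an end position. A range can describe, for example, an exon within a gene structure, and more generally any position on a genome. Ranges appear throughout genomic analysis:
-Data, e.g., aligned reads, ChIP peaks, SNPs, CpG islands, …
-Annotations, e.g., gene models, regulatory elements, methylated regions
-Genomic ranges are defined by chromosome, start, end, and strand (an IRanges object describes only start, end, and width; chromosome and strand are added by the GenomicRanges package)
-Often, metadata is associated with each range, e.g., quality of alignment, strength of ChIP peak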

Figure 1. Exon table of the human gene KRAS

B. Usage

1. Install and load

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("IRanges")
library(IRanges)

2. Working with IRanges objects

ir1 <- IRanges(start=1:10, width=10:1)
ir2 <- IRanges(start=1:10, end=11)
ir3 <- IRanges(end=11, width=10:1)
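# && compares two length-one logical values (a scalar comparison).
# The result here is FALSE: every range in ir1 ends at 10, while those in ir2 and ir3 end at 11.
identical(ir1, ir2) && identical(ir1, ir3)
ir <- IRanges(c(1, 8, 14, 15, 19, 34, 40),
              width=c(12, 6, 6, 15, 6, 2, 7), names=letters[1:7])
class(ir)
# Arithmetic, subsetting, and logical operations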
start(ir)
end(ir)
width(ir)
start(ir) + 4
ir[1:4]
ir[start(ir) <= 15]
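# Visualization
plotRanges <- function(x, xlim = x, main = deparse(substitute(x)), col = "black", sep=0.5){
  height = 1
  if (is(xlim, "Ranges"))
    xlim = c(min(start(xlim)), max(end(xlim)))  # overall minimum start and maximum end of the ranges
  # disjointBins() assigns each range to a bin (row) so that ranges sharing a bin do not overlap:
  # bin 1 is filled first, ranges that would collide go to bin 2, and so on
  bins = disjointBins(IRanges(start(x), end(x)+1))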
  plot.new()
  plot.window(xlim, c(0, max(bins) * (height + sep)))
  ybottom = bins * (sep + height) - height
  rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col)
  title(main)
  axis(1)
  invisible(ybottom)
}
plotRanges(ir, col = "royalblue3")

2.1 Normality: reduce() merges redundant ranges into a normal IRanges

dir <- reduce(ir) # merge overlapping or adjacent ranges; for sequencing fragments this joins them into contiguous stretches
names(dir) <- paste0("gene",letters[24:26])
plotRanges(reduce(ir),col = "royalblue3")
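
As a small check of what reduce() does, the sketch below uses a made-up set of four ranges (the object x is illustrative, not part of the tutorial data). It assumes the default arguments, under which overlapping and adjacent ranges are both merged.

```r
library(IRanges)

# Hypothetical ranges: [1,12] and [8,13] overlap, and [14,19] is adjacent to 13
x <- IRanges(start = c(1, 8, 14, 34), end = c(12, 13, 19, 35))

merged <- reduce(x)   # the first three ranges collapse into [1, 19]; [34, 35] stays
merged
isNormal(merged)      # TRUE: sorted, non-empty, non-overlapping, non-adjacent
```

The result has the same coverage as the input but with the fewest possible ranges, which is why reduce() is a common first step before counting or plotting.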

2.2 Lists of IRanges objects: combining several IRanges objects into a list

rl <- IRangesList(ir, rev(ir)) # rev() reverses the order of the ranges
start(rl)
class(rl)

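2.3 Vector Extraction: extracting a stretch of a sequence

set.seed(0) # make the simulated data reproducible
lambda <- c(rep(0.001, 4500), seq(0.001, 10, length=500),seq(10, 0.001, length=500)) # seq() builds evenly spaced values (it is not random)
xVector <- rpois(1e7, lambda) # draw 1e7 Poisson values, recycling lambda as the rate
# The Poisson parameter λ is the average rate at which a random event occurs per unit of time (or area). The distribution suits counts of random events per unit of time.
yVector <- rpois(1e7, lambda[c(251:length(lambda), 1:250)])
xRle <- Rle(xVector) # To store very long sequences in R, IRanges compresses them with run-length encoding (RLE); for example, 444333222 is stored as three 4s, three 3s, three 2s.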
yRle <- Rle(yVector)
irextract <- IRanges(start=c(4501, 4901) , width=100)
xRle[irextract]
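
To make the run-length idea concrete, here is a minimal sketch on the 444333222 example from the comment above. The object names x and picked are illustrative; the calls used are Rle(), runLength(), runValue(), and subsetting by an IRanges.

```r
library(IRanges)

# The sequence 4 4 4 3 3 3 2 2 2 stored as three runs
x <- Rle(c(4L, 4L, 4L, 3L, 3L, 3L, 2L, 2L, 2L))
runLength(x)   # 3 3 3
runValue(x)    # 4 3 2

# Extract positions 2-3 and 7-8 with an IRanges, as xRle[irextract] does above
picked <- x[IRanges(start = c(2, 7), width = 2)]
as.vector(picked)   # 4 4 2 2
```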

2.4 Finding Overlapping Ranges

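Finding overlaps is an essential step in many omics analyses. In RNA-seq, for example, overlaps between reads and annotated features are used to gauge cellular activity by measuring expression levels and to identify different isoforms. The function signature is:
findOverlaps(query, subject, maxgap = -1L, minoverlap = 0L, type = c("any", "start", "end", "within", "equal"), select = c("all", "first", "last", "arbitrary"), ...)

ol <- findOverlaps(ir, dir) # the result is stored in ol as a Hits object with two columns (query and subject indices)
as.matrix(ol)
# Use the accessor functions queryHits() and subjectHits() to get the indices
names(ir)[queryHits(ol)]
names(dir)[subjectHits(ol)]
# The overlaps describe a mapping between query and subject positions, shown as a two-column matrix; naming the ranges makes the result easier to read.
# For example, query 1 (range a) overlaps subject 1 (genex). subjectHits contains five 1s, so genex overlaps the five ranges a, b, c, d, and e.
# type argument: the default, "any", counts any overlap between two ranges. To find queries that fall entirely inside a subject, use type = "within".
# select argument: when a query overlaps several subjects, select chooses which hit to return.
Figure 2. Understanding overlaps and neighbor distances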
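
The effect of the type and select arguments is easier to see on a tiny example. The sketch below uses made-up query and subject ranges (all object names are illustrative) and relies only on findOverlaps() and countOverlaps().

```r
library(IRanges)

# Hypothetical ranges chosen so that the different modes give different answers
query   <- IRanges(start = c(1, 8, 34),  end = c(12, 13, 35), names = c("a", "b", "f"))
subject <- IRanges(start = c(1, 10, 34), end = c(29, 11, 40), names = c("s1", "s2", "s3"))

hits_any    <- findOverlaps(query, subject)                    # any shared position counts
hits_within <- findOverlaps(query, subject, type = "within")   # query must lie inside subject
first_hit   <- findOverlaps(query, subject, select = "first")  # one subject index per query
n_hits      <- countOverlaps(query, subject)                   # number of hits per query

as.matrix(hits_any)      # 5 hits: a-s1, a-s2, b-s1, b-s2, f-s3
as.matrix(hits_within)   # 3 hits: a-s1, b-s1, f-s3
first_hit                # 1 1 3
n_hits                   # a = 2, b = 2, f = 1
```

With select = "first" the return value is a plain integer vector rather than a Hits object, which is convenient when each query should map to at most one subject.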

2.5 Counting Overlapping Ranges

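coverage() counts the number of ranges covering each position; it can be read as the overlap depth of the ranges along a sequence.

cov <- coverage(ir) # an Rle giving, for each position, the number of ranges that cover it
plotRanges(ir) # the plot shows three clearly separated clusters
cov <- as.vector(cov)
mat <- cbind(seq_along(cov)-0.5, cov)
d <- diff(cov) != 0
mat <- rbind(cbind(mat[d,1]+1, mat[d,2]), mat)
mat <- mat[order(mat[,1]),]
lines(mat, col="red", lwd=4) # red line: depth 1 where a single range covers a position, 2 where two ranges overlap, and so on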
axis(2)
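
The numbers behind the red line can be checked directly. The sketch below rebuilds the same seven ranges (without names) and inspects the coverage vector; the variable names depth and ir_demo are illustrative.

```r
library(IRanges)

# Same starts and widths as the ir object used throughout this section
ir_demo <- IRanges(c(1, 8, 14, 15, 19, 34, 40), width = c(12, 6, 6, 15, 6, 2, 7))

depth <- as.vector(coverage(ir_demo))
length(depth)        # 46: coverage runs from position 1 to the largest end
max(depth)           # 3
which(depth == 3)    # 19: the only position covered by three ranges
sum(depth)           # 54, the same as sum(width(ir_demo))
```

The last line is a useful sanity check: every position of every range contributes exactly 1 to the coverage, so the total depth equals the total width.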

2.6 Finding Neighboring Ranges寻找邻近Ranges,计算距离

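nearest() finds the nearest neighboring range (an overlapping range counts as distance zero); precede() and follow() find the nearest non-overlapping neighbor on a specific side.

qry <- IRanges(start = 6, end = 13, names = 'query')
sbj <- IRanges(start = c(2,4,18,19), end=c(4,5,21,24),names=1:4)
nearest(qry, sbj) # index of the nearest subject range
precede(qry,sbj) # index of the nearest subject range that qry precedes (the closest one after it)
follow(qry,sbj) # index of the nearest subject range that qry follows (the closest one before it)
# These functions are vectorized over the query
qry2 <- IRanges(start=c(6,7), width=3)
nearest(qry2,sbj)
# distanceToNearest() and distance() give the distance between ranges
qry <- IRanges(sample(seq_len(1000),5),width=10)
sbj <- IRanges(sample(seq_len(1000),5),width=10)
distanceToNearest(qry,sbj)
distance(qry,sbj) # returns only the distances, computed element-wise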
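
Because the example above ends with random ranges, the output changes from run to run. The sketch below uses fixed, made-up ranges with no ties so that each result can be verified by hand; q and s are illustrative names.

```r
library(IRanges)

q <- IRanges(start = 10, end = 15)
s <- IRanges(start = c(1, 5, 20, 30), end = c(3, 7, 22, 35))

nearest(q, s)    # 2: [5, 7] is closest (positions 8 and 9 lie between it and q)
precede(q, s)    # 3: q comes directly before [20, 22]
follow(q, s)     # 2: q comes directly after [5, 7]

distance(q, s[2])               # 2: the number of positions in the gap
near <- distanceToNearest(q, s) # Hits object with a 'distance' metadata column
mcols(near)$distance            # 2
```

Note the naming convention: precede() and follow() are phrased from the query's point of view, so precede() returns a subject that lies after the query.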

2.7 Transforming Ranges
2.7.1 Adjusting starts, ends and widths

shift(ir, 10) # shift every range 10 positions to the right, e.g. [1, 10] → [11, 20]
# Related functions: narrow, resize, flank, reflect, restrict, and threebands
# narrow(x, start=NA, end=NA, width=NA, use.names=TRUE)
narrow(ir, start=1:5, width=2)  # start is relative to each range (1:5 is recycled): the new start is old start + start - 1 and the width becomes 2, e.g. starts c(1, 2, 3, 4, 5, 6) → c(1, 3, 5, 7, 9, 6)
# resize(x, width, fix="start", use.names=TRUE, ...)
resize(ir, width = 2, fix = "start") # keep start fixed and set the width to 2
resize(ir, width = 2, fix = "end") # keep end fixed and set the width to 2
# restrict(x, start=NA, end=NA, keep.all.ranges=FALSE, use.names=TRUE)
restrict(ir, start=2, end=3) # clip the ranges to [2, 3]; ranges entirely outside that interval are dropped
# threebands(x, start=NA, end=NA, width=NA) extends narrow(): it returns three related sets of ranges, "left", "middle", and "right", where "middle" is what narrow() would return:
threebands(ir, start=1:5, width=2)
# Arithmetic operators +, - and *
ir + seq_len(length(ir)) # widen range i by i positions on each side
ir * -2 # zoom out: each range becomes twice as wide
ir * 2  # zoom in: each range becomes half as wide
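
Applying each transformation to a single range makes the coordinate arithmetic easy to follow. The sketch below uses a made-up range [11, 20] (the name r and the result names are illustrative).

```r
library(IRanges)

r <- IRanges(start = 11, end = 20)   # width 10

shifted    <- shift(r, 10)                        # [21, 30]
narrowed   <- narrow(r, start = 3, width = 2)     # [13, 14]: start counted from inside r
resized    <- resize(r, width = 2, fix = "end")   # [19, 20]
restricted <- restrict(r, start = 15, end = 18)   # [15, 18]
widened    <- r + 2                               # [9, 22]: 2 extra positions per side
zoom_in    <- r * 2                               # width 5
zoom_out   <- r * -2                              # width 20

c(start(narrowed), end(narrowed))
c(width(zoom_in), width(zoom_out))
```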

2.7.2 Making ranges disjoint

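# disjoin() makes the ranges disjoint by splitting them into the widest ranges over which the set of overlapping original ranges stays the same.
disjoin(ir)
plotRanges(disjoin(ir))
# disjointBins() divides the ranges into bins such that the ranges within each bin are disjoint. It returns an integer vector of bin indices.
disjointBins(ir)
Figure 3. disjoin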
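
A two-range example shows the difference between the two functions. The ranges below are made up for illustration: disjoin() cuts the overlapping pair into three pieces, while disjointBins() only says which row each original range would occupy in a plot.

```r
library(IRanges)

x <- IRanges(start = c(1, 5), end = c(10, 15))   # overlap on [5, 10]

pieces <- disjoin(x)
pieces                 # [1, 4], [5, 10], [11, 15]

bins <- disjointBins(x)
bins                   # 1 2: the second range must go to a second bin
```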

2.7.3 Other transformations

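# reflect() flips each range within a set of common reference bounds.
reflect(ir, IRanges(start(ir), width=width(ir)*2))

# flank() returns ranges of a given width that flank each range; one use is forming promoter regions for a set of genes
flank(ir, width=seq_len(length(ir)))
# flank(x, width, start=TRUE, both=FALSE, use.names=TRUE, ...) returns the flanking region at one end,
# e.g. the left side for the transcription start site and the right side for the transcription end site
#http://www.bioconductor.org/help/course-materials/2014/SeattleOct2014/A01.3_BioconductorForSequenceAnalysis.html
# flank() works on the start side (upstream) by default; set start=FALSE to flank the end side (downstream)
ir.trans <- flank(ir, width = 2, both = TRUE) # both=TRUE extends 2 positions on each side of the start, so start 1 becomes the range [-1, 2]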
plots <- function(x, xlim = x, main = deparse(substitute(x)), col = "black",
                  add = FALSE, ybottom = NULL, ...) {
  # semi-transparent fill colour; requires the scales package to be installed
  col <- scales::alpha(col, 0.5)
  height <- 1
  sep <- 0.5
  if (is(xlim, "Ranges")) {
    xlim <- c(min(start(xlim)), max(end(xlim)) * 1.2)
  }
  if (!add) {
    bins <- disjointBins(IRanges(start(x), end(x) + 1))
    ybottom <- bins * (sep + height) - height
    par(mar = c(3, 0.5, 2.5, 0.5), mgp = c(1.5, 0.5, 0))
    plot.new()
    plot.window(xlim, c(0, max(bins) * (height + sep)))
  }
  rect(start(x) - 0.5, ybottom, end(x) + 0.5, ybottom + height, col = col,
       ...)
  text((start(x) + end(x))/2, ybottom + height/2, 1:length(x), col = "white",
       xpd = TRUE)
  title(main)
  axis(1)
  invisible(ybottom)
}
xlim <- c(0, max(end(ir), end(ir.trans)) * 1.3) # end() takes one object at a time
ybottom <- plots(ir, xlim = xlim, col = "red")
plots(ir.trans, col = "blue", add = TRUE, ybottom = ybottom)
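
The three flank() modes described in the comments above can be compared on one made-up range. This is a minimal sketch; the object names are illustrative.

```r
library(IRanges)

r <- IRanges(start = 11, end = 20)

up   <- flank(r, width = 3)                  # [8, 10]: 3 positions before the start
down <- flank(r, width = 3, start = FALSE)   # [21, 23]: 3 positions after the end
both <- flank(r, width = 2, both = TRUE)     # [9, 12]: 2 positions on each side of the start

c(start(up), end(up))
c(start(down), end(down))
c(start(both), end(both))
```

The flanks never include the range itself unless both = TRUE, in which case the result straddles the chosen boundary.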

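2.8 Set Operations

# gaps(): a gap runs from the end of one covered region to the start of the next
gaps(ir, start=1, end=50) # regions within [1, 50] that ir does not cover
plotRanges(gaps(ir, start=1, end=50), c(1,50))
# union(), intersect(), and setdiff() (set difference) treat each object as a set of integers; the element-wise (parallel) versions are punion(), pintersect(), psetdiff(), and pgap()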
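
The set operations listed in the last comment are not demonstrated in the text, so here is a small sketch on made-up ranges (x, y, and covered are illustrative names). Each object is treated as the set of integer positions it covers.

```r
library(IRanges)

x <- IRanges(start = 1, end = 10)
y <- IRanges(start = 5, end = 15)

u  <- union(x, y)        # [1, 15]
i  <- intersect(x, y)    # [5, 10]
d  <- setdiff(x, y)      # [1, 4]: positions in x but not in y
pi <- pintersect(x, y)   # [5, 10]: intersection of the i-th range of x with the i-th of y

covered <- IRanges(start = c(1, 20), end = c(10, 30))
g <- gaps(covered, start = 1, end = 40)   # [11, 19] and [31, 40]
g
```

The set-wise functions first reduce their inputs, so the number of ranges in the result need not match the inputs; the parallel p* versions keep one result per input pair.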

3 Vector Views

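A Views object stores a vector-like object, called the "subject", together with an IRanges object that defines ranges on the subject. Each range represents one view onto the subject.

3.1 Creating Views

# Views() builds views from explicit ranges or a logical indicator; slice() builds them from numeric bounds.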
xViews <- Views(xRle, xRle >= 1)
xViews <- slice(xRle, 1)
xRleList <- RleList(xRle, 2L * rev(xRle))
xViewsList <- slice(xRleList, 1)

3.2 Aggregating Views

# The native functions viewMaxs, viewMins, viewSums, and viewMeans compute descriptive statistics for each view
head(viewSums(xViews))
viewSums(xViewsList)
head(viewMaxs(xViews))
viewMaxs(xViewsList)
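
On the simulated 1e7-element vector the views are hard to inspect by eye, so the sketch below repeats the same steps on a seven-element Rle. The data and the names x and v are made up for illustration.

```r
library(IRanges)

x <- Rle(c(0L, 0L, 3L, 5L, 0L, 2L, 0L))

# Keep the stretches where the value is at least 1
v <- slice(x, lower = 1)
start(v)      # 3 6
width(v)      # 2 1

viewSums(v)   # 8 2
viewMaxs(v)   # 5 2
```

Each view is just a window onto x, so the summaries are computed without copying the underlying data.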

4 Lists of Atomic Vectors

showClass("RleList")
args(IntegerList)
cIntList1 <- IntegerList(x=xVector, y=yVector)
cIntList1
sIntList2 <- IntegerList(x=xVector, y=yVector, compress=FALSE)
sIntList2
# sparse integer list
xExploded <- lapply(xVector[1:5000], function(x) seq_len(x))
cIntList2 <- IntegerList(xExploded)
sIntList2 <- IntegerList(xExploded, compress=FALSE)
object.size(cIntList2)
object.size(sIntList2)
# lengths() returns an integer vector holding the length of each list element;
# length() returns the number of elements of a vector-like object, including List derivatives such as SimpleList or CompressedList
length(cIntList2)
Rle(lengths(cIntList2))
# lapply/sapply for looping, [[ for element extraction, c for concatenation
system.time(sapply(xExploded, mean))
system.time(sapply(sIntList2, mean))
system.time(sapply(cIntList2, mean))
identical(sapply(xExploded, mean), sapply(sIntList2, mean))
identical(sapply(xExploded, mean), sapply(cIntList2, mean))
#AtomicList objects support the Ops (e.g. +, ==, &), Math (e.g.log, sqrt), Math2 (e.g. round, signif), Summary (e.g. min, max, sum), and Complex (e.g. Re,Im) group generics.
xRleList > 0
yRleList <- RleList(yRle, 2L * rev(yRle))
xRleList + yRleList
sum(xRleList > 0 | yRleList > 0)
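# AtomicList objects derive from List, so they can also use the looping function endoapply to perform an endomorphism (the result has the same class as the input)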
safe.max <- function(x) { if(length(x)) max(x) else integer(0) }
endoapply(sIntList2, safe.max)
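
The group generics and endoapply() are easier to follow on a list with three short elements. The sketch below uses a made-up IntegerList (il and the result names are illustrative), including one empty element to show how it is handled.

```r
library(IRanges)

il <- IntegerList(a = 1:3, b = integer(0), c = c(10L, 20L))

n       <- lengths(il)           # a = 3, b = 0, c = 2
totals  <- sum(il)               # a = 6, b = 0, c = 30 (Summary group generic, per element)
doubled <- il * 2L               # Ops group generic, applied within each element
flipped <- endoapply(il, rev)    # still an IntegerList, each element reversed

doubled[["c"]]   # 20 40
flipped[["a"]]   # 3 2 1
```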

5 Session Information

sessionInfo() # shows the session details of the system on which this document was compiled

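C. Common questions about the IRanges package

The questions here come from the Biostars Q&A site, https://www.biostars.org/, which focuses on bioinformatics. As biological data analysis has grown in popularity, so has Biostars. After years of accumulated posts, answers to "almost" any technical problem you run into can be found there.
First, it is a plain Q&A page, clear and simple; anyone who has used a social networking site will pick it up at a glance.

Figure 4

Clicking ALL in the upper-right corner shows how often the key tags appear in posts; RNA-seq is the hottest topic, followed by R (as of 2020-03-12).

Figure 5

Second, the site supports keyword search. Searching before asking also makes learning more efficient. Entering IRanges in the search box returns 50 results.


Figure 6

Overlaps are mentioned more than 9 times, so the hot questions about IRanges concern overlaps, and evidently many people have stumbled here. If you are interested, join the discussion and share your experience or your questions. S4 class is mentioned twice, a reminder to understand the data structures each package supports. Reading through the questions also makes it clear that finding regions and finding sites are the most common uses of IRanges and must be mastered. Below is a simple list of the questions: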

  1. Question: I can not load a fasta file in Rstudio. The installation method was probably at fault; the answer recommends installing the latest version.

  2. A: What Is The Best Library For Handling Interval Logic?

  3. A: Calculate Mapping Density Along Chromosome Coordinates

  4. A: Creating fasta files to BLAST- what's that fastest way

  5. A: How To Determine Overlaps From Coordinates

  6. A: Finding overlapping ranges in R

  7. Bioconductor - Error : Function found is not S4 generic (resolved by restarting)

  8. A: The Longest Chromosome > Sizeof(Int32)

  9. A: Finding Common Annotations In Gtf And Gff Files Of Different Origin

  10. A: How do I subset a GRanges on chromosome, region and strand?

  11. A: Bedtools, report feature with highest overlap if overlaps two features (the thread notes that Bedtools can do what the code above does; worth exploring later)

  12. A: Quick Programming Challenge: How Do I Calculate Reference Coverage From A Table

  13. A: How To Use R To Segment Genome And Count Reads From Sequencing Data? (countOverlaps)

  14. A: What Is The Quickest Algorithm For Range Overlap?

  15. A: Extract intergenic coordinate from gff

  16. A: Tools for gene annotation

  17. A: Look for SNPs that span segments with R?

  18. Cannot load package "hgu133a.db"

  19. A: How To Intersect A Range With Single Positions

  20. A: How do I get ranges for first or last n nucleotides in genomicranges

  21. A: Retrieve all the differentially methylated CpG's in a region

  22. A: Found Correspondent Numbers In Integer Intervals (R)

  23. I can not load a fasta file in Rstudio

  24. A: Cross-Correlation On Chip-Seq Like Data?
    (by findOverlaps in R IRanges or using BEDTools).

  25. A: Remove entries in a IRangesList

  26. A: Alternate Transcripts Paralogs?

  27. A: Quick And Easy Way To Find Whether A Locus Is Falling In Which Region Of Exon

  28. Representing Interaction Data Using Iranges/Granges For 5C/Hi-C Data Analysis

  29. A: SciClone installation in R version 3.3.3

  30. Error in .normargSEW0(start, "start"): 'start' must be a numeric vector (or NULL)

  31. Cannot coerce class 'structure("IRanges", package = "IRanges")' into a data.frame error.

  32. Calculate Gene Density Per Kb And Plot Density Over Position For All Scaffolds Of A Draft Genome Using R

  33. A: Multiple Samples Data And Chromosome Ideograms From Cnv Caller

  34. A: In R bioconductor, how to combine DNAString views please?

  35. SciClone installation in R version 3.3.3

  36. A: Quick Programming Challenge: Calculate Common And Unique Regions From A List Of

  37. What is a good strategy to find hit regions from a GWAS in R?

  38. A: Intersecting A Set Of Bam Reads With A Set Of Coordinates

  39. A: Plot average expression profile from one bed file across overlapping regions

  40. A: New Hg18 Genome In R

  41. A: Finding overlapping ranges in R?

  42. CellCODE installation issues

  43. How to get overlap count (in basepair) of two IRange objects?

  44. Remove entries in a IRangesList?

  45. A: Common genomic intervals in R?

If you still have questions, upcoming posts will cover the other Bioconductor packages one by one.


Summary of R packages on Bioconductor

References (for readers who want to consult the original sources):
[1] Lawrence M, Huber W, Pagès H, et al. (2013) Software for Computing and Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118
[2][Bioconductor basics: IRanges and Range thinking]https://mp.weixin.qq.com/s?src=11&timestamp=1584153503&ver=2215&signature=oxgHX99O10ky9BK8-ORjaArSjsbZkpK4st-W6E3rQMIa34Ikq1JlN6tpBgty7TDSBdF4rtRPgiLTkcZtVmD9nZojH5BR9f3RbDcZ7LeLheEMs48*KhrA09rzjzsciI&new=1
[3][Bioconductor for Sequence Analysis]http://www.bioconductor.org/help/course-materials/2014/SeattleOct2014/A01.3_BioconductorForSequenceAnalysis.html
[4]Parnell LD, Lindenbaum P, Shameer K, Dall’Olio GM, Swan DC, et al. (2011) BioStar: An Online Question & Answer Resource for the Bioinformatics Community. PLoS Comput Biol 7(10): e1002216. doi:10.1371/journal.pcbi.1002216
