A. Introduction
The GenomicRanges package serves as the foundation for representing genomic locations within the Bioconductor project.
This package lays a foundation for genomic analysis by introducing three classes (GRanges, GPos, and GRangesList), which are used to represent genomic ranges, genomic positions, and groups of genomic ranges. This vignette focuses on the GRanges and GRangesList classes and their associated methods.
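The GenomicRanges package is the basis for working with genomic locations. A GRanges object holds a set of intervals, for example all exons of one gene. A genomic range (interval) consists of a sequence (chromosome) name, start and end positions, and a strand, and its coordinates are specific to one genome build. Compared with IRanges, GRanges also records the sequence name and strand of each range, and it can carry per-range annotations such as a score or GC content.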
B. Usage
1. Download, install, and load
Sys.setenv(LANGUAGE = "en")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenomicRanges")
library(GenomicRanges)
2. Basic operations: creating a GRanges object
gr <- GRanges(
seqnames = Rle(c("chr1", "chr2", "chr1", "chr3"), c(1, 3, 2, 4)),  # sequence (chromosome) name of each range
ranges = IRanges(101:110, end = 111:120, names = head(letters, 10)),  # start and end coordinates, plus range names
strand = Rle(strand(c("-", "+", "*", "+", "-")), c(1, 2, 2, 3, 2)),  # strand of each range
score = 1:10,
GC = seq(1, 0, length=10))
gr
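The output shows 10 genomic ranges, divided into two blocks by the | separator:
Left: the genomic coordinates (seqnames, ranges, strand), which are required; view them with granges(gr).
Right: the metadata columns (score, GC, and so on), which are optional annotations; view them with mcols(gr), or a single column with mcols(gr)$score.
Metadata here means data that describes each range, stored as extra columns next to its coordinates.
width() returns the width of each range; length() returns the number of ranges; names() returns the range names shown in the leftmost column.
# Accessors for seqnames, ranges, and strand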
seqnames(gr)
ranges(gr)
strand(gr)
# granges() extracts the genomic ranges without the metadata columns
granges(gr)
# mcols() returns the metadata columns as a DataFrame
mcols(gr)
mcols(gr)$score
# The lengths of the reference sequences can also be stored in a GRanges; for human data, for example:
seqlengths(gr) <- c(249250621, 243199373, 198022430)
# They are then retrieved with:
seqlengths(gr)
names(gr)
length(gr)
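2.1 Splitting and combining GRanges objects
sp <- split(gr, rep(1:2, each=5))  # split into a GRangesList with two groups
sp
c(sp[[1]], sp[[2]])  # combine back into a single GRanges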
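2.2 Subsetting GRanges objects
gr[2:3]  # [] selects ranges (rows) 2 and 3
gr[2:3, "GC"]  # the second argument selects metadata columns (right of |)
# Assigning elements to a GRanges object
singles <- split(gr, names(gr))  # split by range name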
grMod <- gr
grMod
grMod[2] <- singles[[1]]  # replace the second range of grMod with the first range of gr
head(grMod, n=3)
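# Repeat, reverse, and select specific ranges
rep(singles[[2]], times = 3)  # repeat 3 times
rev(gr)  # reverse the order
head(gr, n=2)  # first two ranges
tail(gr, n=2)  # last two ranges
window(gr, start=2, end=4)  # ranges 2 to 4
gr[IRanges(start=c(2,7), end=c(3,9))]  # ranges 2 to 3 and 7 to 9
# Filtering
gr[gr$score < 5]  # score below 5
gr[gr$score > 5 & gr$GC < 0.8]  # score above 5 and GC content below 0.8
gr[strand(gr) == "+"]  # ranges on the plus strand
gr[strand(gr) != "+"]  # ranges not on the plus strand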
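# Sorting
# sort(gr) orders ranges by sequence level, then by strand (+ before -, then *), then by start and end.
# order() returns the ordering index; pass it a metadata column (right of |) to sort by annotation.
gr[order(gr$GC, decreasing = TRUE)]  # sort by GC content, descending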
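The difference between the two sorting approaches is easier to see on a small object. The sketch below uses made-up toy data (the object name toy_gr and its values are assumptions for illustration only) in which the minus-strand range has the smallest start, so positional sorting and annotation sorting give different orders.

```r
library(GenomicRanges)

# Hypothetical toy data: three ranges, two chromosomes, one GC column
toy_gr <- GRanges(
  seqnames = c("chr1", "chr2", "chr1"),
  ranges   = IRanges(start = c(5, 10, 50), width = 10),
  strand   = c("-", "+", "+"),
  GC       = c(0.4, 0.9, 0.6))

# Positional sort: sequence level first, then strand, then start
sorted_by_position <- sort(toy_gr)
start(sorted_by_position)

# Annotation sort: order() on a metadata column gives the index to subset with
sorted_by_gc <- toy_gr[order(toy_gr$GC, decreasing = TRUE)]
sorted_by_gc$GC
```

Within chr1 the plus-strand range comes before the minus-strand one even though its start is larger, because strand is compared before start.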
2.3 Basic interval operations
g <- gr[1:3]
g <- append(g, singles[[10]])
start(g)
end(g)
width(g)
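range(g)  # the overall span (smallest start to largest end) per sequence and strand
# There are three groups of methods: intra-range, inter-range, and between-range.
# Intra-range methods transform each range independently of the others.
# flank() returns the 10 bp region upstream of each range (before the start on the + strand, after the end on the - strand).
flank(g, 10)
grflank <- flank(g, 10, start=FALSE)  # start=FALSE returns the downstream flank instead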
# If flanking pushes a start coordinate below 1, clamp it to 1.
start(grflank) <- pmax(start(grflank), 1L)
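The ranges used in this tutorial start at position 101, so their flanks never go below 1. To see the clamping step do something, here is a minimal sketch with a hypothetical range that sits close to the start of its chromosome (near_start and upstream_flank are illustrative names, not objects from the tutorial).

```r
library(GenomicRanges)

# Hypothetical ranges; the first one begins only 4 bp into the chromosome
near_start <- GRanges("chr1", IRanges(start = c(5, 200), width = 10), strand = "+")

# A 10 bp upstream flank of the first range would begin at position -5
upstream_flank <- flank(near_start, 10)
start(upstream_flank)

# Clamp starts below 1 to position 1; the end coordinates stay as they are
start(upstream_flank) <- pmax(start(upstream_flank), 1L)
upstream_flank
```

The clamped flank is shorter than the requested 10 bp, which is worth remembering if later steps assume a fixed width.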
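# The upstream flank is where promoters are usually looked for.
# The downstream region can be used to look for 3' UTR features. UTR length varies, and the 3' UTR (trailer) can contain alternative polyadenylation elements. Longer UTRs generally offer more microRNA target sites.
# Promoters are often taken as roughly 2 kb upstream to 100-200 bp downstream of the TSS (transcription start site), so they do not lie strictly before the TSS.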
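# The promoters() function does this directly and is strand-aware.
# Ranges that extend past the sequence ends may trigger an out-of-bound warning; trim() clips them.
promoters(g, upstream = 2000, downstream = 10)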
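To make the strand handling of promoters() concrete, the sketch below applies it to two hypothetical genes with identical coordinates on opposite strands (toy_genes and the 2000/200 window are assumptions chosen to match the rule of thumb mentioned above).

```r
library(GenomicRanges)

# Two hypothetical genes with the same coordinates on opposite strands
toy_genes <- GRanges("chr1",
                     IRanges(start = c(5000, 5000), end = c(6000, 6000)),
                     strand = c("+", "-"))

# 2 kb upstream and 200 bp downstream of each TSS
toy_promoters <- promoters(toy_genes, upstream = 2000, downstream = 200)
toy_promoters
width(toy_promoters)
```

For the plus-strand gene the TSS is the start coordinate, for the minus-strand gene it is the end coordinate, so the two promoter windows fall on opposite sides of the gene while having the same width.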
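# shift() moves each range by a given number of base pairs.
shift(g, 5)  # positive values move toward higher coordinates, regardless of strand
# resize() sets each range to the given width.
resize(g, 30)  # the 5' end stays fixed by default (start on + and *, end on -)
# See ?"intra-range-methods" for a summary.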
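A small sketch of how shift() and resize() treat strand, using a hypothetical pair of ranges with the same coordinates on opposite strands (strand_pair is an illustrative name).

```r
library(GenomicRanges)

# Same coordinates, opposite strands
strand_pair <- GRanges("chr1",
                       IRanges(start = c(100, 100), end = c(199, 199)),
                       strand = c("+", "-"))

# shift() ignores strand: both ranges move 5 bp toward higher coordinates
shifted_pair <- shift(strand_pair, 5)
start(shifted_pair)

# resize() is strand-aware: with the default fix = "start" the 5' end is kept,
# which is the start on the plus strand and the end on the minus strand
resized_pair <- resize(strand_pair, 30)
resized_pair
```

To move ranges upstream in a strand-aware way, one would have to choose the sign of the shift per strand; shift() does not do that on its own.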
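# Inter-range methods compare the ranges within a single object.
# reduce() merges overlapping and adjacent ranges into a simplified set.
reduce(g)
# gaps() returns the regions between ranges that no range covers.
gaps(g)
# disjoin() splits the ranges into a set of non-overlapping pieces.
disjoin(g)
# coverage() counts how many ranges overlap each position.
coverage(g)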
# see help ?"inter-range-methods" for more help
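The inter-range methods are easiest to follow on a handful of overlapping intervals. The sketch below uses three hypothetical unstranded ranges (toy_iv is an illustrative name), two of which overlap.

```r
library(GenomicRanges)

# Hypothetical intervals: 1-5 and 3-8 overlap, 10-12 stands alone
toy_iv <- GRanges("chr1", IRanges(start = c(1, 3, 10), end = c(5, 8, 12)))

merged_iv  <- reduce(toy_iv)    # overlapping ranges fused: 1-8 and 10-12
pieces_iv  <- disjoin(toy_iv)   # cut at every boundary: 1-2, 3-5, 6-8, 10-12
depth_iv   <- coverage(toy_iv)  # per-position depth, one Rle per sequence
uncovered  <- gaps(toy_iv)      # stretches that no range covers

merged_iv
pieces_iv
depth_iv[["chr1"]]
uncovered
```

reduce() and disjoin() are complementary: the first gives the fewest ranges covering the same positions, the second the finest split in which no two ranges overlap.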
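2.4 Interval set operations
# Between-range methods compute relationships between different GRanges objects.
# The most important of these are findOverlaps() and related functions.
g2 <- head(gr, n=2)
union(g, g2)  # union
intersect(g, g2)  # intersection
setdiff(g, g2)  # set difference
# punion(), pintersect(), and psetdiff() work in parallel (element-wise): element 1 of the first object is paired with element 1 of the second, and so on.
# Requirement: both objects have the same number of elements, and paired elements share the same sequence name and strand.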
g3 <- g[1:2]
ranges(g3[1]) <- IRanges(start=105, end=112)
punion(g2, g3)
pintersect(g2, g3)
psetdiff(g2, g3)
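The whole-set and parallel versions can give quite different results on the same input. The following sketch uses two hypothetical objects (set_a and set_b are illustrative names) chosen so that the difference shows up.

```r
library(GenomicRanges)

set_a <- GRanges("chr1", IRanges(start = c(1, 12), end = c(10, 30)), strand = "+")
set_b <- GRanges("chr1", IRanges(start = c(5, 25), end = c(15, 35)), strand = "+")

# Whole-set operations treat each object as one set of covered positions
whole_union     <- union(set_a, set_b)      # everything fuses into 1-35
whole_intersect <- intersect(set_a, set_b)  # 5-10, 12-15, 25-30

# Parallel operations pair element i of set_a with element i of set_b
pair_union     <- punion(set_a, set_b)      # 1-15 and 12-35
pair_intersect <- pintersect(set_a, set_b)  # 5-10 and 25-30

whole_union
pair_union
```

The parallel versions always return one result per input pair, while the whole-set versions return however many ranges the combined coverage requires.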
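3. GRangesList: groups of genomic ranges
# A GRangesList holds several groups of ranges on the genome (for example the exons of several genes); grouping ranges by transcript captures compound features such as the exons that make up each transcript.
# Conceptually it behaves like a base R list.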
gr1 <- GRanges(
seqnames = "chr2",
ranges = IRanges(103, 106),
strand = "+",
score = 5L, GC = 0.45)
gr2 <- GRanges(
seqnames = c("chr1", "chr1"),
ranges = IRanges(c(107, 113), width = 3),
strand = c("+", "-"),
score = 3:4, GC = c(0.3, 0.5))
grl <- GRangesList("txA" = gr1, "txB" = gr2)
grl
3.1 Basic GRangesList accessors
# The accessors match those of GRanges but return list-like objects.
seqnames(grl)
ranges(grl)
strand(grl)
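# length() and names() return the length and the names of the list.
length(grl)
names(grl)
seqlengths(grl)  # sequence lengths for the sequence levels of the whole list
# elementNROWS() returns the number of ranges in each list element.
# It is faster than sapply(grl, length).
elementNROWS(grl)
# isEmpty() tests whether the GRangesList contains any ranges.
isEmpty(grl)
# mcols() on a GRangesList gets or sets list-level metadata, one row per list element.
mcols(grl) <- c("Transcript A","Transcript B")
mcols(grl)
# Metadata of the individual ranges is reached after unlisting:
# mcols(unlist(grl))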
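3.2 Combining GRangesList objects
ul <- unlist(grl)
ul
# Append lists using append or c.
# When two GRangesList objects have parallel elements, there are two ways to combine them into one GRangesList:
# 1. pc(): parallel (element-wise) c().
# 2. Concatenate the lists, then regroup by some factor, here the element names.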
grl1 <- GRangesList(
gr1 = GRanges("chr2", IRanges(3, 6)),
gr2 = GRanges("chr1", IRanges(c(7,13), width = 3)))
grl2 <- GRangesList(
gr1 = GRanges("chr2", IRanges(9, 12)),
gr2 = GRanges("chr1", IRanges(c(25,38), width = 3)))
pc(grl1, grl2)
grl3 <- c(grl1, grl2)
regroup(grl3, names(grl3))
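3.3 Basic interval operations for GRangesList objects
start(grl)
end(grl)
width(grl)
# These return an IntegerList (a list whose elements are integer vectors); sum() gives the total width per element.
sum(width(grl))
# Shifting, computing coverage, and so on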
shift(grl, 20)
coverage(grl)
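3.4 Subsetting
# As with a list, [[ ]] and $ return a GRanges object, and [ ] returns a GRangesList.
grl[1]
grl[[1]]
grl["txA"]
grl$txB
# A second argument selects metadata columns.
grl[1, "score"]
grl["txB", "GC"]
# The basic operations head, tail, rep, rev, and window also work.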
rep(grl[[1]], times = 3)
rev(grl)
head(grl, n=1)
tail(grl, n=1)
window(grl, start=1, end=1)
grl[IRanges(start=2, end=2)]
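3.5 Looping over GRangesList objects
# Available looping functions: lapply, sapply, mapply, endoapply, mendoapply, Map, and Reduce.
lapply(grl, length)  # length of each element, returned as a list
sapply(grl, length)  # length of each element, returned as a vector
grl2 <- shift(grl, 10)
names(grl2) <- c("shiftTxA", "shiftTxB")
mapply(c, grl, grl2)  # combine corresponding elements of the two lists
# If you do not want the result simplified, use Map(); it behaves like mapply() without simplification.
Map(c, grl, grl2)
# endoapply() and mendoapply() return an object of the same class as the input (a GRangesList).
endoapply(grl, rev)
mendoapply(c, grl, grl2)
# Reduce(c, grl) concatenates all list elements into a single GRanges.
Reduce(c, grl)
# lapply and friends are slow on large lists. Faster alternatives:
# 1. Use vectorized methods where they exist, such as the group generics (see ?S4groupGeneric) and set operations like union and punion.
# 2. unlist() the list, work on the flat GRanges, then relist().
# Example: modify the ranges inside every list element
gr <- unlist(grl)
gr$log_score <- log(gr$score)  # add a metadata column log_score holding the log of score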
grl <- relist(gr, grl)
grl
#see ?extractList for more information
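4. Interval overlaps involving GRanges and GRangesList objects
# findOverlaps() is the most commonly used function; it returns the matching index pairs and accepts select = "first", "last", or "arbitrary".
# gr[gr %over% grl] keeps the ranges of gr that overlap grl.
findOverlaps(gr, grl)
# countOverlaps() gives the number of overlaps for each query element.
countOverlaps(gr, grl)
# subsetByOverlaps() directly returns the query ranges that overlap the subject.
subsetByOverlaps(gr,grl)
# With select = "first", the result is the index of the first overlapping subject element for each query element.
findOverlaps(gr, grl, select="first")
findOverlaps(grl, gr, select="first")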
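Because the objects above overlap themselves almost everywhere, the output is not very informative. As a minimal sketch, the same functions are applied below to hypothetical query and subject ranges (ov_query and ov_subject are illustrative names) where one query has no overlap and another has two.

```r
library(GenomicRanges)

ov_query   <- GRanges("chr1", IRanges(start = c(1, 50, 100), width = 10))
ov_subject <- GRanges("chr1", IRanges(start = c(5, 105, 108), width = 10))

# Hits object: one row per overlapping (query, subject) pair
ov_hits <- findOverlaps(ov_query, ov_subject)
queryHits(ov_hits)
subjectHits(ov_hits)

# Number of overlapping subjects per query
ov_counts <- countOverlaps(ov_query, ov_subject)

# First overlapping subject per query, NA where there is none
ov_first <- findOverlaps(ov_query, ov_subject, select = "first")

# Two equivalent ways to keep only the overlapping query ranges
ov_kept <- subsetByOverlaps(ov_query, ov_subject)
ov_kept_alt <- ov_query[ov_query %over% ov_subject]
```

The Hits object is the most general result; the other functions are shortcuts for common summaries of it.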
5. Session information
sessionInfo()
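C. Common questions about the GenomicRanges package (Biostars)
As the list shows, the most frequently asked questions are still about overlaps, so that topic deserves further study. If you have run into similar problems or know a solution, join the discussion.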
- A: how to retrieve the nucleotide/base in a certain position using any programming
See GenomicRanges. Its documentation is really helpful.
- A: Creating fasta files to BLAST- what's that fastest way
on the R/Bioconductor packages IRanges and GenomicRanges?
- A: How To Write Data In A Granges Object To A Bed File.
[The GenomicRanges vignette]
- A: Extract intergenic coordinate from gff
the function gaps (in packages IRanges and GenomicRanges) after importing the GFF file.
- A: Pulling out interval adjoining regions
The flank function; take chromosome lengths into account.
https://bioconductor.org/packages/3.7/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf
- A: Plotting TSS & TTS for Chip-seq peaks
The distances can easily be found with [GenomicRanges]
- Get indices from GenomicRanges reduce function
- A: How do I find the associated genes of TFBS when I have the TFBS starting and end
bedops closest-features nearest() in the GenomicRanges package
- A: Ensembl regulatory build promoter region and gene name
bedtools intersect or findOverlaps() (via GenomicRanges in R) or any of the other "interval overlap"
- A: distances betweeen my SNV and nearest exon
GenomicRanges in R by using the distanceToNearest() function
- A: how to compare variances with VCF file ?
bedtools, bedops, vcftools, or VariantAnnotation/GenomicRanges (Bioconductor).
- A: the closest gene to a breakpoint
nearest() from [GenomicRanges]
- A: What Is The Best Library For Handling Interval Logic?
packages IRanges or GenomicRanges
- A: Peaks And Nearby Genes
biomaRt, rtracklayer, org*, GenomicFeatures, and GenomicRanges packages provide flexibility for more advanced
- A: How To Plot The Coverage Of A Region In The Genome
See the GenomicRanges package for dealing with aligned reads and regions.
- A: Mapping and annotating DNA binding regions from ChIP-Seq to nearby gene
You can just use GenomicRanges with your peaks BED file and your annotation GFF (after they have both been converted to GenomicRanges objects. There is a parameter in the GenomicRanges findOverlaps() function
- A: Problem in using TCGAvisualize_meanMethylation function
library(SummarizedExperiment) library(GenomicRanges) They are already installed on your R version
- A: Get Rs Number Based On Position (6 million SNPs)
file from UCSC according to your build. Use GenomicRanges package from bioconductor in R to overlap the overlapped region. But you need to know to use GenomicRanges package for this. Its very fast and easy to
- A: Combine regions of an interaction matrix
an associated n-row GenomicRanges object, gc$ranges; and a GenomicRanges containing your bed-data <- function(gc, bed){ require(GenomicRanges) # tests ol <- findOverlaps(query)
- A: Annotate SNV (or breakpoint) data
package GenomicRanges, but I prefer the command line tool bedtools
- A: Locating Indels In Gene
if you want a command-line solution or the GenomicRanges package from Bioconductor to do genomic overlaps
- A: How to creat a GRangesList object from TCGA CNV data
alterations for your cancer of interest, as a GenomicRanges object: https://www.biostars.org/p/311199/#311444
- A: R: Readaligned Only Junction Reads From Bam-File
take a look at readGappedAlignments in the GenomicRanges package. Once you read in your sequences, call
- A: How to extract fasta sequences from assembled transcripts generated by Stringtie
use R GRanges object and getSeq function from GenomicRanges and BSgenome packages to retrive sequences.
- A: Code for looking for overlaps
analyze genomic ranges in R (great choice!) - GenomicRanges from Bioconductor is all what you need. And
org/biocLite.R") biocLite("GenomicRanges") library("GenomicRanges")
- A: How to get the total number of reads overlapping RNA regions in R
In the GenomicRanges package, there is a findOverlaps() function.
- A: Getting Counts From Samtools
user, take a look at the GenomicFeatures and GenomicRanges packages
- A: How to get the number of exon within genes from GRCH37 reference
awk for exon 4. Load into R 5. reduce() in GenomicRanges 6. done
- Common genomic intervals in R
GenomicRanges package
- Bioconductor - Error : Function found is not S4 generic
- A: comparing 2 long BED files in R in an efficient way
library(GenomicRanges) a = c(a, subsetByOverlaps(b
- Trouble with GenomicFeatures Package
- A: Generic Bioconductor Object To Integrate Different Genomics Data Formats/Platform
The GenomicRanges packages is the right place to look. There is also some effort to make the SummarizedExperiment class in the GenomicRanges package to be this "universal container of assay data"
- A: Loading Large Bed Files Into Bioconductor
BioConductor" what specifically have you tried? The GenomicRanges and rtracklayer packages make this straightforward
Examining peaks could be as simple as: library(GenomicRanges) library(rtracklayer) reads <- import
- How to resize a GenomicRange objects centered on a DNA motif
Hi there, I have a bed file from MACS, this? I know one can use resize function for GenomicRanges object, but it can only center on the middle
by tangming2005
- A: How to extend bed intervals to a uniform size?
R using the rtracklayer package and use the GenomicRanges package to resize your ranges to a fixed width:
bed("userFile.bed") library("GenomicRanges") resizeRanges <- resize(userRanges
by James Ashmore
- A: How To Find Number Of Mapped Reads For A Series Of Windows
approaches using different software packages (GenomicRanges, bedops, ....)
by Sean Davis
- A: Small RNA sequence, how to calculate read counts? What's the criterion?
Is there featureCounts from the subread package, and GenomicRanges summarizeOverlaps().
by Sean Davis
- A: bigWig to bed for regions above/below threshold
the bigwig file. 2. Use slice() from the GenomicRanges bioconductor package on the results from #1
by Sean Davis
- A: How to retrieve the genes associated to a VCF file?
rs876643) and position? If so you should use the GenomicRanges package from Bioconductor to map the position
by paolo002
- A: Retrieve all the differentially methylated cpg's in a region
GenomicRanges is perfectly happy with ranges sharing the same start/end coordinate: gr <-
by Devon Ryan
- A: Quick And Easy Way To Find Whether A Locus Is Falling In Which Region Of Exon Or
then use the various nearest() functions from GenomicRanges/IRanges to get an index of the nearest feature
by dpryan79
- A: What Are The Best Qc Tools For Exome Seq, Rna Seq, Single/Paired End, Etc?
capabilities. In R, the Rsamtools, ShortRead, and GenomicRanges packages are of interest. In python, look at
by Sean Davis
- A: convert bed file to GRangeList object
= import("file.bed") library(GenomicRanges) gr_list = split(gr_obj, gr_obj$name) More
by igor
- Annotating a GenomicRanges object in R
Hi, I have a GenomicRange object that I want to annotate. It's
by ElCascador
- How do I find the set of a genomicrange? (same as union(gr, gr))
library(GenomicRanges) gr0 <-
by endrebak852
- A: Mapping Protein Domains To Exons
information as a GRanges object (from the GenomicRanges bioconductor package). Use findOverlaps() method
by Sean Davis
- A: Manipulating Wig File In R
I would highly recommend using GenomicRanges: http://www.bioconductor.org/packages/2.11/bioc/html/GenomicRanges
by Sebastian Kurscheid
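Several of the answers listed above point to nearest() and distanceToNearest() for questions of the kind "which feature is closest to my position". A minimal sketch of that pattern follows; the SNV and exon coordinates are made up, and toy_snvs and toy_exons are illustrative names rather than anything taken from the linked posts.

```r
library(GenomicRanges)

# Hypothetical single-base variants and exon intervals
toy_snvs  <- GRanges("chr1", IRanges(start = c(100, 500), width = 1))
toy_exons <- GRanges("chr1", IRanges(start = c(150, 400), end = c(200, 450)))

# Index of the closest exon for each variant
closest_exon <- nearest(toy_snvs, toy_exons)

# Same pairing as a Hits object, with the gap size in a "distance" column
to_nearest <- distanceToNearest(toy_snvs, toy_exons)
mcols(to_nearest)$distance

# Attach the result to the variants
toy_snvs$nearest_exon <- closest_exon
toy_snvs
```

The distance is the number of bases lying between the two ranges, so overlapping or directly adjacent ranges have a distance of 0.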