GEOquery 包使用指南
- 1 Overview of GEO
- 1.1 Platforms
- 1.2 Samples
- 1.3 Series
- 1.4 Datasets
- 2 Getting Started using GEOquery
- 3 GEOquery Data Structures
- 3.1 The GDS, GSM, and GPL classes
- 3.2 The GSE class
- 4 Converting to BioConductor ExpressionSets and limma MALists
- 4.1 Getting GSE Series Matrix files as an ExpressionSet
- 4.2 Converting GDS to an ExpressionSet
- 4.3 Converting GDS to an MAList
- 4.4 Converting GSE to an ExpressionSet
- 5 Accessing Raw Data from GEO
- 6 Use Cases
- 6.1 Getting all Series Records for a Given Platform
# 1 GEO 介绍
GEO(The NCBI Gene Expression Omnibus)是NCBI专门储存高通量测序的库。如基于芯片数据(mRNA、DNA、蛋白丰度),蛋白质质谱数据和高通量测序数据。
GEO数据主要有4种基本类型。Sample, Platform 和 Series是由作者上传的数据,dataset是由GEO官方从做和提交的数据整理出来的。
## 1.1 Platforms
GEO 号:GPLxxx。
芯片的组成信息,例如 cDNAs, oligonucleotide probesets, ORFs, antibodies 。或者其它定量检测平台信息,例如SAGE tags, peptides。
## 1.2 Samples
GEO 号: GSMxxx
描述单个样本信息,处理步骤、处理条件以及实验测得的结果。一个样本可能属于多个研究(Series)。
## 1.3 Series
GEO 号:GSExxx
涉及同一个研究的记录,包括处理过的数据、总结和分析;信息可以从GSEMatrix文件解析快速得到。
##1.4 Datasets
GEO 号:GDSxxx
一套经过整理的GEO 数据集。每套数据都是可以进行生物学或者统计学上比较的样本,是GEO自带工具进行数据分析和展示的基础。一个 GDS数据集来自同一个平台,数据分析和标准化都具有一致性。
# 2 GEOquery快速上手
#GEOquery包安装:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")
#使用
library("GEOquery")
gsm <- getGEO(filename=system.file("extdata/GSM11805.txt.gz",package="GEOquery"))
getGEO
函数可以从GEO官网获取数据或者将固定格式数据解析为R格式的数据。
# 3 GEOquery 数据结构
GEOquery 数据结构大致分为两类。第一种是GDS, GPL和GSM,他们的操作和数据类型差不多;第二种是GSE,GSE数据是由GSM和GPL整合而成。
## 3.1 GDS, GSM 和 GPL
这些数据类组成
- 一个元数据头(与the SOFT 格式的头差不多)
- 一个GEODataTable(数据)。
可以使用show()查看这些数据类。
- 查看 gsm metadata
head(Meta(gsm))
## $channel_count
## [1] "1"
##
## $comment
## [1] "Raw data provided as supplementary file"
##
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
- 查看 gsm data
Table(gsm)[1:5,]
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
- 查看GSM数据列的含义
Columns(gsm)
## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
head(Meta(gds))
Table(gds)[1:5,]
Columns(gds)
Columns(gds)[1,]
##3.2 GSE类
GSE类组成:
- 一个元数据头
- GPLList,列表,包含GPL信息
- GSMList,列表,包含GSM信息
-
- 包含GEODataTable
gse <- getGEO("GSE781",GSEMatrix=FALSE)
head(Meta(gse))
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "[email protected]"
##
## $contact_fax
## [1] "617-414-1646"
# names of all the GSM objects contained in the GSE
names(GSMList(gse))
## [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
## [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"
# and get the first GSM object on the list
GSMList(gse)[[1]]
## An object of class "GSM"
## channel_count
## [1] "1"
## comment
## [1] "Raw data provided as supplementary file"
## contact_address
## [1] "715 Albany Street, E613B"
## contact_city
## [1] "Boston"
## contact_country
## [1] "USA"
## contact_department
## [1] "Genetics and Genomics"
## contact_email
## [1] "[email protected]"
## contact_fax
## [1] "617-414-1646"
## contact_institute
## [1] "Boston University School of Medicine"
## contact_name
## [1] "Marc,E.,Lenburg"
## contact_phone
## [1] "617-414-1375"
## contact_state
## [1] "MA"
## contact_web_link
## [1] "http://gg.bu.edu"
## contact_zip/postal_code
## [1] "02130"
## data_row_count
## [1] "22283"
## description
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
## [2] "Keywords = kidney"
## [3] "Keywords = renal"
## [4] "Keywords = RCC"
## [5] "Keywords = carcinoma"
## [6] "Keywords = cancer"
## [7] "Lot batch = 2004638"
## geo_accession
## [1] "GSM11805"
## last_update_date
## [1] "May 28 2005"
## molecule_ch1
## [1] "total RNA"
## organism_ch1
## [1] "Homo sapiens"
## platform_id
## [1] "GPL96"
## series_id
## [1] "GSE781"
## source_name_ch1
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status
## [1] "Public on Nov 25 2003"
## submission_date
## [1] "Oct 20 2003"
## supplementary_file
## [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM11nnn/GSM11805/GSM11805.CEL.gz"
## title
## [1] "N035 Normal Human Kidney U133A"
## type
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
## 22278 more rows ...
# and the names of the GPLs represented
names(GPLList(gse))
## [1] "GPL96" "GPL97"
# 4 Converting to BioConductor ExpressionSets and limma MALists
GEO datasets与limma 数据结构MAList 和Biobase数据结构 ExpressionSet比较相似。可以相互转换:
- GDS2MA 和 GDS2eSet
## 4.1 Getting GSE Series Matrix files as an ExpressionSet
GEO Series是一套实验数据的集合,有SOFT,MINiML格式文件,以及一个 Series Matrix File(s)文本。Series Matrix File是tab-delimited text,getGEO
函数可以解析,解析结果就是ExpressionSets。
# Note that GSEMatrix=TRUE is the default
# GSEMatrix=TRUE下载GSE2553_series_matrix.txt.gz和GPL1977.soft文件
# GSEMatrix=F下载GSE2553.soft.gz
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
show(gse2553)
show(gse2553)
## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
## varLabels: title geo_accession ... data_row_count (30 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 12600 (12600 total)
## fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 16230383
## Annotation: GPL1977
phenoData(gse2553[[1]]))
pData(gse2553[[1]]))
featureData(gse2553[[1]]))
fData(gse2553[[1]]))
experimentData(gse2553[[1]]))
一个GSE下如果存在多个GPL测序,筛选特定的GPL数据;GSE会有多个列表gset[[idx]]
show(pData(gse2553[[1]])[1:5,c(1,6,8)])
## title type
## GSM48681 Patient sample ST18, Dermatofibrosarcoma RNA
## GSM48682 Patient sample ST410, Ewing Sarcoma RNA
## GSM48683 Patient sample ST130, Sarcoma, NOS RNA
## GSM48684 Patient sample ST293, Malignant Peripheral Nerve Sheath Tumor RNA
## GSM48685 Patient sample ST367, Liposarcoma RNA
## source_name_ch1
## GSM48681 Dermatofibrosarcoma
## GSM48682 Ewing Sarcoma
## GSM48683 Sarcoma, NOS
## GSM48684 Malignant Peripheral Nerve Sheath Tumor
## GSM48685 Liposarcoma
##4.2 Converting GDS to an ExpressionSet
eset <- GDS2eSet(gds,do.log2=TRUE)
eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22645 features, 17 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
## varLabels: sample disease.state individual description
## varMetadata: labelDescription
## featureData
## featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
## fvarLabels: ID Gene title ... GO:Component ID (21 total)
## fvarMetadata: Column labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 14641932
## Annotation:
pData(eset)[,1:3]
## sample disease.state individual
## GSM11815 GSM11815 RCC 035
## GSM11832 GSM11832 RCC 023
## GSM12069 GSM12069 RCC 001
## GSM12083 GSM12083 RCC 005
## GSM12101 GSM12101 RCC 011
## GSM12106 GSM12106 RCC 032
## GSM12274 GSM12274 RCC 2
## GSM12299 GSM12299 RCC 3
## GSM12412 GSM12412 RCC 4
## GSM11810 GSM11810 normal 035
## GSM11827 GSM11827 normal 023
## GSM12078 GSM12078 normal 001
## GSM12099 GSM12099 normal 005
## GSM12269 GSM12269 normal 1
## GSM12287 GSM12287 normal 2
## GSM12301 GSM12301 normal 3
## GSM12448 GSM12448 normal 4
##4.3 Converting GDS to an MAList
ExpressionSet不包含注释信息,getGEO
可以帮助我们获取。
#get the platform from the GDS metadata
Meta(gds)$platform
## [1] "GPL97"
#So use this information in a call to getGEO
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))
与ExpressionSet不同,the limma MAList
包含基因注释信息。上面的gpl包含注释信息。
MAList不仅包含数据,还包含样本信息,和注释信息。
MA <- GDS2MA(gds,GPL=gpl)
class(MA)##
[1] "MAList"
## attr(,"package")
## [1] "limma"
4.4 Converting GSE to an ExpressionSet
GSE转换成ExpressionSet
gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
head(gsmplatforms)
## $GSM11805
## [1] "GPL96"
##
## $GSM11810
## [1] "GPL97"
##
## $GSM11814
## [1] "GPL96"
##
## $GSM11815
## [1] "GPL97"
##
## $GSM11823
## [1] "GPL96"
##
## $GSM11827
## [1] "GPL97"
这个GSE包含两个GPLs,GPL96 和 GPL97。
筛选使用GPL96 的GSM。
gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
length(gsmlist)
## [1] 17
Table(gsmlist[[1]])[1:5,]
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
the single measurement for each array is called the
VALUE
column
# and get the column descriptions
Columns(gsmlist[[1]])[1:5,]
获取表达矩阵:
# get the probeset ordering
probesets <- Table(GPLList(gse)[[1]])$ID
colnames(Table(GPLList(gse)[[1]]))
[1] "ID" "GB_ACC" "SPOT_ID"
[4] "Species Scientific Name" "Annotation Date" "Sequence Type"
[7] "Sequence Source" "Target Description" "Representative Public ID"
[10] "Gene Title" "Gene Symbol" "ENTREZ_GENE_ID"
[13] "RefSeq Transcript ID" "Gene Ontology Biological Process" "Gene Ontology Cellular Component"
[16] "Gene Ontology Molecular Function"
# make the data matrix from the VALUE columns from each GSM
# being careful to match the order of the probesets in the platform
# with those in the GSMs
data.matrix <- do.call('cbind',lapply(gsmlist,function(x)
{tab <- Table(x)
mymatch <- match(probesets,tab$ID_REF)
return(tab$VALUE[mymatch])
}))
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,]
## GSM11805 GSM11814 GSM11823 GSM11830 GSM12067 GSM12075 GSM12079
## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376 11.222795 11.469845
## [2,] 5.749534 7.908092 7.093814 7.514122 7.901470 6.407693 5.165912
## [3,] 7.066089 7.750205 7.244126 7.962896 7.337176 6.569856 7.477354
## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568 12.529870 12.240046
## [5,] 6.195741 6.061776 6.565293 6.583459 6.877744 6.652486 3.981853
## GSM12098 GSM12100 GSM12105 GSM12268 GSM12270 GSM12283 GSM12298
## [1,] 10.823367 10.835971 10.810893 11.062653 10.323055 11.181028 11.566387
## [2,] 6.556123 8.207014 6.816344 6.563768 7.353147 5.770829 6.912889
## [3,] 7.708739 7.428779 7.754888 7.126188 8.742815 7.339850 7.602142
## [4,] 12.336534 11.762839 11.237509 12.412490 11.213408 12.678380 12.232901
## [5,] 5.501439 6.247928 6.017922 6.525129 6.683696 5.918863 5.837943
## GSM12300 GSM12399 GSM12444
## [1,] 11.078151 11.535178 11.105450
## [2,] 4.812498 7.471675 7.488644
## [3,] 7.383704 7.432959 7.381110
## [4,] 12.090939 11.421802 12.172834
## [5,] 6.281698 5.419539 5.469235
构造ExpressionSet
require(Biobase)
# go through the necessary steps to make a compliant ExpressionSet
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(gsmlist)
pdata <- data.frame(samples=names(gsmlist))
rownames(pdata) <- names(gsmlist)
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 17 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
## varLabels: samples
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
5 Accessing Raw Data from GEO
- getGEOSuppFiles函数
6 Use Cases
##6.1 Getting all Series Records for a Given Platform
- 获取GPL相关的所有GSE 和 GSM accessions
gpl97 <- getGEO('GPL97')
Meta(gpl97)$title
## [1] "[HG-U133B] Affymetrix Human Genome U133B Array"
head(Meta(gpl97)$series_id)
## [1] "GSE362" "GSE473" "GSE620" "GSE674" "GSE781" "GSE907"
length(Meta(gpl97)$series_id)
## [1] 165
head(Meta(gpl97)$sample_id)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930" "GSM3932"
length(Meta(gpl97)$sample_id)
## [1] 7917
gsmids <- Meta(gpl97)$sample_id
gsmlist <- sapply(gsmids[1:5],getGEO)
names(gsmlist)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930"
英文版原文见:[Using the GEOquery Package