GEOquery 包使用指南

1 Overview of GEO
- 1.1 Platforms
- 1.2 Samples
- 1.3 Series
- 1.4 Datasets
2 Getting Started using GEOquery
3 GEOquery Data Structures
- 3.1 The GDS, GSM, and GPL classes
- 3.2 The GSE class
4 Converting to BioConductor ExpressionSets and limma MALists
- 4.1 Getting GSE Series Matrix files as an ExpressionSet
- 4.2 Converting GDS to an ExpressionSet
- 4.3 Converting GDS to an MAList
- 4.4 Converting GSE to an ExpressionSet
5 Accessing Raw Data from GEO
6 Use Cases
- 6.1 Getting all Series Records for a Given Platform

# 1 GEO 介绍

GEO（The NCBI Gene Expression Omnibus）是NCBI专门储存高通量测序的库。如基于芯片数据（mRNA、DNA、蛋白丰度），蛋白质质谱数据和高通量测序数据。
GEO数据主要有4种基本类型。Sample, Platform 和 Series是由作者上传的数据，dataset是由GEO官方从做和提交的数据整理出来的。

## 1.1 Platforms
GEO 号：GPLxxx。
芯片的组成信息，例如 cDNAs, oligonucleotide probesets, ORFs, antibodies 。或者其它定量检测平台信息，例如SAGE tags, peptides。

## 1.2 Samples
GEO 号: GSMxxx

描述单个样本信息，处理步骤、处理条件以及实验测得的结果。一个样本可能属于多个研究（Series）。

## 1.3 Series
GEO 号：GSExxx

涉及同一个研究的记录，包括处理过的数据、总结和分析；信息可以从GSEMatrix文件解析快速得到。

##1.4 Datasets
GEO 号：GDSxxx

一套经过整理的GEO 数据集。每套数据都是可以进行生物学或者统计学上比较的样本，是GEO自带工具进行数据分析和展示的基础。一个 GDS数据集来自同一个平台，数据分析和标准化都具有一致性。

# 2 GEOquery快速上手

#GEOquery包安装：
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")

#使用
library("GEOquery")
gsm <- getGEO(filename=system.file("extdata/GSM11805.txt.gz",package="GEOquery"))

getGEO函数可以从GEO官网获取数据或者将固定格式数据解析为R格式的数据。

# 3 GEOquery 数据结构

GEOquery 数据结构大致分为两类。第一种是GDS, GPL和GSM,他们的操作和数据类型差不多；第二种是GSE，GSE数据是由GSM和GPL整合而成。

## 3.1 GDS, GSM 和 GPL

这些数据类组成

一个元数据头(与the SOFT 格式的头差不多)
一个GEODataTable(数据)。

可以使用show()查看这些数据类。

查看 gsm metadata

head(Meta(gsm))

## $channel_count
## [1] "1"
## 
## $comment
## [1] "Raw data provided as supplementary file"
## 
## $contact_address
## [1] "715 Albany Street, E613B"
## 
## $contact_city
## [1] "Boston"
## 
## $contact_country
## [1] "USA"
## 
## $contact_department
## [1] "Genetics and Genomics"

查看 gsm data

Table(gsm)[1:5,]
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P

查看GSM数据列的含义

Columns(gsm)
##     Column
## 1   ID_REF
## 2    VALUE
## 3 ABS_CALL
##                                                                  Description
## 1                                                                           
## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065

head(Meta(gds))
Table(gds)[1:5,]
Columns(gds)
Columns(gds)[1,]

##3.2 GSE类

GSE类组成：

一个元数据头
GPLList，列表，包含GPL信息
GSMList，列表，包含GSM信息
- 包含GEODataTable

gse <- getGEO("GSE781",GSEMatrix=FALSE)

head(Meta(gse))
## $contact_address
## [1] "715 Albany Street, E613B"
## 
## $contact_city
## [1] "Boston"
## 
## $contact_country
## [1] "USA"
## 
## $contact_department
## [1] "Genetics and Genomics"
## 
## $contact_email
## [1] "[email protected]"
## 
## $contact_fax
## [1] "617-414-1646"

# names of all the GSM objects contained in the GSE
names(GSMList(gse))
##  [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
##  [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"

# and get the first GSM object on the list
GSMList(gse)[[1]]
## An object of class "GSM"
## channel_count 
## [1] "1"
## comment 
## [1] "Raw data provided as supplementary file"
## contact_address 
## [1] "715 Albany Street, E613B"
## contact_city 
## [1] "Boston"
## contact_country 
## [1] "USA"
## contact_department 
## [1] "Genetics and Genomics"
## contact_email 
## [1] "[email protected]"
## contact_fax 
## [1] "617-414-1646"
## contact_institute 
## [1] "Boston University School of Medicine"
## contact_name 
## [1] "Marc,E.,Lenburg"
## contact_phone 
## [1] "617-414-1375"
## contact_state 
## [1] "MA"
## contact_web_link 
## [1] "http://gg.bu.edu"
## contact_zip/postal_code 
## [1] "02130"
## data_row_count 
## [1] "22283"
## description 
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
## [2] "Keywords = kidney"                                                                                                                                                                                                                                                                                                                                                                           
## [3] "Keywords = renal"                                                                                                                                                                                                                                                                                                                                                                            
## [4] "Keywords = RCC"                                                                                                                                                                                                                                                                                                                                                                              
## [5] "Keywords = carcinoma"                                                                                                                                                                                                                                                                                                                                                                        
## [6] "Keywords = cancer"                                                                                                                                                                                                                                                                                                                                                                           
## [7] "Lot batch = 2004638"                                                                                                                                                                                                                                                                                                                                                                         
## geo_accession 
## [1] "GSM11805"
## last_update_date 
## [1] "May 28 2005"
## molecule_ch1 
## [1] "total RNA"
## organism_ch1 
## [1] "Homo sapiens"
## platform_id 
## [1] "GPL96"
## series_id 
## [1] "GSE781"
## source_name_ch1 
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status 
## [1] "Public on Nov 25 2003"
## submission_date 
## [1] "Oct 20 2003"
## supplementary_file 
## [1] "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/supplementary/samples/GSM11nnn/GSM11805/GSM11805.CEL.gz"
## title 
## [1] "N035 Normal Human Kidney U133A"
## type 
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
##     Column
## 1   ID_REF
## 2    VALUE
## 3 ABS_CALL
##                                                                  Description
## 1                                                                           
## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P
## 22278 more rows ...

# and the names of the GPLs represented
names(GPLList(gse))
## [1] "GPL96" "GPL97"

# 4 Converting to BioConductor ExpressionSets and limma MALists

GEO datasets与limma 数据结构MAList 和Biobase数据结构 ExpressionSet比较相似。可以相互转换：

GDS2MA 和 GDS2eSet

## 4.1 Getting GSE Series Matrix files as an ExpressionSet
GEO Series是一套实验数据的集合，有SOFT，MINiML格式文件，以及一个 Series Matrix File(s)文本。Series Matrix File是tab-delimited text，getGEO函数可以解析，解析结果就是ExpressionSets。

# Note that GSEMatrix=TRUE is the default
# GSEMatrix=TRUE下载GSE2553_series_matrix.txt.gz和GPL1977.soft文件
# GSEMatrix=F下载GSE2553.soft.gz
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
show(gse2553)

show(gse2553)
## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
##   varLabels: title geo_accession ... data_row_count (30 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1 2 ... 12600 (12600 total)
##   fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 16230383 
## Annotation: GPL1977

phenoData(gse2553[[1]]))
pData(gse2553[[1]]))
featureData(gse2553[[1]]))
fData(gse2553[[1]]))
experimentData(gse2553[[1]]))

一个GSE下如果存在多个GPL测序，筛选特定的GPL数据；GSE会有多个列表gset[[idx]]

show(pData(gse2553[[1]])[1:5,c(1,6,8)])
##                                                                  title type
## GSM48681                      Patient sample ST18, Dermatofibrosarcoma  RNA
## GSM48682                           Patient sample ST410, Ewing Sarcoma  RNA
## GSM48683                            Patient sample ST130, Sarcoma, NOS  RNA
## GSM48684 Patient sample ST293, Malignant Peripheral Nerve Sheath Tumor  RNA
## GSM48685                             Patient sample ST367, Liposarcoma  RNA
##                                  source_name_ch1
## GSM48681                     Dermatofibrosarcoma
## GSM48682                           Ewing Sarcoma
## GSM48683                            Sarcoma, NOS
## GSM48684 Malignant Peripheral Nerve Sheath Tumor
## GSM48685                             Liposarcoma

##4.2 Converting GDS to an ExpressionSet

eset <- GDS2eSet(gds,do.log2=TRUE)

eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22645 features, 17 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
##   varLabels: sample disease.state individual description
##   varMetadata: labelDescription
## featureData
##   featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
##   fvarLabels: ID Gene title ... GO:Component ID (21 total)
##   fvarMetadata: Column labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 14641932 
## Annotation:

pData(eset)[,1:3]
##            sample disease.state individual
## GSM11815 GSM11815           RCC        035
## GSM11832 GSM11832           RCC        023
## GSM12069 GSM12069           RCC        001
## GSM12083 GSM12083           RCC        005
## GSM12101 GSM12101           RCC        011
## GSM12106 GSM12106           RCC        032
## GSM12274 GSM12274           RCC          2
## GSM12299 GSM12299           RCC          3
## GSM12412 GSM12412           RCC          4
## GSM11810 GSM11810        normal        035
## GSM11827 GSM11827        normal        023
## GSM12078 GSM12078        normal        001
## GSM12099 GSM12099        normal        005
## GSM12269 GSM12269        normal          1
## GSM12287 GSM12287        normal          2
## GSM12301 GSM12301        normal          3
## GSM12448 GSM12448        normal          4

##4.3 Converting GDS to an MAList
ExpressionSet不包含注释信息，getGEO可以帮助我们获取。

#get the platform from the GDS metadata
Meta(gds)$platform
## [1] "GPL97"

#So use this information in a call to getGEO
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))

与ExpressionSet不同，the limma MAList包含基因注释信息。上面的gpl包含注释信息。

MAList不仅包含数据，还包含样本信息，和注释信息。

MA <- GDS2MA(gds,GPL=gpl)
class(MA)## 
[1] "MAList"
## attr(,"package")
## [1] "limma"

4.4 Converting GSE to an ExpressionSet
GSE转换成ExpressionSet

gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
head(gsmplatforms)

## $GSM11805
## [1] "GPL96"
## 
## $GSM11810
## [1] "GPL97"
## 
## $GSM11814
## [1] "GPL96"
## 
## $GSM11815
## [1] "GPL97"
## 
## $GSM11823
## [1] "GPL96"
## 
## $GSM11827
## [1] "GPL97"

这个GSE包含两个GPLs，GPL96 和 GPL97。

筛选使用GPL96 的GSM。

gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
length(gsmlist)
## [1] 17

Table(gsmlist[[1]])[1:5,]
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P

the single measurement for each array is called the VALUE column

# and get the column descriptions
Columns(gsmlist[[1]])[1:5,]

获取表达矩阵：

# get the probeset ordering
probesets <- Table(GPLList(gse)[[1]])$ID
colnames(Table(GPLList(gse)[[1]]))
[1] "ID"                               "GB_ACC"                           "SPOT_ID"                         
[4] "Species Scientific Name"          "Annotation Date"                  "Sequence Type"                   
[7] "Sequence Source"                  "Target Description"               "Representative Public ID"        
[10] "Gene Title"                       "Gene Symbol"                      "ENTREZ_GENE_ID"                  
[13] "RefSeq Transcript ID"             "Gene Ontology Biological Process" "Gene Ontology Cellular Component"
[16] "Gene Ontology Molecular Function"

# make the data matrix from the VALUE columns from each GSM
# being careful to match the order of the probesets in the platform
# with those in the GSMs
data.matrix <- do.call('cbind',lapply(gsmlist,function(x) 
{tab <- Table(x)
mymatch <- match(probesets,tab$ID_REF)
return(tab$VALUE[mymatch])
}))
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,]

##       GSM11805  GSM11814  GSM11823  GSM11830  GSM12067  GSM12075  GSM12079
## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376 11.222795 11.469845
## [2,]  5.749534  7.908092  7.093814  7.514122  7.901470  6.407693  5.165912
## [3,]  7.066089  7.750205  7.244126  7.962896  7.337176  6.569856  7.477354
## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568 12.529870 12.240046
## [5,]  6.195741  6.061776  6.565293  6.583459  6.877744  6.652486  3.981853
##       GSM12098  GSM12100  GSM12105  GSM12268  GSM12270  GSM12283  GSM12298
## [1,] 10.823367 10.835971 10.810893 11.062653 10.323055 11.181028 11.566387
## [2,]  6.556123  8.207014  6.816344  6.563768  7.353147  5.770829  6.912889
## [3,]  7.708739  7.428779  7.754888  7.126188  8.742815  7.339850  7.602142
## [4,] 12.336534 11.762839 11.237509 12.412490 11.213408 12.678380 12.232901
## [5,]  5.501439  6.247928  6.017922  6.525129  6.683696  5.918863  5.837943
##       GSM12300  GSM12399  GSM12444
## [1,] 11.078151 11.535178 11.105450
## [2,]  4.812498  7.471675  7.488644
## [3,]  7.383704  7.432959  7.381110
## [4,] 12.090939 11.421802 12.172834
## [5,]  6.281698  5.419539  5.469235

构造ExpressionSet

require(Biobase)
# go through the necessary steps to make a compliant ExpressionSet
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(gsmlist)
pdata <- data.frame(samples=names(gsmlist))
rownames(pdata) <- names(gsmlist)
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 17 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
##   varLabels: samples
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

5 Accessing Raw Data from GEO

getGEOSuppFiles函数

6 Use Cases

##6.1 Getting all Series Records for a Given Platform

获取GPL相关的所有GSE 和 GSM accessions

gpl97 <- getGEO('GPL97')
Meta(gpl97)$title
## [1] "[HG-U133B] Affymetrix Human Genome U133B Array"
head(Meta(gpl97)$series_id)
## [1] "GSE362" "GSE473" "GSE620" "GSE674" "GSE781" "GSE907"
length(Meta(gpl97)$series_id)
## [1] 165
head(Meta(gpl97)$sample_id)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930" "GSM3932"
length(Meta(gpl97)$sample_id)
## [1] 7917

gsmids <- Meta(gpl97)$sample_id
gsmlist <- sapply(gsmids[1:5],getGEO)
names(gsmlist)
## [1] "GSM3922" "GSM3924" "GSM3926" "GSM3928" "GSM3930"

英文版原文见：[Using the GEOquery Package

基因芯片数据分析-1: 使用GEOquery 包从GEO获取数据