ID转换那点事

在进行测序数据下游分析的时候常常需要用到不同的数据库,而这些数据库的分析的输入文件经常是各有区别,因此我们常常需要将各类ID(ensamble ID、Entrez ID、gene symbol等)进行转换。这里我们将简单介绍使用编程进行ID转换的方法。

芯片ID注释

芯片注释主要有两种方法:使用R包进行注释或在数据库中下载芯片平台信息然后手动注释,这两种方法其实基本都是相同的,都是利用芯片平台信息进行注释,区别只在于一个是利用R包API而另一个则是自己写代码。

载入数据:
> library('GEOquery')
> gse20986 <- getGEO("GSE20986", destdir=".")  # 下载处理好的表达矩阵
> exprmt <- exprs(gse20986[[1]]);head(exprmt);class(exprmt)
          GSM524662 GSM524663 GSM524664 GSM524665 GSM524666 GSM524667 GSM524668 GSM524669 GSM524670 GSM524671 GSM524672 GSM524673
1007_s_at  2.867331  3.941245  4.003509  3.230621  3.546867  2.507474  3.040758  2.456892  2.404267  3.568920  3.307409  3.225964
1053_at   10.503022  8.320770  9.523183 10.645904  8.140238 10.228986 11.019461 10.447158 10.799832  9.765222 10.069229 10.089496
117_at     2.702517  2.749144  3.226281  2.788124  4.664960  2.722543  2.745330  4.500872  2.702614  3.015139  2.768683  2.853809
121_at     3.052316  3.057630  3.056844  3.697705  3.057630  3.057630  3.057630  3.057630  3.072568  3.057630  3.057630  3.057630
1255_g_at  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998
1294_at    5.360226  4.465648  5.914568  5.558741  5.241448  5.037804  4.784192  4.978118  4.713441  4.303196  4.452989  4.766019
[1] "matrix"
# 将矩阵转换为数据框方便后面注释操作,并将行名称装换为ID
> exprmt <- as.data.frame(exprmt); exprmt$ID <- rownames(exprmt); head(exprmt)
          GSM524662 GSM524663 GSM524664 GSM524665 GSM524666 GSM524667 GSM524668 GSM524669 GSM524670 GSM524671 GSM524672 GSM524673        ID
1007_s_at  2.867331  3.941245  4.003509  3.230621  3.546867  2.507474  3.040758  2.456892  2.404267  3.568920  3.307409  3.225964 1007_s_at
1053_at   10.503022  8.320770  9.523183 10.645904  8.140238 10.228986 11.019461 10.447158 10.799832  9.765222 10.069229 10.089496   1053_at
117_at     2.702517  2.749144  3.226281  2.788124  4.664960  2.722543  2.745330  4.500872  2.702614  3.015139  2.768683  2.853809    117_at
121_at     3.052316  3.057630  3.056844  3.697705  3.057630  3.057630  3.057630  3.057630  3.072568  3.057630  3.057630  3.057630    121_at
1255_g_at  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998  2.278998 1255_g_at
1294_at    5.360226  4.465648  5.914568  5.558741  5.241448  5.037804  4.784192  4.978118  4.713441  4.303196  4.452989  4.766019   1294_at
使用annotate包注释芯片

在数据集GSE20986页面可以看到芯片平台信息:

芯片平台信息

当然也可以直接通过代码获得芯片平台信息:

> gse20986
$GSE20986_series_matrix.txt.gz
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54675 features, 12 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM524662 GSM524663 ... GSM524673 (12 total)
  varLabels: title geo_accession ... tissue:ch1 (34 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL570 

知道了芯片平台信息,接下来我们就可以使用annotate包进行注释了。
先查看一下有没有重复的芯片ID:

> dupID <- exprmt[duplicated(exprmt$ID),]; head(dupID); dim(dupID)
 [1] GSM524662 GSM524663 GSM524664 GSM524665 GSM524666 GSM524667 GSM524668 GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID       
<0 行> (或0-长度的row.names)
[1]  0 13

没有重复,直接进行注释

> biocLite("hgu133plus2.db")
> library(hgu133plus2.db)
> library(annotate)

# 利用getSYMBOL函数进行注释,这里只选择后几列额,方便观察
> exprmt$symbol <- getSYMBOL(exprmt$ID, 'hgu133plus2');exprmt <- exprmt[,8:14]; head(exprmt)
          GSM524669 GSM524670 GSM524671 GSM524672 GSM524673        ID symbol
1007_s_at  2.456892  2.404267  3.568920  3.307409  3.225964 1007_s_at   
1053_at   10.447158 10.799832  9.765222 10.069229 10.089496   1053_at   RFC2
117_at     4.500872  2.702614  3.015139  2.768683  2.853809    117_at  HSPA6
121_at     3.057630  3.072568  3.057630  3.057630  3.057630    121_at   PAX8
1255_g_at  2.278998  2.278998  2.278998  2.278998  2.278998 1255_g_at GUCA1A
1294_at    4.978118  4.713441  4.303196  4.452989  4.766019   1294_at   

可以看到有一些芯片ID是没有注释上的,接下来对注释情况进行检查,主要包括三个方面:(1)未注释上--NA;(2)未注释上--空字符串;(3)注释上了,但有重复。

# 查看一共多少行
> dim(exprmt)
[1] 54675     7
# symbol非NA的行
> exprmt.nona <- exprmt[!is.na(exprmt$symbol),]; dim(exprmt.nona)
[1] 41941     7
> dim(exprmt[is.na(exprmt$symbol),])
[1] 12734     7
# symbol重复的行数
> symbol.dup <- exprmt.nona[duplicated(exprmt.nona$symbol),]; head(symbol.dup)
             GSM524669 GSM524670 GSM524671 GSM524672 GSM524673           ID   symbol
1552264_a_at  8.016109  7.672156  7.661139  7.835451  7.225329 1552264_a_at    MAPK1
1552272_a_at  2.279247  2.279247  2.279247  2.279247  2.279247 1552272_a_at    PRR22
1552275_s_at  4.862782  5.539661  4.672476  4.709687  3.931607 1552275_s_at      PXK
1552279_a_at  2.436048  2.731219  2.436048  2.436048  2.464805 1552279_a_at  SLC46A1
1552289_a_at  2.278998  2.278998  2.278998  2.278998  2.278998 1552289_a_at    CILP2
1552303_a_at  4.421032  4.601013  4.614977  4.628660  4.628942 1552303_a_at TMEM106A
> dim(symbol.dup)
[1] 21749     7
> exprmt.nona[exprmt.nona$symbol=='PXK',]
             GSM524669 GSM524670 GSM524671 GSM524672 GSM524673           ID symbol
1552274_at    5.290347  5.539950  5.081292  5.016670  5.032546   1552274_at    PXK
1552275_s_at  4.862782  5.539661  4.672476  4.709687  3.931607 1552275_s_at    PXK
225796_at     7.140168  7.106552  7.232615  6.650670  7.167075    225796_at    PXK
238223_at     2.281464  2.281464  2.394397  2.281464  2.281464    238223_at    PXK
# 最后看看有没有空字符串
> exprmt.nona[exprmt.nona$symbol=='',]
[1] GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 ID        symbol   
<0 行> (或0-长度的row.names)

# 最后得到非重复的,无NA的数据
> exprmt.final <- exprmt.nona[!duplicated(exprmt.nona$symbol),]; dim(exprmt.final)
[1] 20192     7
> exprmt.final[exprmt.final$symbol=='PXK',]
           GSM524669 GSM524670 GSM524671 GSM524672 GSM524673         ID symbol
1552274_at  5.290347   5.53995  5.081292   5.01667  5.032546 1552274_at    PXK

从上面可以看到,我们的芯片数据共有54675个features,其中注释到gene symbol的为41941个,在这些注释到的features中,有21749个是重复的。无空字符串gene symbol。最后得到非重复数据20192个(重复的则取其第一个出现的)。

下面我们下载芯片平台对应信息文件手动进行注释:
# 下载芯片对应信息
> gpl570 <- getGEO('GPL570', destdir=".")
> names(Meta(gpl570))
 [1] "contact_address"         "contact_city"            "contact_country"         "contact_email"           "contact_institute"       "contact_name"           
 [7] "contact_phone"           "contact_state"           "contact_web_link"        "contact_zip/postal_code" "data_row_count"          "description"            
[13] "distribution"            "geo_accession"           "last_update_date"        "manufacture_protocol"    "manufacturer"            "organism"               
[19] "relation"                "sample_id"               "series_id"               "status"                  "submission_date"         "taxid"                  
[25] "technology"              "title"                   "web_link"     
> colnames(Table(gpl570))
 [1] "ID"                               "GB_ACC"                           "SPOT_ID"                          "Species Scientific Name"         
 [5] "Annotation Date"                  "Sequence Type"                    "Sequence Source"                  "Target Description"              
 [9] "Representative Public ID"         "Gene Title"                       "Gene Symbol"                      "ENTREZ_GENE_ID"                  
[13] "RefSeq Transcript ID"             "Gene Ontology Biological Process" "Gene Ontology Cellular Component" "Gene Ontology Molecular Function"

# 选取其中的几列用来转换
> iddata <- Table(gpl570)[c('ID','Gene Symbol')]
> head(iddata)
         ID      Gene Symbol
1 1007_s_at DDR1 /// MIR4640
2   1053_at             RFC2
3    117_at            HSPA6
4    121_at             PAX8
5 1255_g_at           GUCA1A
6   1294_at MIR5193 /// UBA7

我们选取gpl信息中的ID、Gene Symbol用于转换。

> mann.anno <- merge(x=exprmt, y=iddata, by='ID',all.x=T,all.y=F); head(mann.anno); dim(mann.anno)
         ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP      Gene Symbol
1 1007_s_at  2.456892  2.404267  3.568920  3.307409  3.225964       DDR1 /// MIR4640
2   1053_at 10.447158 10.799832  9.765222 10.069229 10.089496      RFC2             RFC2
3    117_at  4.500872  2.702614  3.015139  2.768683  2.853809     HSPA6            HSPA6
4    121_at  3.057630  3.072568  3.057630  3.057630  3.057630      PAX8             PAX8
5 1255_g_at  2.278998  2.278998  2.278998  2.278998  2.278998    GUCA1A           GUCA1A
6   1294_at  4.978118  4.713441  4.303196  4.452989  4.766019       MIR5193 /// UBA7

可以看到用R包注释的(symbol.RP列)和直接下载平台信息注释的有些许差别,R包未能注释到的,手动注释结果是对应多个信息。下面我们进一步查看一下其区别:

> library(dplyr)
> consistent <- filter(mann.anno, symbol.RP==`Gene Symbol`); head(consistent)
         ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
1   1053_at 10.447158 10.799832  9.765222 10.069229 10.089496      RFC2        RFC2
2    117_at  4.500872  2.702614  3.015139  2.768683  2.853809     HSPA6       HSPA6
3    121_at  3.057630  3.072568  3.057630  3.057630  3.057630      PAX8        PAX8
4 1255_g_at  2.278998  2.278998  2.278998  2.278998  2.278998    GUCA1A      GUCA1A
5   1316_at  5.246463  5.547843  5.269052  4.794183  4.620719      THRA        THRA
6   1320_at  4.578203  4.446357  4.492585  4.502381  4.446357    PTPN21      PTPN21
> dim(mann.anno); dim(consistent)
[1] 54675     8
[1] 39145     8
# 可以看到一共有39145行注释结果都是相同的,那么这些中有多少是都未注释到的呢(即NA)
> consistent.na <- filter(consistent, is.na(symbol.RP));head(consistent.na)
[1] ID          GSM524669   GSM524670   GSM524671   GSM524672   GSM524673   symbol.RP   Gene Symbol
<0 行> (或0-长度的row.names)
# 结果是这些相同的中没有NA
# 再来看看注释到重复gene的情况吧
> dup <- consistent[duplicated(consistent$symbol.RP),]
> head(dup);dim(dup)
             ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP Gene Symbol
16 1552264_a_at  8.016109  7.672156  7.661139  7.835451  7.225329     MAPK1       MAPK1
20 1552272_a_at  2.279247  2.279247  2.279247  2.279247  2.279247     PRR22       PRR22
22 1552275_s_at  4.862782  5.539661  4.672476  4.709687  3.931607       PXK         PXK
26 1552279_a_at  2.436048  2.731219  2.436048  2.436048  2.464805   SLC46A1     SLC46A1
32 1552289_a_at  2.278998  2.278998  2.278998  2.278998  2.278998     CILP2       CILP2
40 1552303_a_at  4.421032  4.601013  4.614977  4.628660  4.628942  TMEM106A    TMEM106A
[1] 20267     8
# 因此最后非重复的
> dim(consistent[!duplicated(consistent$symbol.RP),])
[1] 18878     8

对两种注释结果相同的 gene symbol 进行分析,一共得到了39145个注释上的gene,这其中有20267个是重复的,因此最后得到了18878个非重复注释的gene。
我们最后再对手动注释的结果进行分析一下。

# 看看为注释到的情况(NA)
> mann.anno[is.na(mann.anno$`Gene Symbol`),]
[1] ID          GSM524669   GSM524670   GSM524671   GSM524672   GSM524673   symbol.RP   Gene Symbol
<0 行> (或0-长度的row.names)
# 没有注释不上的,看来R包注释不上的这里都注释到多个gene symbol了(如DDR1 /// MIR4640)
> library(stringr)
> multi.anno <- mann.anno[str_detect(mann.anno$`Gene Symbol`, '///'),]; head(multi.anno); dim(multi.anno)
              ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP                Gene Symbol
1      1007_s_at  2.456892  2.404267  3.568920  3.307409  3.225964                 DDR1 /// MIR4640
6        1294_at  4.978118  4.713441  4.303196  4.452989  4.766019                 MIR5193 /// UBA7
16    1552258_at  2.437344  2.458223  2.437344  2.437344  2.429044     CYTOR LINC00152 /// LOC101930489
32  1552283_s_at  2.278998  2.278998  2.278998  2.281762  2.278998             ZDHHC11 /// ZDHHC11B
97    1552388_at  2.550713  2.389557  2.389557  2.389557  2.389557              FLJ30901 /// SCUBE1
108 1552401_a_at  2.278998  2.278998  2.278998  2.294001  2.278998              BACH1 /// GRIK1-AS2
[1] 2796    8
# 这里2796个features注释到多gene symbol,但意外的是,有的多gene symbol位点R包也注释到了
# 看看dup情况怎么样
> mann.anno.rmdup <- mann.anno[!duplicated(mann.anno$`Gene Symbol`),]; head(mann.anno.rmdup); dim(mann.anno.rmdup)
         ID GSM524669 GSM524670 GSM524671 GSM524672 GSM524673 symbol.RP      Gene Symbol
1 1007_s_at  2.456892  2.404267  3.568920  3.307409  3.225964       DDR1 /// MIR4640
2   1053_at 10.447158 10.799832  9.765222 10.069229 10.089496      RFC2             RFC2
3    117_at  4.500872  2.702614  3.015139  2.768683  2.853809     HSPA6            HSPA6
4    121_at  3.057630  3.072568  3.057630  3.057630  3.057630      PAX8             PAX8
5 1255_g_at  2.278998  2.278998  2.278998  2.278998  2.278998    GUCA1A           GUCA1A
6   1294_at  4.978118  4.713441  4.303196  4.452989  4.766019       MIR5193 /// UBA7
[1] 23521     8

从上面可以看到所有芯片ID都注释到了,没有NA,其中没有重复的有23521个(其中2796个是多gene symbol,因此单个gene symbol 20725个),较R包注释到的(20276)要多一些。

ID转换

接下来我们将上面注释到的Gene symbol转换为Entrez ID,使用的是HGNC提供的ID对应文件。

> hgnc <- read.csv('D:/TCGA/hgnc_complete_set.txt',header = T,stringsAsFactors = F,sep='\t');colnames(hgnc)
 [1] "hgnc_id"                  "symbol"                   "name"                     "locus_group"              "locus_type"               "status"                  
 [7] "location"                 "location_sortable"        "alias_symbol"             "alias_name"               "prev_symbol"              "prev_name"               
[13] "gene_family"              "gene_family_id"           "date_approved_reserved"   "date_symbol_changed"      "date_name_changed"        "date_modified"           
[19] "entrez_id"                "ensembl_gene_id"          "vega_id"                  "ucsc_id"                  "ena"                      "refseq_accession"        
[25] "ccds_id"                  "uniprot_ids"              "pubmed_id"                "mgd_id"                   "rgd_id"                   "lsdb"                    
[31] "cosmic"                   "omim_id"                  "mirbase"                  "homeodb"                  "snornabase"               "bioparadigms_slc"        
[37] "orphanet"                 "pseudogene.org"           "horde_id"                 "merops"                   "imgt"                     "iuphar"                  
[43] "kznf_gene_catalog"        "mamit.trnadb"             "cd"                       "lncrnadb"                 "enzyme_id"                "intermediate_filament_db"
[49] "rna_central_ids"     

> destid <- hgnc[c('symbol','entrez_id','ensembl_gene_id')];head(destid)
    symbol entrez_id ensembl_gene_id
1     A1BG         1 ENSG00000121410
2 A1BG-AS1    503538 ENSG00000268895
3     A1CF     29974 ENSG00000148584
4      A2M         2 ENSG00000175899
5  A2M-AS1    144571 ENSG00000245105
6    A2ML1    144568 ENSG00000166535

# ID 转换
> trans <- merge(x=exprmt[!duplicated(exprmt$symbol.RP),], y=destid, by.x='symbol.RP',by.y='symbol');cat('exprmt.rmdup\n');dim(exprmt[!duplicated(exprmt$symbol.RP),]);cat('destid\n');dim(destid);cat('trans\n');dim(trans)
exprmt.rmdup
[1] 20193     7
destid
[1] 42508     3
trans
[1] 19062     9

> trans.rmdup <- trans[!duplicated(trans$symbol.RP),]; dim(trans.rmdup)
[1] 19061     9

> trans.rmdup <- trans[!duplicated(trans$entrez_id),]; dim(trans.rmdup)
[1] 19060     9

可以看到,使用HGNC提供的ID转换文件,我们共得到了19062个无重复ID,比上面注释到的少了快2000个,对于这些差异基因有机会在探究一下。

你可能感兴趣的:(ID转换那点事)