如何将TCGA文件ID(UUID)转换为TCGA 样本ID(Barcode)

我遇到的问题:从TCGA GDC把表达谱的数据下载到服务器后,发现表达矩阵有两列,一列是ENSG号,一列是表达值,而文件名为一串数字,虽然在manifest文件里可以找到每个文件名所对应的TCGA UUID,但是找不到UUID所对应的样本名,那就不知道哪个表达谱是哪个样本的了。。。

所以我们要实现根据表达矩阵的UUID找到对应的样本名,我搜索到的一些中文教程(比如从TCGA数据库下载并整合清洗高通量肿瘤表达谱-临床性状数据)是从 manifest 文件里获得的样本名,但是我所下载的manifest文件根本没有样本名,只有矩阵文件名和对应的UUID

因此试了下全英文搜索,搜到了几篇,方法都是差不多的,但有的行有的不行,经过一番测试得到了最终的解决方案,特此记录

参考文章:GenomicDataCommons Example: UUID to TCGA and TARGET Barcode Translation

我演示的主文件夹为test_gdc,以三个样本为例,分别为:TCGA-G3-A3CGTCGA-2Y-A9H4TCGA-BC-A112,获取这三个样本的manifest文件,使用gdc-client工具下载数据,每个样本包含3中表达谱矩阵,所以一共9个文件

不会下载的朋友可以参考:TCGA 数据下载工具 -- gdc-client

下载过程:

gdc-client download -m gdc_manifest_20210130_070310.txt -d out
100% [###############################################################################################################]
100% [####################################################################################] Time:  0:00:01 128.9 KiB/s
100% [####################################################################################] Time:  0:00:02 200.0 KiB/s
100% [####################################################################################] Time:  0:00:02 202.0 KiB/s
100% [####################################################################################] Time:  0:00:02 182.0 KiB/s
100% [####################################################################################] Time:  0:00:01 125.3 KiB/s
100% [####################################################################################] Time:  0:00:02 190.7 KiB/s
Successfully downloaded: 9
# 下载的表达矩阵位于out文件夹,这里列出所有表达矩阵
list.files("out", ".gz$", recursive = TRUE)
[1] "1c5adeed-f154-4b7c-902d-779c73c0f256/9dd14a4d-a863-4994-b5c6-b5bec92ff818.htseq.counts.gz"
[2] "49861d17-3b26-47ef-a8c6-7005be07a81d/9700948e-21f5-42ae-9cb5-f18c745887ce.htseq.counts.gz"
[3] "49dd5911-0309-4442-bdbe-762e0767f0c9/9700948e-21f5-42ae-9cb5-f18c745887ce.FPKM.txt.gz"    
[4] "50aedc81-a2e9-497b-9163-6d216819b850/9dd14a4d-a863-4994-b5c6-b5bec92ff818.FPKM-UQ.txt.gz" 
[5] "7726c67e-fa7b-4ea6-bb0b-62173aee8a4d/9dd14a4d-a863-4994-b5c6-b5bec92ff818.FPKM.txt.gz"    
[6] "7788f2a7-9755-4c9d-8ac2-0aa8e91f2fee/b7e1f458-b426-4e4e-b4db-e0472396198b.FPKM-UQ.txt.gz" 
[7] "8bf550b9-0ae8-472a-9215-115c91f8593b/b7e1f458-b426-4e4e-b4db-e0472396198b.htseq.counts.gz"
[8] "95268ab8-e4b2-48c2-8a6f-345536031ff8/9700948e-21f5-42ae-9cb5-f18c745887ce.FPKM-UQ.txt.gz" 
[9] "f012dc4b-ad50-4412-98d6-fc97147f87b2/b7e1f458-b426-4e4e-b4db-e0472396198b.FPKM.txt.gz"

# 接下来以只整合FPKM文件作为演示

# process FPKM files
suppressPackageStartupMessages(library("R.utils"))
dir.create("FPKM")
manifest = read.table("gdc_manifest_20210130_070310.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
FPKM = list.files("out", ".FPKM.txt.gz$", recursive = TRUE, full.names = TRUE)
> FPKM
[1] "out/49dd5911-0309-4442-bdbe-762e0767f0c9/9700948e-21f5-42ae-9cb5-f18c745887ce.FPKM.txt.gz"
[2] "out/7726c67e-fa7b-4ea6-bb0b-62173aee8a4d/9dd14a4d-a863-4994-b5c6-b5bec92ff818.FPKM.txt.gz"
[3] "out/f012dc4b-ad50-4412-98d6-fc97147f87b2/b7e1f458-b426-4e4e-b4db-e0472396198b.FPKM.txt.gz"
Ns = length(FPKM)

for(i in 1:Ns){
    
    cat("Remain", Ns - i, "\n")
    gzFile = FPKM[i]
    gunzip(gzFile, remove = FALSE)
    fileIn = gsub(".gz", "", gzFile)
    xExpM = read.table(fileIn)
    values = xExpM[, 2]
    if(i == 1){
        expM = xExpM
    } else{
        expM = cbind(expM, values)
    }
}

# file.remove(gsub(".gz", "", FPKM))
expM[1:10, 1:4]
                   V1           V2       values     values.1
1   ENSG00000242268.2  0.000000000  0.079969868 2.931706e-02
2   ENSG00000270112.3  0.000000000  0.000000000 4.067835e-03
3  ENSG00000167578.15  1.339747823  3.562183318 2.490200e+00
4   ENSG00000273842.1  0.000000000  0.000000000 0.000000e+00
5   ENSG00000078237.5  2.804457587  2.629508721 1.872878e+00
6  ENSG00000146083.10  2.782951753  4.870320370 2.793491e+01
7   ENSG00000225275.4  0.000000000  0.000000000 0.000000e+00
8  ENSG00000158486.12  0.003219903  0.001897681 2.212301e-01
9  ENSG00000198242.12 46.515190225 65.353703728 2.916032e+02
10  ENSG00000259883.1  0.156192432  0.110464214 0.000000e+00

rownames(expM) = expM[, 1]
expM = expM[, -1]
expM[1:10, 1:3]
                             V2       values     values.1
ENSG00000242268.2   0.000000000  0.079969868 2.931706e-02
ENSG00000270112.3   0.000000000  0.000000000 4.067835e-03
ENSG00000167578.15  1.339747823  3.562183318 2.490200e+00
ENSG00000273842.1   0.000000000  0.000000000 0.000000e+00
ENSG00000078237.5   2.804457587  2.629508721 1.872878e+00
ENSG00000146083.10  2.782951753  4.870320370 2.793491e+01
ENSG00000225275.4   0.000000000  0.000000000 0.000000e+00
ENSG00000158486.12  0.003219903  0.001897681 2.212301e-01
ENSG00000198242.12 46.515190225 65.353703728 2.916032e+02
ENSG00000259883.1   0.156192432  0.110464214 0.000000e+00

# UUID to TCGA barcode
UUID = manifest$id[match(basename(FPKM), manifest$filename)]
UUID
# BiocManager::install("GenomicDataCommons")
suppressPackageStartupMessages(library("GenomicDataCommons"))

TCGAtranslateID = function(file_ids, legacy = FALSE) {
    info = files(legacy = legacy) %>%
        filter( ~ file_id %in% file_ids) %>%
        select('cases.samples.submitter_id') %>%
        results_all()
    # The mess of code below is to extract TCGA barcodes
    # id_list will contain a list (one item for each file_id)
    # of TCGA barcodes of the form 'TCGA-XX-YYYY-ZZZ'
    id_list = lapply(info$cases,function(a) {
        a[[1]][[1]][[1]]})
    # so we can later expand to a data.frame of the right size
    barcodes_per_file = sapply(id_list,length)
    # And build the data.frame
    return(data.frame(file_id = rep(ids(info),barcodes_per_file),
                      submitter_id = unlist(id_list)))
}
TCGA_barcode = as.character(TCGAtranslateID(UUID)[, 2])
colnames(expM) = TCGA_barcode
expM[1:10, 1:3]
                   TCGA-G3-A3CG-01A TCGA-2Y-A9H4-01A TCGA-BC-A112-01A
ENSG00000242268.2       0.000000000      0.079969868     2.931706e-02
ENSG00000270112.3       0.000000000      0.000000000     4.067835e-03
ENSG00000167578.15      1.339747823      3.562183318     2.490200e+00
ENSG00000273842.1       0.000000000      0.000000000     0.000000e+00
ENSG00000078237.5       2.804457587      2.629508721     1.872878e+00
ENSG00000146083.10      2.782951753      4.870320370     2.793491e+01
ENSG00000225275.4       0.000000000      0.000000000     0.000000e+00
ENSG00000158486.12      0.003219903      0.001897681     2.212301e-01
ENSG00000198242.12     46.515190225     65.353703728     2.916032e+02
ENSG00000259883.1       0.156192432      0.110464214     0.000000e+00

又经过两天点的学习,发现文件名和文件ID以及样本ID的对照表是可以下载到的,就是metadata文件,但是我目前发现只能从购物车中下载,所以必须要添加先到购物车,再下载:

从这个metadata文件获取文件名和文件ID以及样本ID的对照表非常简单,索性写了一个小函数:

getIDtable <- function(metaFile){
  
  suppressPackageStartupMessages(library("jsonlite"))
  meta = jsonlite::fromJSON(metaFile)
  IDtable = meta[, c("associated_entities", "file_name", "file_id")]
  TCGA_barcode = sapply(IDtable$associated_entities, function(x) x[[1]][1])
  IDtable[, 1] = TCGA_barcode
  colnames(IDtable)[1] = "barcode"
  return(IDtable)
}

# let's use it
IDtable = getIDtable("samplesheet/FPKM_metadata.cart.2021-02-01.json")
IDtable[1:5, ]
                       barcode                                        file_name                              file_id
1 TCGA-DD-A1EJ-11A-11R-A155-07 7a3a131b-883d-4f82-8b3b-ede7733f68d8.FPKM.txt.gz 95e6e420-6b86-4034-aa6c-369c38c8840a
2 TCGA-DD-A4NQ-01A-21R-A28V-07 51c2d807-7ee6-4b42-a6a7-3eb14f987bc0.FPKM.txt.gz c8546523-a711-4b5f-97ff-7a3c6ca9413f
3 TCGA-G3-A5SM-01A-12R-A28V-07 fdb62f73-33a7-44c3-950c-739383b9dd30.FPKM.txt.gz e62a1625-73f9-49e4-9922-d15a6e18ee72
4 TCGA-MI-A75E-01A-11R-A32O-07 932b63bf-d723-40dd-a5f5-21830b8ea06e.FPKM.txt.gz 72accbef-4357-45e3-9d31-5dd4eb8d3ded
5 TCGA-DD-A11D-01A-11R-A131-07 9ccd91ba-7180-443a-b03b-f9b398c679e4.FPKM.txt.gz 2a88aff5-a29d-434a-9859-c60f47bcf75e

有个这个table,麻麻再也不用担心我找不到样本ID了~


觉得有用的老铁麻烦点个小爱心~

你可能感兴趣的:(如何将TCGA文件ID(UUID)转换为TCGA 样本ID(Barcode))