我遇到的问题:从TCGA GDC把表达谱的数据下载到服务器后,发现表达矩阵有两列,一列是ENSG号,一列是表达值,而文件名为一串数字,虽然在manifest文件里可以找到每个文件名所对应的TCGA UUID,但是找不到UUID所对应的样本名,那就不知道哪个表达谱是哪个样本的了。。。
所以我们要实现根据表达矩阵的UUID找到对应的样本名,我搜索到的一些中文教程(比如从TCGA数据库下载并整合清洗高通量肿瘤表达谱-临床性状数据)是从 manifest 文件里获得的样本名,但是我所下载的manifest文件根本没有样本名,只有矩阵文件名和对应的UUID
因此试了下全英文搜索,搜到了几篇,方法都是差不多的,但有的行有的不行,经过一番测试得到了最终的解决方案,特此记录
参考文章:GenomicDataCommons Example: UUID to TCGA and TARGET Barcode Translation
我演示的主文件夹为test_gdc,以三个样本为例,分别为:TCGA-G3-A3CG
、TCGA-2Y-A9H4
、TCGA-BC-A112
,获取这三个样本的manifest文件,使用gdc-client工具下载数据,每个样本包含3中表达谱矩阵,所以一共9个文件
不会下载的朋友可以参考:TCGA 数据下载工具 -- gdc-client
下载过程:
gdc-client download -m gdc_manifest_20210130_070310.txt -d out
100% [###############################################################################################################]
100% [####################################################################################] Time: 0:00:01 128.9 KiB/s
100% [####################################################################################] Time: 0:00:02 200.0 KiB/s
100% [####################################################################################] Time: 0:00:02 202.0 KiB/s
100% [####################################################################################] Time: 0:00:02 182.0 KiB/s
100% [####################################################################################] Time: 0:00:01 125.3 KiB/s
100% [####################################################################################] Time: 0:00:02 190.7 KiB/s
Successfully downloaded: 9
# 下载的表达矩阵位于out文件夹,这里列出所有表达矩阵
list.files("out", ".gz$", recursive = TRUE)
[1] "1c5adeed-f154-4b7c-902d-779c73c0f256/9dd14a4d-a863-4994-b5c6-b5bec92ff818.htseq.counts.gz"
[2] "49861d17-3b26-47ef-a8c6-7005be07a81d/9700948e-21f5-42ae-9cb5-f18c745887ce.htseq.counts.gz"
[3] "49dd5911-0309-4442-bdbe-762e0767f0c9/9700948e-21f5-42ae-9cb5-f18c745887ce.FPKM.txt.gz"
[4] "50aedc81-a2e9-497b-9163-6d216819b850/9dd14a4d-a863-4994-b5c6-b5bec92ff818.FPKM-UQ.txt.gz"
[5] "7726c67e-fa7b-4ea6-bb0b-62173aee8a4d/9dd14a4d-a863-4994-b5c6-b5bec92ff818.FPKM.txt.gz"
[6] "7788f2a7-9755-4c9d-8ac2-0aa8e91f2fee/b7e1f458-b426-4e4e-b4db-e0472396198b.FPKM-UQ.txt.gz"
[7] "8bf550b9-0ae8-472a-9215-115c91f8593b/b7e1f458-b426-4e4e-b4db-e0472396198b.htseq.counts.gz"
[8] "95268ab8-e4b2-48c2-8a6f-345536031ff8/9700948e-21f5-42ae-9cb5-f18c745887ce.FPKM-UQ.txt.gz"
[9] "f012dc4b-ad50-4412-98d6-fc97147f87b2/b7e1f458-b426-4e4e-b4db-e0472396198b.FPKM.txt.gz"
# 接下来以只整合FPKM文件作为演示
# process FPKM files
suppressPackageStartupMessages(library("R.utils"))
dir.create("FPKM")
manifest = read.table("gdc_manifest_20210130_070310.txt", header = TRUE, sep = "\t", stringsAsFactors = FALSE)
FPKM = list.files("out", ".FPKM.txt.gz$", recursive = TRUE, full.names = TRUE)
> FPKM
[1] "out/49dd5911-0309-4442-bdbe-762e0767f0c9/9700948e-21f5-42ae-9cb5-f18c745887ce.FPKM.txt.gz"
[2] "out/7726c67e-fa7b-4ea6-bb0b-62173aee8a4d/9dd14a4d-a863-4994-b5c6-b5bec92ff818.FPKM.txt.gz"
[3] "out/f012dc4b-ad50-4412-98d6-fc97147f87b2/b7e1f458-b426-4e4e-b4db-e0472396198b.FPKM.txt.gz"
Ns = length(FPKM)
for(i in 1:Ns){
cat("Remain", Ns - i, "\n")
gzFile = FPKM[i]
gunzip(gzFile, remove = FALSE)
fileIn = gsub(".gz", "", gzFile)
xExpM = read.table(fileIn)
values = xExpM[, 2]
if(i == 1){
expM = xExpM
} else{
expM = cbind(expM, values)
}
}
# file.remove(gsub(".gz", "", FPKM))
expM[1:10, 1:4]
V1 V2 values values.1
1 ENSG00000242268.2 0.000000000 0.079969868 2.931706e-02
2 ENSG00000270112.3 0.000000000 0.000000000 4.067835e-03
3 ENSG00000167578.15 1.339747823 3.562183318 2.490200e+00
4 ENSG00000273842.1 0.000000000 0.000000000 0.000000e+00
5 ENSG00000078237.5 2.804457587 2.629508721 1.872878e+00
6 ENSG00000146083.10 2.782951753 4.870320370 2.793491e+01
7 ENSG00000225275.4 0.000000000 0.000000000 0.000000e+00
8 ENSG00000158486.12 0.003219903 0.001897681 2.212301e-01
9 ENSG00000198242.12 46.515190225 65.353703728 2.916032e+02
10 ENSG00000259883.1 0.156192432 0.110464214 0.000000e+00
rownames(expM) = expM[, 1]
expM = expM[, -1]
expM[1:10, 1:3]
V2 values values.1
ENSG00000242268.2 0.000000000 0.079969868 2.931706e-02
ENSG00000270112.3 0.000000000 0.000000000 4.067835e-03
ENSG00000167578.15 1.339747823 3.562183318 2.490200e+00
ENSG00000273842.1 0.000000000 0.000000000 0.000000e+00
ENSG00000078237.5 2.804457587 2.629508721 1.872878e+00
ENSG00000146083.10 2.782951753 4.870320370 2.793491e+01
ENSG00000225275.4 0.000000000 0.000000000 0.000000e+00
ENSG00000158486.12 0.003219903 0.001897681 2.212301e-01
ENSG00000198242.12 46.515190225 65.353703728 2.916032e+02
ENSG00000259883.1 0.156192432 0.110464214 0.000000e+00
# UUID to TCGA barcode
UUID = manifest$id[match(basename(FPKM), manifest$filename)]
UUID
# BiocManager::install("GenomicDataCommons")
suppressPackageStartupMessages(library("GenomicDataCommons"))
TCGAtranslateID = function(file_ids, legacy = FALSE) {
info = files(legacy = legacy) %>%
filter( ~ file_id %in% file_ids) %>%
select('cases.samples.submitter_id') %>%
results_all()
# The mess of code below is to extract TCGA barcodes
# id_list will contain a list (one item for each file_id)
# of TCGA barcodes of the form 'TCGA-XX-YYYY-ZZZ'
id_list = lapply(info$cases,function(a) {
a[[1]][[1]][[1]]})
# so we can later expand to a data.frame of the right size
barcodes_per_file = sapply(id_list,length)
# And build the data.frame
return(data.frame(file_id = rep(ids(info),barcodes_per_file),
submitter_id = unlist(id_list)))
}
TCGA_barcode = as.character(TCGAtranslateID(UUID)[, 2])
colnames(expM) = TCGA_barcode
expM[1:10, 1:3]
TCGA-G3-A3CG-01A TCGA-2Y-A9H4-01A TCGA-BC-A112-01A
ENSG00000242268.2 0.000000000 0.079969868 2.931706e-02
ENSG00000270112.3 0.000000000 0.000000000 4.067835e-03
ENSG00000167578.15 1.339747823 3.562183318 2.490200e+00
ENSG00000273842.1 0.000000000 0.000000000 0.000000e+00
ENSG00000078237.5 2.804457587 2.629508721 1.872878e+00
ENSG00000146083.10 2.782951753 4.870320370 2.793491e+01
ENSG00000225275.4 0.000000000 0.000000000 0.000000e+00
ENSG00000158486.12 0.003219903 0.001897681 2.212301e-01
ENSG00000198242.12 46.515190225 65.353703728 2.916032e+02
ENSG00000259883.1 0.156192432 0.110464214 0.000000e+00
又经过两天点的学习,发现文件名和文件ID以及样本ID的对照表是可以下载到的,就是metadata文件,但是我目前发现只能从购物车中下载,所以必须要添加先到购物车,再下载:
从这个metadata文件获取文件名和文件ID以及样本ID的对照表非常简单,索性写了一个小函数:
getIDtable <- function(metaFile){
suppressPackageStartupMessages(library("jsonlite"))
meta = jsonlite::fromJSON(metaFile)
IDtable = meta[, c("associated_entities", "file_name", "file_id")]
TCGA_barcode = sapply(IDtable$associated_entities, function(x) x[[1]][1])
IDtable[, 1] = TCGA_barcode
colnames(IDtable)[1] = "barcode"
return(IDtable)
}
# let's use it
IDtable = getIDtable("samplesheet/FPKM_metadata.cart.2021-02-01.json")
IDtable[1:5, ]
barcode file_name file_id
1 TCGA-DD-A1EJ-11A-11R-A155-07 7a3a131b-883d-4f82-8b3b-ede7733f68d8.FPKM.txt.gz 95e6e420-6b86-4034-aa6c-369c38c8840a
2 TCGA-DD-A4NQ-01A-21R-A28V-07 51c2d807-7ee6-4b42-a6a7-3eb14f987bc0.FPKM.txt.gz c8546523-a711-4b5f-97ff-7a3c6ca9413f
3 TCGA-G3-A5SM-01A-12R-A28V-07 fdb62f73-33a7-44c3-950c-739383b9dd30.FPKM.txt.gz e62a1625-73f9-49e4-9922-d15a6e18ee72
4 TCGA-MI-A75E-01A-11R-A32O-07 932b63bf-d723-40dd-a5f5-21830b8ea06e.FPKM.txt.gz 72accbef-4357-45e3-9d31-5dd4eb8d3ded
5 TCGA-DD-A11D-01A-11R-A131-07 9ccd91ba-7180-443a-b03b-f9b398c679e4.FPKM.txt.gz 2a88aff5-a29d-434a-9859-c60f47bcf75e
有个这个table,麻麻再也不用担心我找不到样本ID了~
觉得有用的老铁麻烦点个小爱心~