Introduction to the TCGA Database

Reposted from: https://biozx.top/TCGA-introduce.html

Overview

The Cancer Genome Atlas (TCGA) is a project launched jointly in 2006 by the US National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI). It covers 33 cancer types.

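Using genome analysis technologies centered on large-scale sequencing, and relying on broad collaboration, TCGA aims to understand the molecular mechanisms of cancer, improve the scientific understanding of its molecular basis, and strengthen our ability to diagnose, treat, and prevent it. The end goal is a complete "atlas" of the genomic alterations associated with all cancers.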

library(TCGAbiolinks)
tmp <- getGDCprojects()
# GDC listed the following 40 projects when this was run: 33 TCGA projects plus TARGET and FM-AD
tmp$project_id
 [1] "TCGA-READ"   "TCGA-THCA"   "TARGET-CCSK" "TCGA-MESO"   "TCGA-SARC"   "TARGET-AML"  "TCGA-LGG"   
 [8] "TARGET-NBL"  "TCGA-ACC"    "TCGA-CESC"   "TCGA-KIRP"   "TCGA-PAAD"   "TARGET-WT"   "TCGA-PCPG"  
[15] "TCGA-UCS"    "TCGA-LUAD"   "TCGA-BLCA"   "TCGA-OV"     "TCGA-CHOL"   "TCGA-SKCM"   "TCGA-GBM"   
[22] "TCGA-KIRC"   "TCGA-BRCA"   "TCGA-UCEC"   "TCGA-PRAD"   "TCGA-LAML"   "TCGA-STAD"   "TCGA-LUSC"  
[29] "TCGA-KICH"   "TCGA-TGCT"   "TCGA-DLBC"   "TCGA-THYM"   "TCGA-UVM"    "FM-AD"       "TARGET-OS"  
[36] "TCGA-HNSC"   "TCGA-ESCA"   "TCGA-COAD"   "TCGA-LIHC"   "TARGET-RT" 
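
The listing above mixes TCGA projects with TARGET and FM-AD entries. As a minimal sketch (base R only, with the IDs copied by hand from the output above rather than fetched from GDC), the projects can be grouped by the program prefix that precedes the first hyphen:

```r
# Project IDs copied from the printed output above
project_ids <- c(
  "TCGA-READ", "TCGA-THCA", "TARGET-CCSK", "TCGA-MESO", "TCGA-SARC", "TARGET-AML", "TCGA-LGG",
  "TARGET-NBL", "TCGA-ACC", "TCGA-CESC", "TCGA-KIRP", "TCGA-PAAD", "TARGET-WT", "TCGA-PCPG",
  "TCGA-UCS", "TCGA-LUAD", "TCGA-BLCA", "TCGA-OV", "TCGA-CHOL", "TCGA-SKCM", "TCGA-GBM",
  "TCGA-KIRC", "TCGA-BRCA", "TCGA-UCEC", "TCGA-PRAD", "TCGA-LAML", "TCGA-STAD", "TCGA-LUSC",
  "TCGA-KICH", "TCGA-TGCT", "TCGA-DLBC", "TCGA-THYM", "TCGA-UVM", "FM-AD", "TARGET-OS",
  "TCGA-HNSC", "TCGA-ESCA", "TCGA-COAD", "TCGA-LIHC", "TARGET-RT"
)

# The program name is everything before the first hyphen
program <- sub("-.*$", "", project_ids)

# Count projects per program
program_counts <- table(program)
print(program_counts)

# Keep only the TCGA projects
tcga_only <- project_ids[program == "TCGA"]
print(length(tcga_only))
```

With a live session, the same two lines could be applied to tmp$project_id instead of the hand-copied vector.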

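Data types

Data type    Description
Clinical     Basic patient information: diagnosis, TNM stage, tumor pathology, survival status, and so on
mRNA         mRNA expression measured by mRNA microarray or RNA-seq
microRNA     microRNA expression measured by microRNA microarray or RNA-seq
CopyNumber   Tumor-versus-normal ratio for each chromosomal segment, measured on SNP arrays
Mutation     Nucleotide changes in the tumor sequencing data relative to the reference genome, including insertions and deletions
Protein      Expression of more than 200 cancer-related proteins, measured on protein arrays
Methylation  Degree of DNA methylation, measured on methylation arrays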

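1. Clinical data

TCGA clinical data comes in two forms:

  • XML data: the most complete form, including radiation, drug, follow-up, and biospecimen information.
  • indexed data: contains only the final status. For example, if a patient is first recorded as alive and later as dead, the indexed data keeps only the dead record, whereas the XML keeps both.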
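
The difference between the two forms can be illustrated with a toy example. This is only a sketch of the idea, not how GDC builds its indexed files: the data frame and its column names (patient, days_to_event, vital_status) are hypothetical.

```r
# Hypothetical follow-up records: one row per status update (XML-like view)
follow_up <- data.frame(
  patient = c("P1", "P1", "P2"),
  days_to_event = c(100, 450, 300),
  vital_status = c("Alive", "Dead", "Alive"),
  stringsAsFactors = FALSE
)

# Sort by patient and time, then keep only the last record of each patient
# (indexed-like view: final status only)
ordered <- follow_up[order(follow_up$patient, follow_up$days_to_event), ]
indexed <- ordered[!duplicated(ordered$patient, fromLast = TRUE), ]
print(indexed)
```

Patient P1 has two rows in the full table but only the final "Dead" record in the collapsed one.
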
Downloading indexed data
library(DT)  # provides datatable()
clinical <- GDCquery_clinic(project = "TCGA-LUAD", type = "clinical")
datatable(clinical, filter = 'top',
          options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
          rownames = FALSE)
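Downloading XML data
# Query the clinical files of two COAD patients.
# Note: depending on the TCGAbiolinks version, the query may also need
# data.type = "Clinical Supplement" and data.format = "BCR XML" to select the XML files.
query <- GDCquery(project = "TCGA-COAD",
                  data.category = "Clinical",
                  barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
# Parse the patient section of the XML files into a data frame
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
datatable(clinical, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)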

2. mRNA expression data

Expression files come in three types: HTSeq counts, FPKM, and FPKM-UQ.
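
To make the relationship between the three file types concrete, here is a minimal sketch that derives FPKM and FPKM-UQ from raw counts. It assumes the commonly documented definitions (count × 10^9 divided by gene length times either the total protein-coding read count or the 75th-percentile count); the exact denominators used by a given pipeline version are an assumption here, and the genes, counts, and lengths are made up.

```r
# Hypothetical raw counts and gene lengths (bp); all three genes are
# treated as protein-coding for the purpose of this toy example
counts <- c(geneA = 100, geneB = 400, geneC = 500)
gene_length <- c(geneA = 1000, geneB = 2000, geneC = 5000)

# FPKM: normalize by gene length and by the total number of counted reads
total_reads <- sum(counts)
fpkm <- counts * 1e9 / (gene_length * total_reads)

# FPKM-UQ: same idea, but the library-size term is the upper quartile
# (75th percentile) of the counts instead of their sum
upper_quartile <- unname(quantile(counts, 0.75))
fpkm_uq <- counts * 1e9 / (gene_length * upper_quartile)

print(round(fpkm))
print(round(fpkm_uq))
```

The two measures differ only by a per-sample constant (total reads divided by the upper quartile), so the ranking of genes within a sample is the same in both.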

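# Query the data (legacy hg19 archive); this builds the query only and downloads nothing.
# Note: legacy = TRUE targets the GDC legacy archive, which recent TCGAbiolinks releases may no longer support.
query.exp.hg19 <- GDCquery(project = "TCGA-GBM",
                  data.category = "Gene expression",
                  data.type = "Gene expression quantification",
                  platform = "Illumina HiSeq",
                  file.type  = "normalized_results",
                  experimental.strategy = "RNA-Seq",
                  barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
                  legacy = TRUE)

# Show the matching files in an interactive table
datatable(getResults(query.exp.hg19),
              filter = 'top',
              options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
              rownames = FALSE)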

3. microRNA data

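miRNA data records the output of the miRNA quantification analysis. Reads are first aligned with BWA and then annotated against miRBase v21 and UCSC. Because only miRNAs already present in miRBase can be annotated, this analysis cannot be used to identify novel miRNAs.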

miRNA Expression Quantification

Raw read counts are written to the mirnas.quantification.txt file. Reads that map to more than one miRNA are flagged in the cross-mapped column. The file associates each miRNA ID with a read count and a normalized count in reads per million miRNA-mapped reads (RPM).
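
The normalized count can be reproduced from the raw counts in a few lines. This is a minimal sketch that assumes the denominator is the total number of reads mapped to miRNAs in the sample; the miRNA IDs and counts are made-up illustration values.

```r
# Hypothetical raw read counts for three miRNAs in one sample
mirna_counts <- c("hsa-let-7a-1" = 5000, "hsa-mir-21" = 14000, "hsa-mir-155" = 1000)

# Reads per million miRNA-mapped reads:
# each count divided by the sample total, scaled to one million
rpm <- mirna_counts / sum(mirna_counts) * 1e6
print(rpm)
```

By construction the RPM values of a sample sum to one million, which makes samples with different sequencing depth comparable.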

Isoform Expression Quantification

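RPM counts are recorded in the isoforms.quantification.txt file. It contains all columns of the miRNA expression quantification file, plus the genomic coordinates of each isoform and the miRNA region annotation (precursor or mature, with accession).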

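4. Copy number data

Copy number data comes from the Affymetrix SNP 6.0 array and is derived from TCGA level 2 data. The final output is a txt file with five fields: segment name, chromosome, genomic position, number of array probes covering the segment, and segment_mean.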
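
The segment_mean field is easier to interpret once converted back to an estimated copy number. The sketch below assumes segment_mean is log2(copy number / 2), as GDC describes it; the segments are made up, and the ±0.3 cut-offs for calling gains and losses are an illustrative assumption, not a TCGA standard.

```r
# Hypothetical segments in the same spirit as the txt file described above
segments <- data.frame(
  chromosome = c("1", "1", "8"),
  num_probes = c(1200, 85, 430),
  segment_mean = c(0, -1, 0.585)
)

# Invert segment_mean = log2(copy_number / 2)
segments$copy_number <- 2 * 2^segments$segment_mean

# Label each segment using illustrative thresholds of -0.3 and 0.3
segments$call <- cut(segments$segment_mean,
                     breaks = c(-Inf, -0.3, 0.3, Inf),
                     labels = c("loss", "neutral", "gain"))
print(segments)
```

A segment_mean of 0 corresponds to the normal two copies, -1 to a single copy, and roughly 0.585 to three copies.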

library(TCGAbiolinks)
library(DT)
# Download copy number data
query <- GDCquery(project = "TCGA-ACC", 
                  data.category = "Copy Number Variation",
                  data.type = "Copy Number Segment",
                  barcode = c( "TCGA-OR-A5KU-01A-11D-A29H-01", "TCGA-OR-A5JK-01A-11D-A29H-01"))
GDCdownload(query)
data <- GDCprepare(query)
datatable(data)

5. Methylation data

The following platforms are included:

  • Illumina Human Methylation 450
  • Illumina Human Methylation 27
  • Illumina DNA Methylation OMA003 CPI
  • Illumina DNA Methylation OMA002 CPI
  • Illumina Hi Seq

The files contain the following columns:

Column             Description
Composite Element  A unique ID for the array probe associated with a CpG site
Beta Value         The ratio between the methylated array intensity and the total array intensity; falls between 0 (lower levels of methylation) and 1 (higher levels of methylation)
Chromosome         The chromosome in which the probe binding site is located
Start              The start of the CpG site on the chromosome
End                The end of the CpG site on the chromosome
Gene Symbol        The symbol for genes associated with the CpG site. Genes that fall within 1,500 bp upstream of the transcription start site (TSS) to the end of the gene body are used.
Gene Type          A general classification for each gene (e.g. protein coding, miRNA, pseudogene)
Transcript ID      Ensembl transcript IDs for each transcript associated with the genes detailed above
Position to TSS    Distance in base pairs from the CpG site to each associated transcript's start site
CGI Coordinate     The start and end coordinates of the CpG island associated with the CpG site
Feature Type       The position of the CpG site in reference to the island: Island, N_Shore or S_Shore (0-2 kb upstream or downstream from CGI), or N_Shelf or S_Shelf (2-4 kb upstream or downstream from CGI)

# Download methylation data
query_met.hg38 <- GDCquery(project= "TCGA-LGG", 
                           data.category = "DNA Methylation", 
                           platform = "Illumina Human Methylation 450", 
                           barcode = c("TCGA-HT-8111-01A-11D-2399-05","TCGA-HT-A5R5-01A-11D-A28N-05"))
GDCdownload(query_met.hg38)
data.hg38 <- GDCprepare(query_met.hg38)
library(SummarizedExperiment)
datatable(as.data.frame(colData(data.hg38)))
datatable(assay(data.hg38)[1:10,])
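
The Beta Value column described earlier can be illustrated with a small calculation. This is a minimal sketch using made-up probe intensities; it follows the plain ratio given in the column description (methylated over total intensity) and ignores any stabilizing offset that a particular processing pipeline may add. The M-value shown alongside is the commonly used logit transform of beta.

```r
# Hypothetical methylated and unmethylated intensities for three probes
methylated   <- c(cg_a = 900, cg_b = 500, cg_c = 50)
unmethylated <- c(cg_a = 100, cg_b = 500, cg_c = 950)

# Beta value: methylated intensity over total intensity, always in [0, 1]
beta <- methylated / (methylated + unmethylated)

# M-value: log2 ratio of methylated to unmethylated signal
m_value <- log2(beta / (1 - beta))

print(beta)
print(round(m_value, 2))
```

A probe with equal methylated and unmethylated signal has a beta of 0.5 and an M-value of 0.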

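Data levels

Level  Type                              Description
1      Raw data (BAM files)              Low-level, non-normalized data for a single sample
2      Processed data                    Normalized data for a single sample
3      Segmented or interpreted data     Processed data from single samples aggregated, with probed loci grouped into larger contiguous regions
4      Regions of interest or summaries  Quantified associations across samples, based on two or more data types, molecular abnormalities, sample characteristics, and clinical variables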

Sample type codes

Code  Short code  Description
01    TP          Primary Solid Tumor
02    TR          Recurrent Solid Tumor
03    TB          Primary Blood Derived Cancer - Peripheral Blood
04    TRBM        Recurrent Blood Derived Cancer - Bone Marrow
05    TAP         Additional - New Primary
06    TM          Metastatic
07    TAM         Additional Metastatic
08    THOC        Human Tumor Original Cells
09    TBM         Primary Blood Derived Cancer - Bone Marrow
10    NB          Blood Derived Normal
11    NT          Solid Tissue Normal
12    NBC         Buccal Cell Normal
13    NEBV        EBV Immortalized Normal
14    NBM         Bone Marrow Normal
20    CELLC       Control Analyte
40    TRB         Recurrent Blood Derived Cancer - Peripheral Blood
50    CELL        Cell Lines
60    XP          Primary Xenograft Tissue
61    XCL         Cell Line Derived Xenograft Tissue
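
The two-digit code in the table is embedded in every sample barcode, in the fourth hyphen-separated field (characters 14 and 15). A minimal base-R sketch of reading it back out; the lookup vector covers only the three codes used in this example.

```r
# Three example barcodes
barcodes <- c("TCGA-G9-6336-01A-11R-1789-07",
              "TCGA-G9-6336-11A-11R-1789-07",
              "TCGA-G9-7036-10A-11R-1789-07")

# Characters 14-15 hold the sample type code
sample_code <- substr(barcodes, 14, 15)

# Partial lookup from numeric code to short code (see the table above)
code_to_label <- c("01" = "TP", "10" = "NB", "11" = "NT")
sample_label <- unname(code_to_label[sample_code])

print(data.frame(barcode = barcodes, code = sample_code, label = sample_label))
```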

Filtering samples

library(TCGAbiolinks)
bar <- c("TCGA-G9-6378-02A-11R-1789-07", "TCGA-CH-5767-04A-11R-1789-07",  
         "TCGA-G9-6332-60A-11R-1789-07", "TCGA-G9-6336-01A-11R-1789-07",
         "TCGA-G9-6336-11A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07",
         "TCGA-G9-7336-04A-11R-1789-07", "TCGA-G9-7336-14A-11R-1789-07",
         "TCGA-G9-7036-04A-11R-1789-07", "TCGA-G9-7036-02A-11R-1789-07",
         "TCGA-G9-7036-11A-11R-1789-07", "TCGA-G9-7036-03A-11R-1789-07",
         "TCGA-G9-7036-10A-11R-1789-07", "TCGA-BH-A1ES-10A-11R-1789-07",
         "TCGA-BH-A1F0-10A-11R-1789-07", "TCGA-BH-A0BZ-02A-11R-1789-07",
         "TCGA-B6-A0WY-04A-11R-1789-07", "TCGA-BH-A1FG-04A-11R-1789-08",
         "TCGA-D8-A1JS-04A-11R-2089-08", "TCGA-AN-A0FN-11A-11R-8789-08",
         "TCGA-AR-A2LQ-12A-11R-8799-08", "TCGA-AR-A2LH-03A-11R-1789-07",
         "TCGA-BH-A1F8-04A-11R-5789-07", "TCGA-AR-A24T-04A-55R-1789-07",
         "TCGA-AO-A0J5-05A-11R-1789-07", "TCGA-BH-A0B4-11A-12R-1789-07",
         "TCGA-B6-A1KN-60A-13R-1789-07", "TCGA-AO-A0J5-01A-11R-1789-07",
         "TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
         "TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
         "TCGA-G9-6340-01A-11R-1789-07", "TCGA-G9-6340-11A-11R-1789-07")
         
# Select TP (Primary Solid Tumor) samples
TCGAquery_SampleTypes(bar,"TP")
[1] "TCGA-G9-6336-01A-11R-1789-07" "TCGA-AO-A0J5-01A-11R-1789-07" "TCGA-G9-6380-01A-11R-1789-07"
[4] "TCGA-G9-6340-01A-11R-1789-07"

# Select NB (Blood Derived Normal) samples
TCGAquery_SampleTypes(bar,"NB")
[1] "TCGA-G9-7036-10A-11R-1789-07" "TCGA-BH-A1ES-10A-11R-1789-07" "TCGA-BH-A1F0-10A-11R-1789-07"
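
A common follow-up to this kind of filtering is keeping only patients that have both a tumor and a matched normal sample. The sketch below does this in base R from the barcode structure alone (the first 12 characters identify the patient, characters 14-15 the sample type); it is an illustration of the idea on a handful of barcodes, not a replacement for the TCGAbiolinks helpers.

```r
# A few barcodes taken from the example vector above
barcodes <- c("TCGA-G9-6336-01A-11R-1789-07", "TCGA-G9-6336-11A-11R-1789-07",
              "TCGA-G9-6380-11A-11R-1789-07", "TCGA-G9-6380-01A-11R-1789-07",
              "TCGA-AO-A0J5-01A-11R-1789-07", "TCGA-G9-7336-11A-11R-1789-07")

# Split each barcode into patient ID and sample type code
patient <- substr(barcodes, 1, 12)
code <- substr(barcodes, 14, 15)

# Patients with a primary tumor (01) and a solid tissue normal (11)
has_tumor <- unique(patient[code == "01"])
has_normal <- unique(patient[code == "11"])
paired <- intersect(has_tumor, has_normal)

# Barcodes belonging to the paired patients
paired_barcodes <- barcodes[patient %in% paired & code %in% c("01", "11")]
print(paired)
print(paired_barcodes)
```

Only TCGA-G9-6336 and TCGA-G9-6380 have both sample types here, so the other two barcodes are dropped.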
