cBioPortal 数据库 API 使用

前言

虽然 cBioPortal 数据库提供了很多交互式可视化图表来展示和探索不同组学癌症数据集中的基因组信息。但是，我们想要进行更深入的分析，就需要自己下载数据并进行个性化分析

cBioPortal 并没有一键式批量下载所有数据集的功能，只能选择对应的研究（study）一步步下载。当我们需要分析的数据集涉及多种癌型或想要进行泛癌分析时，这种方式还是偏繁琐。

因此，cBioPortal 为我们提供了 REST API 接口，我们可以通过代码来进行批量下载。接口使用的是 Swagger/OpenAPI 规范，该规范主要是用于描述 REST API，既可以作为文档给开发者阅读，也可以让机器根据这个文档自动生成客户端代码等

API 文档的链接地址：https://www.cbioportal.org/api/api-docs

文档是 json 格式的，里面详细描述了各函数及参数的格式和使用方式

使用这个 API 的方式较多，包括 R 和 Python：

R

cBioPortalData: 推荐使用这个包
rapiclient: 也可以用这个包，解析 API 文档

library(rapiclient)
client <- get_api(url = "https://www.cbioportal.org/api/api-docs")

CGDSR：不推荐使用，将会废弃

Python

bravado

from bravado.client import SwaggerClient
cbioportal = SwaggerClient.from_url(
    'https://www.cbioportal.org/api/api-docs',
    config={ 
        "validate_requests": False,
        "validate_responses": False
    }
)

我们分别介绍 cBioPortalData 包和 bravado 包的使用。如果你使用的是其他编程语言，只要使用对应的能解析 OpenAPI 规范的 API 文档的包就行

cBioPortalData

安装导入

BiocManager::install("cBioPortalData")

library(cBioPortalData)

1. 数据结构

cBioPortalData 会将每一个 study 保存为一个 MultiAssayExperiment 结构的对象（S4）

例如，我们使用 cBioDataPack 函数来下载和解析 luad_tcga 所包含的所有数据

luad <- cBioDataPack(cancer_study_id = "luad_tcga", ask = FALSE)

而 luad 就是一个 MultiAssayExperiment 对象

> class(luad)
[1] "MultiAssayExperiment"
attr(,"package")
[1] "MultiAssayExperiment"

我们可以查看该对象所包含的 slot

> slotNames(luad)
[1] "ExperimentList" "colData"        "sampleMap"      "drops"          "metadata"

其中，最重要的是

ExperimentList: 包含了所有的实验检测的数据，如突变、拷贝数、表达等

> experiments(luad)
ExperimentList class object of length 17:
 [1] cna_hg19.seg: RaggedExperiment with 81799 rows and 518 columns
 [2] CNA: SummarizedExperiment with 24776 rows and 516 columns
 [3] expression_median: SummarizedExperiment with 17814 rows and 32 columns
 [4] linear_CNA: SummarizedExperiment with 24776 rows and 516 columns
 [5] methylation_hm27_normals: SummarizedExperiment with 1788 rows and 24 columns
 [6] methylation_hm27: SummarizedExperiment with 1788 rows and 126 columns
 [7] methylation_hm450_normals: SummarizedExperiment with 16556 rows and 32 columns
 [8] methylation_hm450: SummarizedExperiment with 16556 rows and 460 columns
 [9] mRNA_median_all_sample_Zscores: SummarizedExperiment with 17814 rows and 32 columns
 [10] mRNA_median_Zscores: SummarizedExperiment with 16617 rows and 32 columns
 [11] mutations_extended: RaggedExperiment with 72541 rows and 230 columns
 [12] mutations_mskcc: RaggedExperiment with 72541 rows and 230 columns
 [13] RNA_Seq_v2_expression_median: SummarizedExperiment with 20531 rows and 517 columns
 [14] RNA_Seq_v2_mRNA_median_all_sample_Zscores: SummarizedExperiment with 20531 rows and 517 columns
 [15] RNA_Seq_v2_mRNA_median_Zscores: SummarizedExperiment with 20440 rows and 517 columns
 [16] rppa_Zscores: SummarizedExperiment with 222 rows and 365 columns
 [17] rppa: SummarizedExperiment with 223 rows and 365 columns

colData: 样本信息

> colData(luad)[1:4, 1:5]
DataFrame with 4 rows and 5 columns
               PATIENT_ID       SAMPLE_ID        OTHER_SAMPLE_ID SPECIMEN_CURRENT_WEIGHT DAYS_TO_COLLECTION
                                                    
TCGA-05-4244 TCGA-05-4244 TCGA-05-4244-01 bac0b02d-ac3b-4784-b..         [Not Available]    [Not Available]
TCGA-05-4249 TCGA-05-4249 TCGA-05-4249-01 80f196fe-1eaf-40cb-a..         [Not Available]    [Not Available]
TCGA-05-4250 TCGA-05-4250 TCGA-05-4250-01 8f274178-7a8e-46b6-8..         [Not Available]    [Not Available]
TCGA-05-4382 TCGA-05-4382 TCGA-05-4382-01 cce6d71f-369e-467f-b..         [Not Available]    [Not Available]

sampleMap: 样本与实验数据之间的对应关系

> sampleMap(luad)
DataFrame with 5029 rows and 3 columns
            assay      primary         colname
                
1    cna_hg19.seg TCGA-05-4244 TCGA-05-4244-01
2    cna_hg19.seg TCGA-05-4249 TCGA-05-4249-01
3    cna_hg19.seg TCGA-05-4250 TCGA-05-4250-01
4    cna_hg19.seg TCGA-05-4382 TCGA-05-4382-01
5    cna_hg19.seg TCGA-05-4384 TCGA-05-4384-01
...           ...          ...             ...
5025         rppa TCGA-NJ-A55O TCGA-NJ-A55O-01
5026         rppa TCGA-NJ-A55R TCGA-NJ-A55R-01
5027         rppa TCGA-NJ-A7XG TCGA-NJ-A7XG-01
5028         rppa TCGA-O1-A52J TCGA-O1-A52J-01
5029         rppa TCGA-S2-AA1A TCGA-S2-AA1A-01

如何获取实验数据呢？方法有很多种，用 assays 可以返回所有实验数据列表

> exp.all <- assays(luad)
> names(exp.all)
 [1] "cna_hg19.seg"                              "CNA"                                      
 [3] "expression_median"                         "linear_CNA"                               
 [5] "methylation_hm27_normals"                  "methylation_hm27"                         
 [7] "methylation_hm450_normals"                 "methylation_hm450"                        
 [9] "mRNA_median_all_sample_Zscores"            "mRNA_median_Zscores"                      
[11] "mutations_extended"                        "mutations_mskcc"                          
[13] "RNA_Seq_v2_expression_median"              "RNA_Seq_v2_mRNA_median_all_sample_Zscores"
[15] "RNA_Seq_v2_mRNA_median_Zscores"            "rppa_Zscores"                             
[17] "rppa"

获取 CNV 数据

> exp.all$CNA[1:5, 1:5]
        TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3                -1              -1              -1               0               0
ACTRT2               -1              -1              -1               0               0
AGRN                 -1              -1              -1               0               0
ANKRD65              -1              -1              -1               0               0
ATAD3A               -1              -1              -1               0               0

或者，使用 assay 函数并传递实验名称

> assay(luad, "CNA")[1:5, 1:5]
        TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3                -1              -1              -1               0               0
ACTRT2               -1              -1              -1               0               0
AGRN                 -1              -1              -1               0               0
ANKRD65              -1              -1              -1               0               0
ATAD3A               -1              -1              -1               0               0

具体的就不在这介绍了，我们重点关注的是数据的下载和解析

2. API

上面我们介绍的是直接下载和解析一个 study 的数据，但有时，我们关注的并不是一个 study 里面的数据，而是想看某一类型的数据在泛癌中的表现情况。

例如，想要统计所有 TCGA 的 study 中 CNA 的变异情况，如果下载所有的 study 数据然后提取对应的 CNA 数据，则会显得很麻烦很慢，所以我们要使用 REST API 来获取特定类型的数据

首先，初始化 API

cbio <- cBioPortal()

> class(cbio)
[1] "cBioPortal"
attr(,"package")
[1] "cBioPortalData"

使用 getStudies 获取所有 study 的 ID

studies <- getStudies(cbio)

> head(studies)
# A tibble: 6 × 12
  name     description     publicStudy pmid   citation groups status importDate allSampleCount studyId cancerTypeId
                                                            
1 Pan-Lun… "Whole-exome s… TRUE        27158… TCGA, N… ""          0 2021-04-0…           1144 nsclc_… nsclc       
2 Head an… "TCGA Head and… TRUE        NA     NA       "PUBL…      0 2021-04-2…            530 hnsc_t… hnsc        
3 Breast … "Whole-exome s… TRUE        26451… TCGA, C… "PUBL…      0 2021-04-2…            817 brca_t… brca        
4 Ovarian… "Whole exome s… TRUE        21720… TCGA, N… "PUBL…      0 2021-04-2…            489 ov_tcg… hgsoc       
5 Uterine… "Whole exome s… TRUE        23636… TCGA, N… "PUBL…      0 2021-04-2…            373 ucec_t… ucec        
6 Bladder… "Whole-exome s… TRUE        28988… Roberts… "PUBL…      0 2021-04-2…            413 blca_t… blca        
# … with 1 more variable: referenceGenome

或者，使用 cBioPortal 的对象方法

resp <- cbio$getAllStudiesUsingGET()

该方法返回的是 response 类型的对象，我们使用 httr::content 函数来解析

parsedResponse <- httr::content(resp)

返回的是列表，可以计算 study 的数量

> length(parsedResponse)
[1] 318
> dim(studies)
[1] 318  12

统计这些 study 所涉及的癌型和总样本数

> length(unique(studies$cancerTypeId))
[1] 94
> sum(studies$allSampleCount)
[1] 133449

你可能会有一个疑问，我是怎么知道 getStudies 函数的，我们可以使用 ls 来列出包中所有的函数

> ls("package:cBioPortalData")
 [1] "allSamples"            "cBioCache"             "cBioDataPack"          "cBioPortal"           
 [5] "cBioPortalData"        "clinicalData"          "downloadStudy"         "genePanelMolecular"   
 [9] "genePanels"            "geneTable"             "getDataByGenePanel"    "getGenePanel"         
[13] "getGenePanelMolecular" "getSampleInfo"         "getStudies"            "loadStudy"            
[17] "molecularData"         "molecularProfiles"     "mutationData"          "removeDataCache"      
[21] "removePackCache"       "sampleLists"           "samplesInSampleLists"  "searchOps"            
[25] "setCache"              "studiesTable"          "untarStudy"

例如，获取 sampleListId

> sampleLists(cbio, studyId = "luad_tcga")
# A tibble: 11 × 5
   category                                      name             description              sampleListId     studyId
                                                                                          
 1 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 2 all_cases_with_mutation_data                  Samples with mu… Samples with mutation d… luad_tcga_seque… luad_t…
 3 all_cases_with_rppa_data                      Samples with pr… Samples protein data (R… luad_tcga_rppa   luad_t…
 4 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 5 all_cases_with_mutation_and_cna_data          Samples with mu… Samples with mutation a… luad_tcga_cnaseq luad_t…
 6 all_cases_with_mrna_array_data                Samples with mR… Samples with mRNA expre… luad_tcga_mrna   luad_t…
 7 all_cases_with_methylation_data               Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
 8 all_cases_in_study                            All samples      All samples (586 sample… luad_tcga_all    luad_t…
 9 all_cases_with_cna_data                       Samples with CN… Samples with CNA data (… luad_tcga_cna    luad_t…
10 all_cases_with_mrna_rnaseq_data               Samples with mR… Samples with mRNA expre… luad_tcga_rna_s… luad_t…
11 all_cases_with_mutation_and_cna_and_mrna_data Complete samples Samples with mutation, … luad_tcga_3way_… luad_t…

根据 sampleListId 获取样本名称

> samplesInSampleLists(cbio, sampleListIds = c("luad_tcga_cna", "luad_tcga_mrna"))
CharacterList of length 2
[["luad_tcga_cna"]] TCGA-05-4384-01 TCGA-05-4390-01 TCGA-05-4425-01 ... TCGA-80-5611-01 TCGA-83-5908-01
[["luad_tcga_mrna"]] TCGA-35-3615-01 TCGA-44-2655-01 TCGA-44-2656-01 ... TCGA-44-3919-01 TCGA-44-4112-01

获取分子谱 molecularProfileId

> molecularProfiles(cbio, studyId = "luad_tcga")$molecularProfileId
 [1] "luad_tcga_rppa"                                      "luad_tcga_rppa_Zscores"                             
 [3] "luad_tcga_gistic"                                    "luad_tcga_mrna"                                     
 [5] "luad_tcga_mrna_median_Zscores"                       "luad_tcga_rna_seq_v2_mrna"                          
 [7] "luad_tcga_rna_seq_v2_mrna_median_Zscores"            "luad_tcga_linear_CNA"                               
 [9] "luad_tcga_methylation_hm27"                          "luad_tcga_methylation_hm450"                        
[11] "luad_tcga_mutations"                                 "luad_tcga_rna_seq_v2_mrna_median_all_sample_Zscores"
[13] "luad_tcga_mrna_median_all_sample_Zscores"

根据 molecularProfileId 获取对应的数据，需要传入相应的样本编号和基因 ID

exps <- molecularData(
  api = cbio,
  molecularProfileIds = c("luad_tcga_mrna"),
  entrezGeneIds = 1:1000,
  sampleIds = c("TCGA-35-3615-01", "TCGA-44-2655-01")
)

获取临床数据

> clinicalData(cbio, studyId = "luad_tcga")
# A tibble: 586 × 24
   uniqueSampleKey   uniquePatientKey   sampleId  patientId  studyId CANCER_TYPE  CANCER_TYPE_DET… FRACTION_GENOME…
                                                                           
 1 VENHQS0wNS00MjQ0… VENHQS0wNS00MjQ0O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.4565          
 2 VENHQS0wNS00MjQ1… VENHQS0wNS00MjQ1O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… NA              
 3 VENHQS0wNS00MjQ5… VENHQS0wNS00MjQ5O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2221          
 4 VENHQS0wNS00MjUw… VENHQS0wNS00MjUwO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2362          
 5 VENHQS0wNS00Mzgy… VENHQS0wNS00MzgyO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.0854          
 6 VENHQS0wNS00Mzg0… VENHQS0wNS00Mzg0O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.0661          
 7 VENHQS0wNS00Mzg5… VENHQS0wNS00Mzg5O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.4579          
 8 VENHQS0wNS00Mzkw… VENHQS0wNS00MzkwO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.3056          
 9 VENHQS0wNS00Mzk1… VENHQS0wNS00Mzk1O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.5598          
10 VENHQS0wNS00Mzk2… VENHQS0wNS00Mzk2O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2340          
# … with 576 more rows, and 16 more variables: IS_FFPE , LONGEST_DIMENSION , ONCOTREE_CODE ,
#   OTHER_SAMPLE_ID , PATHOLOGY_REPORT_FILE_NAME , PATHOLOGY_REPORT_UUID , SAMPLE_TYPE ,
#   SAMPLE_TYPE_ID , SHORTEST_DIMENSION , SOMATIC_STATUS , SPECIMEN_SECOND_LONGEST_DIMENSION ,
#   VIAL_NUMBER , MUTATION_COUNT , DAYS_TO_COLLECTION , OCT_EMBEDDED ,
#   SAMPLE_INITIAL_WEIGHT

3. 可视化

3.1 绘制 K-M 生存曲线

clinical <- colData(luad)
clinical[which(clinical$OS_MONTHS == "[Not Available]"), "OS_MONTHS"] <- NA
clinical$OS_MONTHS <- as.numeric(clinical$OS_MONTHS)

library(survival)
library(survminer)

fit <- survfit(
  Surv(OS_MONTHS, as.numeric(substr(OS_STATUS, 1, 1))) ~ SEX,
  data = clinical
)
ggsurvplot(fit, data = clinical, risk.table = TRUE)

3.2 展示样本数最多的 20 种癌型

library(tidyverse)

cancerTypeCounts <-                                
  studies %>%
  filter(cancerTypeId != "mixed") %>%
  group_by(cancerTypeId) %>%                       
  summarise(totalSamples=sum(allSampleCount)) %>%  
  arrange(desc(totalSamples)) %>%                  
  top_n(20) %>%
  mutate(cancerTypeId = toupper(cancerTypeId))

cancerTypeCounts %>%
  mutate(cancerTypeId = factor(cancerTypeId, levels = rev(cancerTypeId))) %>%
  ggplot(aes(x = cancerTypeId, fill = cancerTypeId)) +
  geom_col(aes(y = totalSamples), show.legend = FALSE) +
  coord_flip() +
  theme_classic()

3.3 展示突变频率最高的 20 基因

# 获取 luad_tcga 项目的所有突变数据
mutation <- cbio$getMutationsInMolecularProfileBySampleListIdUsingGET(
  molecularProfileId='luad_tcga_mutations',
  sampleListId='luad_tcga_all',
  entrezGeneId
  projection='DETAILED'
)
# 提取基因 Symbol 和样本 ID
res <- httr::content(mutation)
res.mat <- do.call(rbind, lapply(res, function (x) {
  return(data.frame(
      gene = x$gene$hugoGeneSymbol,
      sample = x$sampleId)
    )}
  )
)
# 绘制柱状图
group_by(res.mat, gene) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  top_n(20) %>%
  mutate(gene = factor(gene, levels = gene)) %>%
  ggplot(aes(gene, count)) +
  geom_col(fill = rainbow(20), show.legend = FALSE) +
  theme_bw() +
  theme(
    axis.text.x = element_text(angle = 45, vjust = 0.7),
    panel.border = element_blank(),panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),axis.line = element_line(colour = "black")
  )

Python

在 Python 中，我们使用 bravado 库来连接 Open API/Swagger 规范的 REST API，使用这种规范的数据库还蛮多的，像 OncoKB、GO、Genome Nexus 等

安装

pip install --upgrade bravado

1. 基本使用

导入并创建对象

In [1]: from bravado.client import SwaggerClient

In [2]: cbioportal = SwaggerClient.from_url(
   ...:     'https://www.cbioportal.org/api/api-docs',
   ...:     config={
   ...:         "validate_requests": False,
   ...:         "validate_responses": False
   ...:     }
   ...: )

In [3]: type(cbioportal)
Out[3]: bravado.client.SwaggerClient

我们可以使用 Tab 键智能提示对象方法

也可以使用 dir 函数来列出所有属性

比如，我们想获取所有癌症类型，选择 Cancer_Types，然后再使用 Tab 键提示

可以看到两个对象方法，一个用于获取所有癌型，一个用于获取指定癌型

再使用 ? 来获取函数的文档

我们可以看到，getCancerTypeUsingGET 函数需要传递一个参数 cancerTypeId，而且传入的是字符串类型，而返回的是一个 bravado.http_future.HTTPFuture 对象，状态码为 200 表示查询成功

同样的，对于 getAllCancerTypesUsingGET 函数

有 5 个可选的参数，有相应的默认值，可以对返回结果排序或进行分页访问等

对于返回的 HTTPFuture 对象，我们需要调用 response() 方法来获取返回结果

In [4]: res = cbioportal.Cancer_Types.getAllCancerTypesUsingGET().response()

res 包含 3 个属性

result 属性保存了我们查询的所有结果列表

In [5]: type(res.result)
Out[5]: list
In [6]: res.result
Out[6]: [TypeOfCancer(cancerTypeId='aa', dedicatedColor='LightYellow', name='Aggressive Angiomyxoma', parent='soft_tissue', shortName='AA'),
 TypeOfCancer(cancerTypeId='aastr', dedicatedColor='Gray', name='Anaplastic Astrocytoma', parent='difg', shortName='AASTR'),
 TypeOfCancer(cancerTypeId='abc', dedicatedColor='LimeGreen', name='Activated B-cell Type', parent='dlbclnos', shortName='ABC'),
 TypeOfCancer(cancerTypeId='abl', dedicatedColor='LightSalmon', name='Acute Basophilic Leukemia', parent='amlnos', shortName='ABL'),
 TypeOfCancer(cancerTypeId='aca', dedicatedColor='Purple', name='Adrenocortical Adenoma', parent='adrenal_gland', shortName='ACA'),
 TypeOfCancer(cancerTypeId='acbc', dedicatedColor='HotPink', name='Adenoid Cystic Breast Cancer', parent='brca', shortName='ACBC'),
 ...]

使用 dir 来获取癌型类所包含的字段

In [7]: cancer_types = res.result

In [8]: len(cancer_types)
Out[8]: 869

In [9]: dir(cancer_types[0])
Out[9]: ['cancerTypeId', 'dedicatedColor', 'name', 'parent', 'shortName']

我们定义一个函数，用于将返回结果转换为 pd.DataFrame

In [10]: import pandas as pd
    ...: from collections import defaultdict

In [11]: def DataFrame(res):
    ...:     res_dict = defaultdict(list)
    ...:     for obj in res:
    ...:         for attr in dir(obj):
    ...:             res_dict[attr].append(obj[attr])
    ...:     return pd.DataFrame.from_dict(res_dict)
    ...:

In [14]: DataFrame(cancer_types)
Out[14]: 
    cancerTypeId dedicatedColor                                  name         parent shortName
0             aa    LightYellow                Aggressive Angiomyxoma    soft_tissue        AA
1          aastr           Gray                Anaplastic Astrocytoma           difg     AASTR
2            abc      LimeGreen                 Activated B-cell Type       dlbclnos       ABC
3            abl    LightSalmon             Acute Basophilic Leukemia         amlnos       ABL
4            aca         Purple                Adrenocortical Adenoma  adrenal_gland       ACA
..           ...            ...                                   ...            ...       ...
864         wdls    LightYellow       Well-Differentiated Liposarcoma           lipo      WDLS
865         wdtc           Teal    Well-Differentiated Thyroid Cancer        thyroid      WDTC
866           wm      LimeGreen         Waldenstrom Macroglobulinemia            lpl        WM
867        wpscc           Blue  Warty Penile Squamous Cell Carcinoma           pscc     WPSCC
868           wt         Orange                          Wilms' Tumor         kidney        WT

[869 rows x 5 columns]

类似地，我们可以获取所有 study

res = cbioportal.Studies.getAllStudiesUsingGET().response()
studies = DataFrame(res.result)

获取基因信息

In [15]: res = cbioportal.Genes.getAllGenesUsingGET().response()

In [16]: gene_info = DataFrame(res.result)

In [17]: gene_info.query('hugoGeneSymbol == "TP53"')
Out[17]: 
        entrezGeneId geneticEntityId hugoGeneSymbol            type
113919          7157            None           TP53  protein-coding

获取样本信息

# 样本列表 ID
In [18]: res = cbioportal.Sample_Lists.getAllSampleListsInStudyUsingGET(studyId='acc_tcga').response()
    ...: sample_list = DataFrame(res.result)

# 所有样本
In [19]: res = cbioportal.Samples.getAllSamplesInStudyUsingGET(studyId='acc_tcga').response()
    ...: acc_samples = DataFrame(res.result)

获取临床信息

In [20]: res = cbioportal.Clinical_Data.getAllClinicalDataInStudyUsingGET(studyId='acc_tcga').response()
    ...: clin_info = DataFrame(res.result)

In [21]: clin_info.head()
Out[21]: 
  clinicalAttribute      clinicalAttributeId  ...                   uniqueSampleKey                     value
0              None              CANCER_TYPE  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh  Adrenocortical Carcinoma
1              None     CANCER_TYPE_DETAILED  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh  Adrenocortical Carcinoma
2              None       DAYS_TO_COLLECTION  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh                      4691
3              None  FRACTION_GENOME_ALTERED  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh                    0.0585
4              None                  IS_FFPE  ...  VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh                        NO

[5 rows x 8 columns]

这些临床信息以长表的信息存储的，我们将其转换为宽表

In [22]: clin_long = clin_info[['clinicalAttributeId', 'sampleId', 'value']]

In [23]: clin_table = clin_long.pivot(index='clinicalAttributeId', columns='sampleId', values='value')

In [24]: clin_table.head()
Out[24]: 
sampleId                          TCGA-OR-A5J1-01           TCGA-OR-A5J2-01  ...           TCGA-PK-A5HB-01           TCGA-PK-A5HC-01
clinicalAttributeId                                                          ...                                                    
CANCER_TYPE              Adrenocortical Carcinoma  Adrenocortical Carcinoma  ...  Adrenocortical Carcinoma  Adrenocortical Carcinoma
CANCER_TYPE_DETAILED     Adrenocortical Carcinoma  Adrenocortical Carcinoma  ...  Adrenocortical Carcinoma  Adrenocortical Carcinoma
DAYS_TO_COLLECTION                           4691                      3124  ...                      3385                       359
FRACTION_GENOME_ALTERED                    0.0585                    0.4033  ...                    0.9279                    0.2901
IS_FFPE                                        NO                        NO  ...                        NO                        NO

[5 rows x 92 columns]

获取突变信息

先获取突变的分子谱 ID，一般是 studyId 加上 _mutations 后缀

In [25]: res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId='acc_tcga').response()
    ...: MP = DataFrame(res.result)

In [26]: MP.molecularProfileId
Out[26]: 
0                                        acc_tcga_rppa
1                                acc_tcga_rppa_Zscores
2                                      acc_tcga_gistic
3                             acc_tcga_rna_seq_v2_mrna
4              acc_tcga_rna_seq_v2_mrna_median_Zscores
5                                  acc_tcga_linear_CNA
6                           acc_tcga_methylation_hm450
7                                   acc_tcga_mutations
8    acc_tcga_rna_seq_v2_mrna_median_all_sample_Zsc...
Name: molecularProfileId, dtype: object

如果要获取所有样本的突变，可以传递以 _all 为后缀的 sampleListId

In [27]: studyId = 'acc_tcga'
    ...: res = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
    ...:     molecularProfileId = studyId + "_mutations",
    ...:     sampleListId = studyId + "_all",
    ...:     projection = 'DETAILED'
    ...: ).response()
    ...: mutations = DataFrame(res.result)
    
In [28]: mutations.head()
Out[28]: 
  alleleSpecificCopyNumber aminoAcidChange                                          center  ... validationStatus variantAllele variantType
0                     None            None                                   broad.mit.edu  ...         Untested             T         INS
1                     None            None                                    hgsc.bcm.edu  ...         Untested             T         SNP
2                     None            None  broad.mit.edu;ucsc.edu;bcgsc.ca;mdanderson.org  ...         Untested             T         SNP
3                     None            None  broad.mit.edu;ucsc.edu;bcgsc.ca;mdanderson.org  ...         Untested             T         SNP
4                     None            None                                   broad.mit.edu  ...         Untested             -         DEL

[5 rows x 40 columns]

获取 CNA 数据

获取拷贝数变异数据，需要注意一点，它的 molecularProfileId 并不是都以 gistic 结尾的，应该是流程不一样，有的是以 cna 结尾，例如

In [29]: res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId='nsclc_tcga_broad_2016').response()
    ...: MP = DataFrame(res.result)
    
In [30]: MP.molecularProfileId
Out[30]: 
0          nsclc_tcga_broad_2016_cna
1    nsclc_tcga_broad_2016_mutations
2       nsclc_tcga_broad_2016_fusion
Name: molecularProfileId, dtype: object

获取全部 CNA 数据

studyId = 'nsclc_tcga_broad_2016'
res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId=studyId).response()
MP = DataFrame(res.result)
molecularProfileId = MP.query('datatype == "DISCRETE"')['molecularProfileId'].values[0]

res = cbioportal.Discrete_Copy_Number_Alterations.getDiscreteCopyNumbersInMolecularProfileUsingGET(
    molecularProfileId = molecularProfileId,
    sampleListId = studyId + "_all",
    projection = 'DETAILED'
).response()
cna = DataFrame(res.result)
# 转换为宽矩阵型
cna_long = cna[['sampleId', 'entrezGeneId', 'alteration']]
cna_table = cna_long.pivot(index='entrezGeneId', columns='sampleId', values='alteration')

2. 可视化

获取完数据之后，可视化也就没什么难度了。但是我们还没介绍 Python 的可视化部分，这里就简单举个例子吧

我们使用的是 plotnine 模块，相当于 ggplo2 的 Python 扩展，语法用起来几乎是一样的，这样看起来也就不会有什么难度了

2.1 样本的突变和拷贝数

from plotnine import *

studyId = 'acc_tcga'
# 拷贝数
res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId=studyId).response()
MP = DataFrame(res.result)
molecularProfileId = MP.query('datatype == "DISCRETE"')['molecularProfileId'].values[0]

res = cbioportal.Discrete_Copy_Number_Alterations.getDiscreteCopyNumbersInMolecularProfileUsingGET(
    molecularProfileId = molecularProfileId,
    sampleListId = studyId + "_all",
    projection = 'DETAILED'
).response()
cna = DataFrame(res.result)

cna_long = cna[['sampleId', 'entrezGeneId', 'alteration']]
cna_table = cna_long.pivot(index='entrezGeneId', columns='sampleId', values='alteration')
# 突变
res = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
    molecularProfileId = studyId + "_mutations",
    sampleListId = studyId + "_all",
    projection = 'DETAILED'
).response()
mutations = DataFrame(res.result)

# 统计样本的突变情况
sample_mut_count = mutations.groupby('sampleId').apply(lambda x: x.shape[0]).to_frame()
sample_mut_count.columns = ['mutation']
sample_mut_count = sample_mut_count.reset_index()
# 统计样本的 cna 情况
sample_cna_count = cna_table.count().to_frame()
sample_cna_count.columns = ['cna']
sample_cna_count = sample_cna_count.reset_index()
# 合并数据
mut_cna = pd.merge(sample_mut_count, sample_cna_count)

绘制图形
(
    ggplot(mut_cna)
    + geom_point(aes(x='cna', y='mutation', colour='sampleId'), 
                 show_legend=False, alpha=0.7)
    + theme_bw()
)

样本的突变和拷贝数变异好像有那么点像倒数函数曲线

2.2 样本占比

from collections import Counter

res = cbioportal.Clinical_Data.getAllClinicalDataInStudyUsingGET(studyId='brca_tcga_pub').response()
clin_info = DataFrame(res.result)

clin_long = clin_info[['clinicalAttributeId', 'sampleId', 'value']]
clin_table = clin_long.pivot(index='clinicalAttributeId', columns='sampleId', values='value')
clin_table = clin_table.transpose()

df = clin_table[['ER_STATUS', 'HER2_STATUS', 'PR_STATUS']]
df = df.applymap(lambda x: x if x in ['Positive', 'Negative'] else 'NA')

(
    ggplot(df.melt()) 
    + geom_bar(aes('clinicalAttributeId', fill='value'))
    + scale_fill_manual(values=['#bebada', '#80b1d3', '#fb8072'])
    + theme_classic()
)

cBioPortal 数据库 API 使用

前言

R

Python

cBioPortalData

1. 数据结构

2. API

3. 可视化

3.1 绘制 K-M 生存曲线

3.2 展示样本数最多的 20 种癌型

3.3 展示突变频率最高的 20 基因

Python

1. 基本使用

获取基因信息

获取样本信息

获取临床信息

获取突变信息

获取 CNA 数据

2. 可视化

2.1 样本的突变和拷贝数

2.2 样本占比

你可能感兴趣的:(cBioPortal 数据库 API 使用)