前言
虽然 cBioPortal
数据库提供了很多交互式可视化图表来展示和探索不同组学癌症数据集中的基因组信息。但是,我们想要进行更深入的分析,就需要自己下载数据并进行个性化分析
cBioPortal
并没有一键式批量下载所有数据集的功能,只能选择对应的研究(study
)一步步下载。当我们需要分析的数据集涉及多种癌型或想要进行泛癌分析时,这种方式还是偏繁琐。
因此,cBioPortal
为我们提供了 REST API
接口,我们可以通过代码来进行批量下载。接口使用的是 Swagger/OpenAPI
规范,该规范主要是用于描述 REST API
,既可以作为文档给开发者阅读,也可以让机器根据这个文档自动生成客户端代码等
API
文档的链接地址:https://www.cbioportal.org/api/api-docs
文档是 json
格式的,里面详细描述了各函数及参数的格式和使用方式
使用这个 API
的方式较多,包括 R
和 Python
:
R
-
cBioPortalData
: 推荐使用这个包 -
rapiclient
: 也可以用这个包,解析API
文档
library(rapiclient)
client <- get_api(url = "https://www.cbioportal.org/api/api-docs")
-
CGDSR
:不推荐使用,将会废弃
Python
bravado
from bravado.client import SwaggerClient
cbioportal = SwaggerClient.from_url(
'https://www.cbioportal.org/api/api-docs',
config={
"validate_requests": False,
"validate_responses": False
}
)
我们分别介绍 cBioPortalData
包和 bravado
包的使用。如果你使用的是其他编程语言,只要使用对应的能解析 OpenAPI
规范的 API
文档的包就行
cBioPortalData
安装导入
BiocManager::install("cBioPortalData")
library(cBioPortalData)
1. 数据结构
cBioPortalData
会将每一个 study
保存为一个 MultiAssayExperiment
结构的对象(S4
)
例如,我们使用 cBioDataPack
函数来下载和解析 luad_tcga
所包含的所有数据
luad <- cBioDataPack(cancer_study_id = "luad_tcga", ask = FALSE)
而 luad
就是一个 MultiAssayExperiment
对象
> class(luad)
[1] "MultiAssayExperiment"
attr(,"package")
[1] "MultiAssayExperiment"
我们可以查看该对象所包含的 slot
> slotNames(luad)
[1] "ExperimentList" "colData" "sampleMap" "drops" "metadata"
其中,最重要的是
-
ExperimentList
: 包含了所有的实验检测的数据,如突变、拷贝数、表达等
> experiments(luad)
ExperimentList class object of length 17:
[1] cna_hg19.seg: RaggedExperiment with 81799 rows and 518 columns
[2] CNA: SummarizedExperiment with 24776 rows and 516 columns
[3] expression_median: SummarizedExperiment with 17814 rows and 32 columns
[4] linear_CNA: SummarizedExperiment with 24776 rows and 516 columns
[5] methylation_hm27_normals: SummarizedExperiment with 1788 rows and 24 columns
[6] methylation_hm27: SummarizedExperiment with 1788 rows and 126 columns
[7] methylation_hm450_normals: SummarizedExperiment with 16556 rows and 32 columns
[8] methylation_hm450: SummarizedExperiment with 16556 rows and 460 columns
[9] mRNA_median_all_sample_Zscores: SummarizedExperiment with 17814 rows and 32 columns
[10] mRNA_median_Zscores: SummarizedExperiment with 16617 rows and 32 columns
[11] mutations_extended: RaggedExperiment with 72541 rows and 230 columns
[12] mutations_mskcc: RaggedExperiment with 72541 rows and 230 columns
[13] RNA_Seq_v2_expression_median: SummarizedExperiment with 20531 rows and 517 columns
[14] RNA_Seq_v2_mRNA_median_all_sample_Zscores: SummarizedExperiment with 20531 rows and 517 columns
[15] RNA_Seq_v2_mRNA_median_Zscores: SummarizedExperiment with 20440 rows and 517 columns
[16] rppa_Zscores: SummarizedExperiment with 222 rows and 365 columns
[17] rppa: SummarizedExperiment with 223 rows and 365 columns
-
colData
: 样本信息
> colData(luad)[1:4, 1:5]
DataFrame with 4 rows and 5 columns
PATIENT_ID SAMPLE_ID OTHER_SAMPLE_ID SPECIMEN_CURRENT_WEIGHT DAYS_TO_COLLECTION
TCGA-05-4244 TCGA-05-4244 TCGA-05-4244-01 bac0b02d-ac3b-4784-b.. [Not Available] [Not Available]
TCGA-05-4249 TCGA-05-4249 TCGA-05-4249-01 80f196fe-1eaf-40cb-a.. [Not Available] [Not Available]
TCGA-05-4250 TCGA-05-4250 TCGA-05-4250-01 8f274178-7a8e-46b6-8.. [Not Available] [Not Available]
TCGA-05-4382 TCGA-05-4382 TCGA-05-4382-01 cce6d71f-369e-467f-b.. [Not Available] [Not Available]
-
sampleMap
: 样本与实验数据之间的对应关系
> sampleMap(luad)
DataFrame with 5029 rows and 3 columns
assay primary colname
1 cna_hg19.seg TCGA-05-4244 TCGA-05-4244-01
2 cna_hg19.seg TCGA-05-4249 TCGA-05-4249-01
3 cna_hg19.seg TCGA-05-4250 TCGA-05-4250-01
4 cna_hg19.seg TCGA-05-4382 TCGA-05-4382-01
5 cna_hg19.seg TCGA-05-4384 TCGA-05-4384-01
... ... ... ...
5025 rppa TCGA-NJ-A55O TCGA-NJ-A55O-01
5026 rppa TCGA-NJ-A55R TCGA-NJ-A55R-01
5027 rppa TCGA-NJ-A7XG TCGA-NJ-A7XG-01
5028 rppa TCGA-O1-A52J TCGA-O1-A52J-01
5029 rppa TCGA-S2-AA1A TCGA-S2-AA1A-01
如何获取实验数据呢?方法有很多种,用 assays
可以返回所有实验数据列表
> exp.all <- assays(luad)
> names(exp.all)
[1] "cna_hg19.seg" "CNA"
[3] "expression_median" "linear_CNA"
[5] "methylation_hm27_normals" "methylation_hm27"
[7] "methylation_hm450_normals" "methylation_hm450"
[9] "mRNA_median_all_sample_Zscores" "mRNA_median_Zscores"
[11] "mutations_extended" "mutations_mskcc"
[13] "RNA_Seq_v2_expression_median" "RNA_Seq_v2_mRNA_median_all_sample_Zscores"
[15] "RNA_Seq_v2_mRNA_median_Zscores" "rppa_Zscores"
[17] "rppa"
获取 CNV
数据
> exp.all$CNA[1:5, 1:5]
TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3 -1 -1 -1 0 0
ACTRT2 -1 -1 -1 0 0
AGRN -1 -1 -1 0 0
ANKRD65 -1 -1 -1 0 0
ATAD3A -1 -1 -1 0 0
或者,使用 assay
函数并传递实验名称
> assay(luad, "CNA")[1:5, 1:5]
TCGA-05-4244-01 TCGA-05-4249-01 TCGA-05-4250-01 TCGA-05-4382-01 TCGA-05-4384-01
ACAP3 -1 -1 -1 0 0
ACTRT2 -1 -1 -1 0 0
AGRN -1 -1 -1 0 0
ANKRD65 -1 -1 -1 0 0
ATAD3A -1 -1 -1 0 0
具体的就不在这介绍了,我们重点关注的是数据的下载和解析
2. API
上面我们介绍的是直接下载和解析一个 study
的数据,但有时,我们关注的并不是一个 study
里面的数据,而是想看某一类型的数据在泛癌中的表现情况。
例如,想要统计所有 TCGA
的 study
中 CNA
的变异情况,如果下载所有的 study
数据然后提取对应的 CNA
数据,则会显得很麻烦很慢,所以我们要使用 REST API
来获取特定类型的数据
首先,初始化 API
cbio <- cBioPortal()
> class(cbio)
[1] "cBioPortal"
attr(,"package")
[1] "cBioPortalData"
使用 getStudies
获取所有 study
的 ID
studies <- getStudies(cbio)
> head(studies)
# A tibble: 6 × 12
name description publicStudy pmid citation groups status importDate allSampleCount studyId cancerTypeId
1 Pan-Lun… "Whole-exome s… TRUE 27158… TCGA, N… "" 0 2021-04-0… 1144 nsclc_… nsclc
2 Head an… "TCGA Head and… TRUE NA NA "PUBL… 0 2021-04-2… 530 hnsc_t… hnsc
3 Breast … "Whole-exome s… TRUE 26451… TCGA, C… "PUBL… 0 2021-04-2… 817 brca_t… brca
4 Ovarian… "Whole exome s… TRUE 21720… TCGA, N… "PUBL… 0 2021-04-2… 489 ov_tcg… hgsoc
5 Uterine… "Whole exome s… TRUE 23636… TCGA, N… "PUBL… 0 2021-04-2… 373 ucec_t… ucec
6 Bladder… "Whole-exome s… TRUE 28988… Roberts… "PUBL… 0 2021-04-2… 413 blca_t… blca
# … with 1 more variable: referenceGenome
或者,使用 cBioPortal
的对象方法
resp <- cbio$getAllStudiesUsingGET()
该方法返回的是 response
类型的对象,我们使用 httr::content
函数来解析
parsedResponse <- httr::content(resp)
返回的是列表,可以计算 study
的数量
> length(parsedResponse)
[1] 318
> dim(studies)
[1] 318 12
统计这些 study
所涉及的癌型和总样本数
> length(unique(studies$cancerTypeId))
[1] 94
> sum(studies$allSampleCount)
[1] 133449
你可能会有一个疑问,我是怎么知道 getStudies
函数的,我们可以使用 ls
来列出包中所有的函数
> ls("package:cBioPortalData")
[1] "allSamples" "cBioCache" "cBioDataPack" "cBioPortal"
[5] "cBioPortalData" "clinicalData" "downloadStudy" "genePanelMolecular"
[9] "genePanels" "geneTable" "getDataByGenePanel" "getGenePanel"
[13] "getGenePanelMolecular" "getSampleInfo" "getStudies" "loadStudy"
[17] "molecularData" "molecularProfiles" "mutationData" "removeDataCache"
[21] "removePackCache" "sampleLists" "samplesInSampleLists" "searchOps"
[25] "setCache" "studiesTable" "untarStudy"
例如,获取 sampleListId
> sampleLists(cbio, studyId = "luad_tcga")
# A tibble: 11 × 5
category name description sampleListId studyId
1 all_cases_with_methylation_data Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
2 all_cases_with_mutation_data Samples with mu… Samples with mutation d… luad_tcga_seque… luad_t…
3 all_cases_with_rppa_data Samples with pr… Samples protein data (R… luad_tcga_rppa luad_t…
4 all_cases_with_methylation_data Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
5 all_cases_with_mutation_and_cna_data Samples with mu… Samples with mutation a… luad_tcga_cnaseq luad_t…
6 all_cases_with_mrna_array_data Samples with mR… Samples with mRNA expre… luad_tcga_mrna luad_t…
7 all_cases_with_methylation_data Samples with me… Samples with methylatio… luad_tcga_methy… luad_t…
8 all_cases_in_study All samples All samples (586 sample… luad_tcga_all luad_t…
9 all_cases_with_cna_data Samples with CN… Samples with CNA data (… luad_tcga_cna luad_t…
10 all_cases_with_mrna_rnaseq_data Samples with mR… Samples with mRNA expre… luad_tcga_rna_s… luad_t…
11 all_cases_with_mutation_and_cna_and_mrna_data Complete samples Samples with mutation, … luad_tcga_3way_… luad_t…
根据 sampleListId
获取样本名称
> samplesInSampleLists(cbio, sampleListIds = c("luad_tcga_cna", "luad_tcga_mrna"))
CharacterList of length 2
[["luad_tcga_cna"]] TCGA-05-4384-01 TCGA-05-4390-01 TCGA-05-4425-01 ... TCGA-80-5611-01 TCGA-83-5908-01
[["luad_tcga_mrna"]] TCGA-35-3615-01 TCGA-44-2655-01 TCGA-44-2656-01 ... TCGA-44-3919-01 TCGA-44-4112-01
获取分子谱 molecularProfileId
> molecularProfiles(cbio, studyId = "luad_tcga")$molecularProfileId
[1] "luad_tcga_rppa" "luad_tcga_rppa_Zscores"
[3] "luad_tcga_gistic" "luad_tcga_mrna"
[5] "luad_tcga_mrna_median_Zscores" "luad_tcga_rna_seq_v2_mrna"
[7] "luad_tcga_rna_seq_v2_mrna_median_Zscores" "luad_tcga_linear_CNA"
[9] "luad_tcga_methylation_hm27" "luad_tcga_methylation_hm450"
[11] "luad_tcga_mutations" "luad_tcga_rna_seq_v2_mrna_median_all_sample_Zscores"
[13] "luad_tcga_mrna_median_all_sample_Zscores"
根据 molecularProfileId
获取对应的数据,需要传入相应的样本编号和基因 ID
exps <- molecularData(
api = cbio,
molecularProfileIds = c("luad_tcga_mrna"),
entrezGeneIds = 1:1000,
sampleIds = c("TCGA-35-3615-01", "TCGA-44-2655-01")
)
获取临床数据
> clinicalData(cbio, studyId = "luad_tcga")
# A tibble: 586 × 24
uniqueSampleKey uniquePatientKey sampleId patientId studyId CANCER_TYPE CANCER_TYPE_DET… FRACTION_GENOME…
1 VENHQS0wNS00MjQ0… VENHQS0wNS00MjQ0O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.4565
2 VENHQS0wNS00MjQ1… VENHQS0wNS00MjQ1O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… NA
3 VENHQS0wNS00MjQ5… VENHQS0wNS00MjQ5O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2221
4 VENHQS0wNS00MjUw… VENHQS0wNS00MjUwO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2362
5 VENHQS0wNS00Mzgy… VENHQS0wNS00MzgyO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.0854
6 VENHQS0wNS00Mzg0… VENHQS0wNS00Mzg0O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.0661
7 VENHQS0wNS00Mzg5… VENHQS0wNS00Mzg5O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.4579
8 VENHQS0wNS00Mzkw… VENHQS0wNS00MzkwO… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.3056
9 VENHQS0wNS00Mzk1… VENHQS0wNS00Mzk1O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.5598
10 VENHQS0wNS00Mzk2… VENHQS0wNS00Mzk2O… TCGA-05-… TCGA-05-4… luad_t… Non-Small C… Lung Adenocarci… 0.2340
# … with 576 more rows, and 16 more variables: IS_FFPE , LONGEST_DIMENSION , ONCOTREE_CODE ,
# OTHER_SAMPLE_ID , PATHOLOGY_REPORT_FILE_NAME , PATHOLOGY_REPORT_UUID , SAMPLE_TYPE ,
# SAMPLE_TYPE_ID , SHORTEST_DIMENSION , SOMATIC_STATUS , SPECIMEN_SECOND_LONGEST_DIMENSION ,
# VIAL_NUMBER , MUTATION_COUNT , DAYS_TO_COLLECTION , OCT_EMBEDDED ,
# SAMPLE_INITIAL_WEIGHT
3. 可视化
3.1 绘制 K-M 生存曲线
clinical <- colData(luad)
clinical[which(clinical$OS_MONTHS == "[Not Available]"), "OS_MONTHS"] <- NA
clinical$OS_MONTHS <- as.numeric(clinical$OS_MONTHS)
library(survival)
library(survminer)
fit <- survfit(
Surv(OS_MONTHS, as.numeric(substr(OS_STATUS, 1, 1))) ~ SEX,
data = clinical
)
ggsurvplot(fit, data = clinical, risk.table = TRUE)
3.2 展示样本数最多的 20 种癌型
library(tidyverse)
cancerTypeCounts <-
studies %>%
filter(cancerTypeId != "mixed") %>%
group_by(cancerTypeId) %>%
summarise(totalSamples=sum(allSampleCount)) %>%
arrange(desc(totalSamples)) %>%
top_n(20) %>%
mutate(cancerTypeId = toupper(cancerTypeId))
cancerTypeCounts %>%
mutate(cancerTypeId = factor(cancerTypeId, levels = rev(cancerTypeId))) %>%
ggplot(aes(x = cancerTypeId, fill = cancerTypeId)) +
geom_col(aes(y = totalSamples), show.legend = FALSE) +
coord_flip() +
theme_classic()
3.3 展示突变频率最高的 20 基因
# 获取 luad_tcga 项目的所有突变数据
mutation <- cbio$getMutationsInMolecularProfileBySampleListIdUsingGET(
molecularProfileId='luad_tcga_mutations',
sampleListId='luad_tcga_all',
entrezGeneId
projection='DETAILED'
)
# 提取基因 Symbol 和样本 ID
res <- httr::content(mutation)
res.mat <- do.call(rbind, lapply(res, function (x) {
return(data.frame(
gene = x$gene$hugoGeneSymbol,
sample = x$sampleId)
)}
)
)
# 绘制柱状图
group_by(res.mat, gene) %>%
summarise(count = n()) %>%
arrange(desc(count)) %>%
top_n(20) %>%
mutate(gene = factor(gene, levels = gene)) %>%
ggplot(aes(gene, count)) +
geom_col(fill = rainbow(20), show.legend = FALSE) +
theme_bw() +
theme(
axis.text.x = element_text(angle = 45, vjust = 0.7),
panel.border = element_blank(),panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),axis.line = element_line(colour = "black")
)
Python
在 Python
中,我们使用 bravado
库来连接 Open API/Swagger
规范的 REST API
,使用这种规范的数据库还蛮多的,像 OncoKB
、GO
、Genome Nexus
等
安装
pip install --upgrade bravado
1. 基本使用
导入并创建对象
In [1]: from bravado.client import SwaggerClient
In [2]: cbioportal = SwaggerClient.from_url(
...: 'https://www.cbioportal.org/api/api-docs',
...: config={
...: "validate_requests": False,
...: "validate_responses": False
...: }
...: )
In [3]: type(cbioportal)
Out[3]: bravado.client.SwaggerClient
我们可以使用 Tab
键智能提示对象方法
也可以使用 dir
函数来列出所有属性
比如,我们想获取所有癌症类型,选择 Cancer_Types
,然后再使用 Tab
键提示
可以看到两个对象方法,一个用于获取所有癌型,一个用于获取指定癌型
再使用 ?
来获取函数的文档
我们可以看到,getCancerTypeUsingGET
函数需要传递一个参数 cancerTypeId
,而且传入的是字符串类型,而返回的是一个 bravado.http_future.HTTPFuture
对象,状态码为 200
表示查询成功
同样的,对于 getAllCancerTypesUsingGET
函数
有 5
个可选的参数,有相应的默认值,可以对返回结果排序或进行分页访问等
对于返回的 HTTPFuture
对象,我们需要调用 response()
方法来获取返回结果
In [4]: res = cbioportal.Cancer_Types.getAllCancerTypesUsingGET().response()
res
包含 3
个属性
result
属性保存了我们查询的所有结果列表
In [5]: type(res.result)
Out[5]: list
In [6]: res.result
Out[6]: [TypeOfCancer(cancerTypeId='aa', dedicatedColor='LightYellow', name='Aggressive Angiomyxoma', parent='soft_tissue', shortName='AA'),
TypeOfCancer(cancerTypeId='aastr', dedicatedColor='Gray', name='Anaplastic Astrocytoma', parent='difg', shortName='AASTR'),
TypeOfCancer(cancerTypeId='abc', dedicatedColor='LimeGreen', name='Activated B-cell Type', parent='dlbclnos', shortName='ABC'),
TypeOfCancer(cancerTypeId='abl', dedicatedColor='LightSalmon', name='Acute Basophilic Leukemia', parent='amlnos', shortName='ABL'),
TypeOfCancer(cancerTypeId='aca', dedicatedColor='Purple', name='Adrenocortical Adenoma', parent='adrenal_gland', shortName='ACA'),
TypeOfCancer(cancerTypeId='acbc', dedicatedColor='HotPink', name='Adenoid Cystic Breast Cancer', parent='brca', shortName='ACBC'),
...]
使用 dir
来获取癌型类所包含的字段
In [7]: cancer_types = res.result
In [8]: len(cancer_types)
Out[8]: 869
In [9]: dir(cancer_types[0])
Out[9]: ['cancerTypeId', 'dedicatedColor', 'name', 'parent', 'shortName']
我们定义一个函数,用于将返回结果转换为 pd.DataFrame
In [10]: import pandas as pd
...: from collections import defaultdict
In [11]: def DataFrame(res):
...: res_dict = defaultdict(list)
...: for obj in res:
...: for attr in dir(obj):
...: res_dict[attr].append(obj[attr])
...: return pd.DataFrame.from_dict(res_dict)
...:
In [14]: DataFrame(cancer_types)
Out[14]:
cancerTypeId dedicatedColor name parent shortName
0 aa LightYellow Aggressive Angiomyxoma soft_tissue AA
1 aastr Gray Anaplastic Astrocytoma difg AASTR
2 abc LimeGreen Activated B-cell Type dlbclnos ABC
3 abl LightSalmon Acute Basophilic Leukemia amlnos ABL
4 aca Purple Adrenocortical Adenoma adrenal_gland ACA
.. ... ... ... ... ...
864 wdls LightYellow Well-Differentiated Liposarcoma lipo WDLS
865 wdtc Teal Well-Differentiated Thyroid Cancer thyroid WDTC
866 wm LimeGreen Waldenstrom Macroglobulinemia lpl WM
867 wpscc Blue Warty Penile Squamous Cell Carcinoma pscc WPSCC
868 wt Orange Wilms' Tumor kidney WT
[869 rows x 5 columns]
类似地,我们可以获取所有 study
res = cbioportal.Studies.getAllStudiesUsingGET().response()
studies = DataFrame(res.result)
获取基因信息
In [15]: res = cbioportal.Genes.getAllGenesUsingGET().response()
In [16]: gene_info = DataFrame(res.result)
In [17]: gene_info.query('hugoGeneSymbol == "TP53"')
Out[17]:
entrezGeneId geneticEntityId hugoGeneSymbol type
113919 7157 None TP53 protein-coding
获取样本信息
# 样本列表 ID
In [18]: res = cbioportal.Sample_Lists.getAllSampleListsInStudyUsingGET(studyId='acc_tcga').response()
...: sample_list = DataFrame(res.result)
# 所有样本
In [19]: res = cbioportal.Samples.getAllSamplesInStudyUsingGET(studyId='acc_tcga').response()
...: acc_samples = DataFrame(res.result)
获取临床信息
In [20]: res = cbioportal.Clinical_Data.getAllClinicalDataInStudyUsingGET(studyId='acc_tcga').response()
...: clin_info = DataFrame(res.result)
In [21]: clin_info.head()
Out[21]:
clinicalAttribute clinicalAttributeId ... uniqueSampleKey value
0 None CANCER_TYPE ... VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh Adrenocortical Carcinoma
1 None CANCER_TYPE_DETAILED ... VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh Adrenocortical Carcinoma
2 None DAYS_TO_COLLECTION ... VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh 4691
3 None FRACTION_GENOME_ALTERED ... VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh 0.0585
4 None IS_FFPE ... VENHQS1PUi1BNUoxLTAxOmFjY190Y2dh NO
[5 rows x 8 columns]
这些临床信息以长表的信息存储的,我们将其转换为宽表
In [22]: clin_long = clin_info[['clinicalAttributeId', 'sampleId', 'value']]
In [23]: clin_table = clin_long.pivot(index='clinicalAttributeId', columns='sampleId', values='value')
In [24]: clin_table.head()
Out[24]:
sampleId TCGA-OR-A5J1-01 TCGA-OR-A5J2-01 ... TCGA-PK-A5HB-01 TCGA-PK-A5HC-01
clinicalAttributeId ...
CANCER_TYPE Adrenocortical Carcinoma Adrenocortical Carcinoma ... Adrenocortical Carcinoma Adrenocortical Carcinoma
CANCER_TYPE_DETAILED Adrenocortical Carcinoma Adrenocortical Carcinoma ... Adrenocortical Carcinoma Adrenocortical Carcinoma
DAYS_TO_COLLECTION 4691 3124 ... 3385 359
FRACTION_GENOME_ALTERED 0.0585 0.4033 ... 0.9279 0.2901
IS_FFPE NO NO ... NO NO
[5 rows x 92 columns]
获取突变信息
先获取突变的分子谱 ID
,一般是 studyId
加上 _mutations
后缀
In [25]: res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId='acc_tcga').response()
...: MP = DataFrame(res.result)
In [26]: MP.molecularProfileId
Out[26]:
0 acc_tcga_rppa
1 acc_tcga_rppa_Zscores
2 acc_tcga_gistic
3 acc_tcga_rna_seq_v2_mrna
4 acc_tcga_rna_seq_v2_mrna_median_Zscores
5 acc_tcga_linear_CNA
6 acc_tcga_methylation_hm450
7 acc_tcga_mutations
8 acc_tcga_rna_seq_v2_mrna_median_all_sample_Zsc...
Name: molecularProfileId, dtype: object
如果要获取所有样本的突变,可以传递以 _all
为后缀的 sampleListId
In [27]: studyId = 'acc_tcga'
...: res = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
...: molecularProfileId = studyId + "_mutations",
...: sampleListId = studyId + "_all",
...: projection = 'DETAILED'
...: ).response()
...: mutations = DataFrame(res.result)
In [28]: mutations.head()
Out[28]:
alleleSpecificCopyNumber aminoAcidChange center ... validationStatus variantAllele variantType
0 None None broad.mit.edu ... Untested T INS
1 None None hgsc.bcm.edu ... Untested T SNP
2 None None broad.mit.edu;ucsc.edu;bcgsc.ca;mdanderson.org ... Untested T SNP
3 None None broad.mit.edu;ucsc.edu;bcgsc.ca;mdanderson.org ... Untested T SNP
4 None None broad.mit.edu ... Untested - DEL
[5 rows x 40 columns]
获取 CNA 数据
获取拷贝数变异数据,需要注意一点,它的 molecularProfileId
并不是都以 gistic
结尾的,应该是流程不一样,有的是以 cna
结尾,例如
In [29]: res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId='nsclc_tcga_broad_2016').response()
...: MP = DataFrame(res.result)
In [30]: MP.molecularProfileId
Out[30]:
0 nsclc_tcga_broad_2016_cna
1 nsclc_tcga_broad_2016_mutations
2 nsclc_tcga_broad_2016_fusion
Name: molecularProfileId, dtype: object
获取全部 CNA
数据
studyId = 'nsclc_tcga_broad_2016'
res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId=studyId).response()
MP = DataFrame(res.result)
molecularProfileId = MP.query('datatype == "DISCRETE"')['molecularProfileId'].values[0]
res = cbioportal.Discrete_Copy_Number_Alterations.getDiscreteCopyNumbersInMolecularProfileUsingGET(
molecularProfileId = molecularProfileId,
sampleListId = studyId + "_all",
projection = 'DETAILED'
).response()
cna = DataFrame(res.result)
# 转换为宽矩阵型
cna_long = cna[['sampleId', 'entrezGeneId', 'alteration']]
cna_table = cna_long.pivot(index='entrezGeneId', columns='sampleId', values='alteration')
2. 可视化
获取完数据之后,可视化也就没什么难度了。但是我们还没介绍 Python
的可视化部分,这里就简单举个例子吧
我们使用的是 plotnine
模块,相当于 ggplo2
的 Python
扩展,语法用起来几乎是一样的,这样看起来也就不会有什么难度了
2.1 样本的突变和拷贝数
from plotnine import *
studyId = 'acc_tcga'
# 拷贝数
res = cbioportal.Molecular_Profiles.getAllMolecularProfilesInStudyUsingGET(studyId=studyId).response()
MP = DataFrame(res.result)
molecularProfileId = MP.query('datatype == "DISCRETE"')['molecularProfileId'].values[0]
res = cbioportal.Discrete_Copy_Number_Alterations.getDiscreteCopyNumbersInMolecularProfileUsingGET(
molecularProfileId = molecularProfileId,
sampleListId = studyId + "_all",
projection = 'DETAILED'
).response()
cna = DataFrame(res.result)
cna_long = cna[['sampleId', 'entrezGeneId', 'alteration']]
cna_table = cna_long.pivot(index='entrezGeneId', columns='sampleId', values='alteration')
# 突变
res = cbioportal.Mutations.getMutationsInMolecularProfileBySampleListIdUsingGET(
molecularProfileId = studyId + "_mutations",
sampleListId = studyId + "_all",
projection = 'DETAILED'
).response()
mutations = DataFrame(res.result)
# 统计样本的突变情况
sample_mut_count = mutations.groupby('sampleId').apply(lambda x: x.shape[0]).to_frame()
sample_mut_count.columns = ['mutation']
sample_mut_count = sample_mut_count.reset_index()
# 统计样本的 cna 情况
sample_cna_count = cna_table.count().to_frame()
sample_cna_count.columns = ['cna']
sample_cna_count = sample_cna_count.reset_index()
# 合并数据
mut_cna = pd.merge(sample_mut_count, sample_cna_count)
绘制图形
(
ggplot(mut_cna)
+ geom_point(aes(x='cna', y='mutation', colour='sampleId'),
show_legend=False, alpha=0.7)
+ theme_bw()
)
样本的突变和拷贝数变异好像有那么点像倒数函数曲线
2.2 样本占比
from collections import Counter
res = cbioportal.Clinical_Data.getAllClinicalDataInStudyUsingGET(studyId='brca_tcga_pub').response()
clin_info = DataFrame(res.result)
clin_long = clin_info[['clinicalAttributeId', 'sampleId', 'value']]
clin_table = clin_long.pivot(index='clinicalAttributeId', columns='sampleId', values='value')
clin_table = clin_table.transpose()
df = clin_table[['ER_STATUS', 'HER2_STATUS', 'PR_STATUS']]
df = df.applymap(lambda x: x if x in ['Positive', 'Negative'] else 'NA')
(
ggplot(df.melt())
+ geom_bar(aes('clinicalAttributeId', fill='value'))
+ scale_fill_manual(values=['#bebada', '#80b1d3', '#fb8072'])
+ theme_classic()
)