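Tree-Based Visualization for 10X Single-Cell (10X Spatial Transcriptomics) Clustering Analysis (TreeCorTreat)

Hello! Today I'd like to share a very nice visualization package. It uses a tree-based correlation screen to analyze and visualize the association between phenotypes and transcriptomic features and cell types at multiple cell-type resolution levels. The paper is "Tree-based Correlation Screen and Visualization for Exploring Phenotype-Cell Type Association in Multiple Sample Single-Cell RNA-Sequencing Experiments". The visualizations are very nice; a few example figures are shown below.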

图片.png

图片.png

All right, let's get started.

Overview

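Single-cell RNA-seq experiments with multiple samples are increasingly used to discover cell types, and their molecular features, that may influence sample phenotypes (e.g. disease). However, analyzing and visualizing complex cell type-phenotype associations remains nontrivial. TreeCorTreat is an open-source R package that tackles this problem with a tree-based correlation screen, which analyzes and visualizes the association between phenotypes and transcriptomic features and cell types at multiple cell-type resolution levels. With TreeCorTreat, one can conveniently explore and compare different feature types, phenotypic traits, analysis protocols and datasets, and evaluate the impact of potential confounders.
TreeCorTreat takes a gene expression matrix (raw counts), cell-level metadata and sample-level metadata as input. It provides a complete pipeline to integrate data across samples, identify cell clusters and their hierarchical structure, evaluate the association between sample phenotype and cell type at different resolution levels in terms of both cell type proportion and gene expression, and summarize and visualize the results in a tree-structured TreeCorTreat plot. The pipeline consists of six functional modules: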
  • Module 1: Data integration
  • Module 2: Define cell types at multiple resolutions
  • Module 3: Identify association between cell type proportion and sample phenotype
  • Module 4: Identify association between global gene expression and sample phenotype
  • Module 5: Identify differentially expressed genes
  • Module 6: Visualization via TreeCorTreat plot
The modular structure gives users the flexibility to skip certain analysis steps and substitute their own data or analysis functions.

Input data and data preparation

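The input to TreeCorTreat consists of three components: a gene expression matrix E (G×C, with rows representing G genes and columns representing C cells), cell-level metadata M (C×K, where K ≥ 2) and sample-level metadata L (S×J, where J ≥ 2). To avoid redundancy and save storage space, the metadata are split into cell-level and sample-level metadata. Cell-level metadata annotate information about individual cells and must contain at least two columns for each cell: the cell barcode and the corresponding sample identifier (ID). An optional third column can be added to provide cell type annotation. The cell barcode is used to couple cell-level metadata M with the gene expression matrix E. Sample-level metadata record the phenotype(s) of interest for each sample (e.g. clinical outcome) and other relevant covariates (e.g. age and sex). The first column contains the unique sample ID, and the remaining columns contain phenotypes and covariates. Cell-level and sample-level metadata are linked through the unique sample ID.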
# load in TreeCorTreat
options(warn=-1)
suppressMessages(library(TreeCorTreat))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(tidyr))

data(raw_data)
str(raw_data)
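
To make the three-part input format concrete, here is a minimal base R sketch that mocks up E, M and L and checks that they link together. All values, sample names and the two check variables are made up for illustration; this is not the package's own validation code.

```r
# Toy stand-ins for the three inputs (all values are invented for illustration)
set.seed(1)
barcodes <- paste0("cell", 1:6)

# E: genes x cells raw count matrix
E <- matrix(rpois(3 * 6, lambda = 5), nrow = 3,
            dimnames = list(c("CD3D", "CD19", "CD14"), barcodes))

# M: cell-level metadata, at least barcode + sample ID
M <- data.frame(barcode = barcodes,
                sample  = rep(c("S1", "S2"), each = 3),
                stringsAsFactors = FALSE)

# L: sample-level metadata, sample ID first, then phenotype and covariates
L <- data.frame(sample   = c("S1", "S2"),
                severity = c("HD", "Se"),
                age      = c(35, 62),
                stringsAsFactors = FALSE)

# Barcodes couple M to E; sample IDs couple M to L
barcodes_match <- setequal(colnames(E), M$barcode)
samples_match  <- all(M$sample %in% L$sample) && !anyDuplicated(L$sample)
```

Running checks like these before the pipeline can catch mismatched barcodes or duplicated sample IDs early.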

Module 1: Data integration

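The Harmony algorithm is used to integrate cells from different samples and embed them into a common low-dimensional space. The package adapts the RunHarmony and BuildClusterTree functions from Seurat v3 and wraps the following steps into a standalone treecor_harmony function with tunable parameters: library size normalization, integration feature selection, principal component analysis (PCA), Harmony integration, unsupervised Louvain clustering, Uniform Manifold Approximation and Projection (UMAP), hierarchical clustering of the Louvain clusters, and differential expression analysis to identify cell type marker genes that facilitate annotation.
Specifically, for library size normalization, the gene counts of a given cell are divided by that cell's total counts, multiplied by a scale factor (10^4) and natural-log transformed. Features for integrating samples are obtained by selecting highly variable genes (HVGs) based on the variance-mean relationship within each sample (using the SelectIntegrationFeatures function) and ranking the features by the number of samples in which they are identified as HVGs. By default, the top 2000 HVGs are selected and fed into the downstream PCA. Harmony is run on the top 20 PCs to obtain corrected Harmony embeddings. The top 20 Harmony coordinates are then used for downstream Louvain clustering and UMAP analysis (using the FindClusters and RunUMAP functions in Seurat).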
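
The library size normalization step can be sketched in a few lines of base R. This is only an illustration on a made-up matrix; using log1p (log of 1 + x, so that zero counts stay at zero) is an assumption about how the natural-log transform is applied.

```r
# Made-up 3 genes x 2 cells count matrix (filled column by column)
counts <- matrix(c(0, 2, 8,
                   5, 5, 0), nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("cellA", "cellB")))

# Total counts per cell
lib_size <- colSums(counts)

# Divide each cell by its total, scale by 1e4, then natural log (log1p assumed)
norm <- log1p(sweep(counts, 2, lib_size, "/") * 1e4)
```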
# integration
set.seed(12345)
integration <- treecor_harmony(count = raw_data[['count']], sample_meta = raw_data[['sample_meta']], output_dir = getwd())
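The integrated Seurat object is stored in the local directory, and the filtered gene expression matrix or cell-level metadata can be extracted with access_data_seurat for downstream analysis.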
# list integration result and data extracted from Seurat object
integrated_data <- access_data_seurat(seurat_obj = integration,output_dir = getwd())
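In this example no cells were filtered out, and the updated cell-level metadata (with additional columns including "celltype" and the UMAP coordinates) are stored as integrated_cellmeta.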
data(integrated_cellmeta)
head(integrated_cellmeta)
##           barcode   sample celltype    UMAP_1     UMAP_2
## 1 HD-17-Su:345974 HD-17-Su        3 -3.393676 -7.0051720
## 2 HD-17-Su:345975 HD-17-Su        2 -5.090366  6.9038575
## 3 HD-17-Su:345976 HD-17-Su        2 -6.407369  6.9599665
## 4 HD-17-Su:345977 HD-17-Su        5 -6.225506  3.5054925
## 5 HD-17-Su:345978 HD-17-Su        1  8.189426  0.8951219
## 6 HD-17-Su:345979 HD-17-Su        9  9.156926 -4.1967079
# data for downstream analysis
sample_meta <- raw_data[['sample_meta']]
count <- raw_data[['count']]
cell_meta <- integrated_cellmeta

Module 2: Define cell types at multiple resolutions

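The previous step defines cell clusters either by Louvain clustering or from user-provided cell-level metadata (i.e. the optional third column of the cell metadata). Users can add text labels to annotate the cell type of each cell cluster based on top differentially expressed genes or known cell type marker genes. Here, we overlay several marker genes (e.g. CD3D, CD19 and CD68) on the UMAP to roughly annotate the cell clusters.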
# cell clusters
label_text <- integrated_cellmeta %>% group_by(celltype) %>% summarise(UMAP_1 = median(UMAP_1),UMAP_2 = median(UMAP_2))

ggplot(integrated_cellmeta,aes(x=UMAP_1,y=UMAP_2,color = celltype)) + 
  geom_point(size = 0.1) +
  geom_text(data = label_text,aes(x=UMAP_1,y=UMAP_2,label = celltype),color = 'black',size = 5) +
  theme_classic() +
  guides(color = guide_legend(override.aes = list(size = 3))) 
图片.png
# gene markers
genes <- c('CD3D','CD14','CD19','NCAM1','CD4','CD8A',
           'FCGR3A','CD1C','CD68','CD79A','CSF3R',
           'CD33','CCR7','CD38','CD27','KLRD1')

rc <- Matrix::colSums(count)
sub_count <- count[genes,] %>% as.matrix
sub_norm <- log2(t(t(sub_count)/rc*1e6 + 1)) %>% as.matrix

df_marker <- t(sub_norm) %>% data.frame %>% mutate(barcode = rownames(.)) %>% gather(gene,expr,-barcode) %>% inner_join(cell_meta %>% select(barcode,UMAP_1,UMAP_2))

ggplot(df_marker,aes(x = UMAP_1,y = UMAP_2,col = expr)) +
  geom_point(size = 0.01, shape = ".") +
  scale_colour_viridis_c(option = 'C',direction = 1) +
  facet_wrap(~ gene) +
  theme_classic(base_size = 12) +
  theme(legend.position = 'bottom')
图片.png
We roughly categorize 16 cell clusters into 3 large cell-types based on gene markers: B cells (CD19), T cells (CD3D) and Monocytes (CD14).
# new celltype annotation
new_label <- data.frame(old = sort(factor(unique(cell_meta$celltype,levels = 1:16)))) %>% 
  mutate(large_celltype = ifelse(old %in% c(1,9,11,12,13,16),'Mono',
                                 ifelse(old %in% c(2:6,8,10,14,15),'T','B')),
         new = paste0(large_celltype,'_c',old)) %>% select(-large_celltype)
##    old      new
## 1    1  Mono_c1
## 2   10    T_c10
## 3   11 Mono_c11
## 4   12 Mono_c12
## 5   13 Mono_c13
## 6   14    T_c14
## 7   15    T_c15
## 8   16 Mono_c16
## 9    2     T_c2
## 10   3     T_c3
## 11   4     T_c4
## 12   5     T_c5
## 13   6     T_c6
## 14   7     B_c7
## 15   8     T_c8
## 16   9  Mono_c9

Modify cell type annotation

Users can modify the cell type label via modify_label function to update cell type annotations in cell_meta:
# modify cell type names in `cell_meta`
cell_meta <- modify_label(new_label,hierarchy_list = NULL,cell_meta)$cell_meta
## Modify label in cell_meta
head(cell_meta[,c('barcode','sample','celltype')])
##           barcode   sample celltype
## 1 HD-17-Su:345974 HD-17-Su     T_c3
## 2 HD-17-Su:345975 HD-17-Su     T_c2
## 3 HD-17-Su:345976 HD-17-Su     T_c2
## 4 HD-17-Su:345977 HD-17-Su     T_c5
## 5 HD-17-Su:345978 HD-17-Su  Mono_c1
## 6 HD-17-Su:345979 HD-17-Su  Mono_c9
Similarly, this function can also be used to update cell type annotations in the hierarchical tree structure (i.e. modify_label(new_label,hierarchy_list = hierarchy_list,cell_meta = NULL)$hierarchy_list).

Construct hierarchical tree structure

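To facilitate association analysis at multiple resolutions, the cell clusters are further clustered hierarchically. The tree can be derived with either a data-driven or a knowledge-based approach.
It can be supplied by the user based on prior knowledge (knowledge-based) or derived from the data using hierarchical clustering (data-driven). For the data-driven approach, a phylogenetic tree is built in PC space by applying the treecor_harmony function (or BuildClusterTree in the Seurat R package).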

Data-driven approach

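In the data-driven approach, the tree is generated by hierarchical clustering of the Louvain clusters, by applying the treecor_harmony function (or BuildClusterTree in the Seurat R package). Specifically, PC scores are first averaged over the cells within each cell cluster, and a hierarchical tree is then built. The data-driven approach offers an unbiased way to infer the underlying tree structure, but annotating each intermediate tree node can be challenging.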
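
The idea of averaging PC scores per cluster and then clustering the cluster centroids can be sketched with base R as follows. The PC scores are simulated, and the Euclidean distance and default linkage used here are assumptions; the distance and linkage actually used by treecor_harmony or BuildClusterTree may differ.

```r
set.seed(42)
# Simulated PC scores for 60 cells (10 PCs) assigned to 4 clusters
n_cells <- 60
pcs <- matrix(rnorm(n_cells * 10), nrow = n_cells,
              dimnames = list(NULL, paste0("PC", 1:10)))
cluster <- rep(paste0("c", 1:4), each = 15)

# Average the PC scores within each cluster (sum per cluster / cluster size)
centroids <- rowsum(pcs, group = cluster) / as.vector(table(cluster))

# Hierarchical clustering of the cluster centroids
tree <- hclust(dist(centroids))
# plot(tree)  # uncomment to draw the dendrogram
```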

Knowledge-based approach

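In the knowledge-based approach, users can specify the tree from their prior knowledge by providing a string that describes the hierarchical relationship among clusters at different levels of granularity. Leaf nodes in the tree correspond to the cell clusters obtained from Louvain clustering in the integration step or from the "celltype" column of the cell metadata.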
# specify string
input_string <- '@All(@B(B_c7),@T(T_c15,T_c14,T_c10,T_c8,T_c6,T_c5,T_c4,T_c3,T_c2),@Monocyte(Mono_c16,Mono_c13,Mono_c12,Mono_c11,Mono_c9,Mono_c1))'

# extract hierarchy from string
hierarchy_structure <- extract_hrchy_string(input_string,special_character = '@', plot = T)
图片.png
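
To see how the hierarchy string encodes the tree, the following base R sketch splits it into internal nodes (names prefixed with '@') and leaf nodes. It only illustrates the string format and is not how extract_hrchy_string is implemented; the variable names are made up.

```r
input_string <- '@All(@B(B_c7),@T(T_c15,T_c14,T_c10,T_c8,T_c6,T_c5,T_c4,T_c3,T_c2),@Monocyte(Mono_c16,Mono_c13,Mono_c12,Mono_c11,Mono_c9,Mono_c1))'

# Turn parentheses into separators, split on commas, drop empty tokens
tokens <- strsplit(gsub("[()]", ",", input_string), ",", fixed = TRUE)[[1]]
tokens <- tokens[nzchar(tokens)]

# Names starting with the special character are internal (non-leaf) nodes
is_internal    <- startsWith(tokens, "@")
internal_nodes <- sub("^@", "", tokens[is_internal])
leaf_nodes     <- tokens[!is_internal]
```

Comparing leaf_nodes against unique(cell_meta$celltype) is a quick way to confirm that every annotated cluster appears in the tree.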
For demonstration purposes, we use the knowledge-based hierarchical structure in the subsequent analysis.

Module 3: Identify association between cell type proportion and sample phenotype

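The analysis of cell type proportion is wrapped into a standalone function, treecor_ctprop. We first define a feature for each tree node and then evaluate the association between the feature and the sample phenotype. For a leaf node (cell cluster), the feature is defined as the proportion of a sample's cells that fall into that node. For an intermediate parent node or the root node, the feature can be defined in one of three ways: aggregate (the default), concatenate leaf nodes, or concatenate immediate children. Once the feature of each node is defined, every tree node is traversed to examine the correlation (or another summary statistic) between its feature and the sample phenotype. How the summary statistic is computed depends on whether the phenotype is univariate or multivariate: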

  • Univariate phenotype, or multivariate phenotype analyzed separately:
      • Pearson correlation (default) with correlation sign
      • Spearman correlation with correlation sign
      • Canonical correlation
      • Chi-squared statistic (via likelihood ratio test between a full model (with phenotype as explanatory variables) and a reduced model (intercept-only)) with coefficient sign
  • Multivariate phenotype analyzed jointly:
      • Canonical correlation
      • Chi-squared statistic (via likelihood ratio test)

Permutation-based p-values can be obtained, and adjusted p-values are computed using the Benjamini & Yekutieli procedure.
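
The steps described in this module (leaf proportions, an aggregated parent feature, Pearson correlation, permutation p-value and Benjamini & Yekutieli adjustment) can be sketched in base R. The data are simulated, severity is coded 0/1 as an assumption, and the permutation p-value formula is a simple illustrative choice that may differ from the one used inside treecor_ctprop.

```r
set.seed(123)
# Simulated data: 8 samples x 100 cells, 3 leaf clusters
samples  <- paste0("S", 1:8)
severity <- rep(c(0, 1), each = 4)   # 0 = healthy, 1 = severe (assumed coding)
cells <- data.frame(
  sample   = rep(samples, each = 100),
  celltype = sample(c("T_c2", "T_c3", "B_c7"), 800, replace = TRUE),
  stringsAsFactors = FALSE
)

# Leaf features: proportion of each sample's cells in each cluster
prop <- unclass(prop.table(table(cells$sample, cells$celltype), margin = 1))[samples, ]

# "aggregate" feature for parent node T: pooled proportion of its leaves
features <- cbind(T    = prop[, "T_c2"] + prop[, "T_c3"],
                  T_c2 = prop[, "T_c2"],
                  T_c3 = prop[, "T_c3"],
                  B_c7 = prop[, "B_c7"])

# Pearson correlation with a permutation p-value (phenotype labels shuffled)
perm_test <- function(x, y, n_perm = 100) {
  obs  <- cor(x, y, method = "pearson")
  null <- replicate(n_perm, cor(x, sample(y)))
  c(pearson = obs, p = mean(abs(null) >= abs(obs)))
}
res <- t(apply(features, 2, perm_test, y = severity))

# Benjamini & Yekutieli adjustment across tree nodes
res <- cbind(res, adjp = p.adjust(res[, "p"], method = "BY"))
```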
We can first use a kernel density plot to visualize cell density stratified by disease severity.
# density plot on UMAP embeddings
treecor_celldensityplot(cell_meta,
                        sample_meta,
                        row_variable = 'study',
                        col_variable = 'severity',
                        row_combined = F)
图片.png
Next, we can assess the association between cell type proportion and disease severity using treecor_ctprop() pipeline:
# cell type prop pipeline (default)
res_ctprop_full <- treecor_ctprop(hierarchy_structure,
                                  cell_meta,
                                  sample_meta,
                                  response_variable = 'severity',
                                  method = "aggregate",
                                  analysis_type = "pearson",
                                  num_permutations = 100)
names(res_ctprop_full)
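The first element is a table of the computed summary statistic (e.g. Pearson correlation) together with p-values and adjusted p-values. The second element is a list of PC matrices, one per cell type. However, under the default setting (aggregate) the proportions are aggregated into a one-dimensional vector, so no PCA is performed in this case.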
# extract Pearson correlation
res_ctprop <- res_ctprop_full[[1]] %>% mutate(severity.absolute_cor = abs(severity.pearson))
head(res_ctprop)
##   id severity.method severity.analysis_type severity.pearson severity.p
## 1  1       aggregate                pearson               NA         NA
## 2  2       aggregate                pearson        0.1147058 0.82828283
## 3  3       aggregate                pearson       -0.7614817 0.08080808
## 4  4       aggregate                pearson        0.7603030 0.08080808
## 5  5       aggregate                pearson        0.1147058 0.82828283
## 6  6       aggregate                pearson        0.5039696 0.11111111
##   severity.adjp severity.direction severity.p.sign severity.adjp.sign     x y
## 1            NA                                            6.25 2
## 2     1.0000000                  +              ns                 ns  0.00 1
## 3     0.7781478                  -              ns                 ns  5.00 1
## 4     0.7781478                  +              ns                 ns 12.50 1
## 5     1.0000000                  +              ns                 ns  0.00 0
## 6     0.8321858                  +              ns                 ns  1.00 0
##      label  leaf severity.absolute_cor
## 1      All FALSE                    NA
## 2        B FALSE             0.1147058
## 3        T FALSE             0.7614817
## 4 Monocyte FALSE             0.7603030
## 5     B_c7  TRUE             0.1147058
## 6    T_c15  TRUE             0.5039696
colnames(res_ctprop)
##  [1] "id"                     "severity.method"        "severity.analysis_type"
##  [4] "severity.pearson"       "severity.p"             "severity.adjp"         
##  [7] "severity.direction"     "severity.p.sign"        "severity.adjp.sign"    
## [10] "x"                      "y"                      "label"                 
## [13] "leaf"                   "severity.absolute_cor"
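The results can then be visualized with a TreeCorTreat plot. Variables can be assigned to different aesthetics (e.g. color, size, alpha). The TreeCorTreat plot is discussed in detail later.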
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = res_ctprop,
                 response_variable = 'severity',
                 color_variable = 'direction',
                 size_variable = 'absolute_cor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2)
图片.png

Module 4: Identify association between global gene expression and sample phenotype

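The analysis of global gene expression is wrapped into a standalone function, treecor_expr. Here too, we first define a feature for each tree node and then evaluate the association between global gene expression and sample phenotype. For a leaf node, all cells in the node are first pooled to obtain the node's pseudobulk profile. Highly variable genes are then selected from the pseudobulk profiles of all samples using locally weighted scatterplot smoothing (LOWESS) in R. The pseudobulk profile of the selected highly variable genes in each sample is used as the node's feature vector, which is multivariate. For non-leaf nodes, the feature can be defined in one of three ways: aggregate (the default), concatenate leaf nodes, or concatenate immediate children. Once the feature of each node is defined, every tree node is traversed to examine the correlation (or another summary statistic) between its feature and the sample phenotype. One can choose canonical correlation or an F-statistic (Pillai-Bartlett, comparing a full model with phenotype as explanatory variables against a reduced, intercept-only model).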
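
As a rough illustration of how a node's expression feature could be built, the sketch below pools simulated cells into per-sample pseudobulk profiles, normalizes them, keeps the most variable genes and runs PCA. The log2 CPM normalization and the plain variance ranking are stand-ins chosen for brevity; the text describes a LOWESS-based gene selection, which this sketch does not implement.

```r
set.seed(7)
# Simulated counts: 50 genes x 120 cells from 6 samples, all in one tree node
genes     <- paste0("gene", 1:50)
sample_id <- rep(paste0("S", 1:6), each = 20)
counts <- matrix(rpois(50 * 120, lambda = 3), nrow = 50,
                 dimnames = list(genes, paste0("cell", 1:120)))

# Pool cells per sample: sum counts over all cells of the same sample
pseudobulk <- t(rowsum(t(counts), group = sample_id))

# Library-size normalize (log2 CPM is an assumption for illustration)
pb_norm <- log2(sweep(pseudobulk, 2, colSums(pseudobulk), "/") * 1e6 + 1)

# Keep the 20 most variable genes (simple variance ranking, not LOWESS)
gene_var <- apply(pb_norm, 1, var)
hvg <- names(sort(gene_var, decreasing = TRUE))[1:20]

# Sample-level feature matrix (samples x genes) and its first two PCs
feat <- t(pb_norm[hvg, ])
pc   <- prcomp(feat, center = TRUE, scale. = FALSE)$x[, 1:2]
```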
The following code illustrates an example of association between gene expression and disease severity, using default setting (aggregate).
# gene expression pipeline
res_expr_full <- treecor_expr(count,
                              hierarchy_structure,
                              cell_meta,
                              sample_meta,
                              response_variable = 'severity',
                              method = 'aggregate',
                              analysis_type = 'cancor',
                              num_permutations = 100)
Running treecor_expr returns two elements: the first is a summary table of the computed summary statistic (e.g. canonical correlation) along with the p-value and adjusted p-value for each cell type; the second is a list of PC matrices, one per cell type.
# extract canonical correlation
res_expr <- res_expr_full[[1]]
head(res_expr)
##   id severity.cancor severity.p severity.adjp severity.direction
## 1  1       0.8942939 0.03030303     0.4360897                  +
## 2  2       0.6428883 0.15151515     0.9911129                  +
## 3  3       0.9297493 0.03030303     0.4360897                  +
## 4  4       0.8324347 0.03030303     0.4360897                  +
## 5  5       0.6428883 0.15151515     0.9911129                  +
## 6  6       0.8807827 0.04040404     0.4845441                  +
##   severity.p.sign severity.adjp.sign     x y    label  leaf
## 1             sig                 ns  6.25 2      All FALSE
## 2              ns                 ns  0.00 1        B FALSE
## 3             sig                 ns  5.00 1        T FALSE
## 4             sig                 ns 12.50 1 Monocyte FALSE
## 5              ns                 ns  0.00 0     B_c7  TRUE
## 6             sig                 ns  1.00 0    T_c15  TRUE

The summary table is used to generate TreeCorTreat plot:
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2)
图片.png
Besides the summary table, each cell type can also be examined individually using the PC matrices from the gene expression analysis.
# extract PCA list
pc_expr <- res_expr_full[[2]]
names(pc_expr)

##  [1] "All"      "B"        "T"        "Monocyte" "B_c7"     "T_c15"   
##  [7] "T_c14"    "T_c10"    "T_c8"     "T_c6"     "T_c5"     "T_c4"    
## [13] "T_c3"     "T_c2"     "Mono_c16" "Mono_c13" "Mono_c12" "Mono_c11"
## [19] "Mono_c9"  "Mono_c1"

# extract PCA matrix corresponding to 'T' tree node
pc_expr[['T']]

##                    PC1        PC2
## HD-17-Su    -44.684639   2.823131
## HD-19-Su    -51.774059   4.264000
## HD-23-Su    -51.352355 -23.472410
## HD-30-Su    -39.098933 -37.230307
## Se-137.1-Su  73.681369  47.520847
## Se-178.1-Su  11.966558  20.281313
## Se-180.1-Su   6.457736  53.349695
## Se-181.1-Su  94.804324 -67.536270
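The list of PC matrices is named by the cell type names given in the celltype column of the cell-level metadata. One can therefore extract the per-sample PC coordinates for a given cell type (e.g. the T cell category) and visualize a sample-level PCA plot with treecor_samplepcaplot.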
# PCA plot (using `T` node)
treecor_samplepcaplot(sample_meta,
                      pca_matrix = pc_expr[['T']],
                      response_variable = 'severity',
                      font_size = 10,
                      point_size = 3)
图片.png
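Sample severity can be seen to change from healthy to severe along the direction indicated by the dashed line. This direction is the optimal axis inferred by CCA, which maximizes the correlation between severity and the sample coordinates in the embedding space.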
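
The optimal axis can be reproduced in spirit with base R's cancor, using the PC coordinates printed for the 'T' node earlier. Coding severity as 0 for HD samples and 1 for Se samples is an assumption made for this sketch, and treecor_samplepcaplot may compute and draw the axis differently.

```r
# PC coordinates copied from the pc_expr[['T']] output shown earlier
pc <- matrix(c(-44.684639,   2.823131,
               -51.774059,   4.264000,
               -51.352355, -23.472410,
               -39.098933, -37.230307,
                73.681369,  47.520847,
                11.966558,  20.281313,
                 6.457736,  53.349695,
                94.804324, -67.536270),
             ncol = 2, byrow = TRUE,
             dimnames = list(c("HD-17-Su", "HD-19-Su", "HD-23-Su", "HD-30-Su",
                               "Se-137.1-Su", "Se-178.1-Su", "Se-180.1-Su",
                               "Se-181.1-Su"),
                             c("PC1", "PC2")))

# Assumed numeric coding of severity: healthy donors 0, severe 1
severity <- as.numeric(startsWith(rownames(pc), "Se"))

# Canonical correlation between the 2D coordinates and severity
cc <- cancor(pc, severity)

# Direction in PC space (unit length) that maximizes correlation with severity
axis_dir <- cc$xcoef[, 1]
axis_dir <- axis_dir / sqrt(sum(axis_dir^2))

# Position of each sample along that axis
proj <- as.vector(scale(pc, scale = FALSE) %*% axis_dir)
```

Because the sign of a canonical direction is arbitrary, only the absolute correlation between proj and severity is guaranteed to match cc$cor.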

Module 5: Identify differentially expressed genes

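The analysis of differentially expressed genes (DEGs) is wrapped into the function treecor_deg(). Here, we first compute the pseudobulk gene expression profile of each tree node using the 'aggregate' approach as before. In other words, for each node, all cells from its descendants are pooled and aggregated into a pseudobulk profile in each sample. Pseudobulk profiles are normalized by library size. Differential expression analysis is then performed with limma:
  • Univariate phenotype, or multivariate phenotype analyzed separately: a limma model is fitted by regressing gene expression on the phenotype, adjusting for covariates. For each node, DEGs that pass a user-specified false discovery rate (FDR) cutoff (default cutoff = 0.05) are reported and saved to a csv file together with the log fold change, t-statistic, p-value and FDR.
  • Multivariate phenotype analyzed jointly: multiple traits are combined into a univariate variable, either with a data-driven approach (PC1) or with user-specified weights (a weighted linear combination of the traits). The combined trait is then analyzed as a univariate phenotype.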
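
To illustrate the shape of the per-node DEG analysis, the sketch below regresses each gene on the phenotype with a covariate and applies an FDR cutoff. It deliberately uses ordinary per-gene lm fits from base R as a stand-in, not limma, so it lacks limma's empirical Bayes moderation; the data, the age covariate and the spiked-in signal are all simulated.

```r
set.seed(99)
n_genes <- 200
n_samples <- 8
severity <- factor(rep(c("HD", "Se"), each = 4))
age <- c(30, 45, 52, 61, 38, 47, 55, 66)   # made-up covariate

# Simulated normalized pseudobulk matrix (genes x samples)
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
               dimnames = list(paste0("gene", 1:n_genes),
                               paste0("S", 1:n_samples)))
# Spike a shift into the first 10 genes for the severe group
expr[1:10, severity == "Se"] <- expr[1:10, severity == "Se"] + 4

# Per-gene linear model: expression ~ phenotype + covariate
fit_gene <- function(y) {
  coefs <- summary(lm(y ~ severity + age))$coefficients
  coefs["severitySe", c("Estimate", "t value", "Pr(>|t|)")]
}
deg <- as.data.frame(t(apply(expr, 1, fit_gene)))
colnames(deg) <- c("logFC", "t", "p")

# Benjamini-Hochberg FDR, then sort by p-value and count genes passing 0.05
deg$fdr <- p.adjust(deg$p, method = "BH")
deg <- deg[order(deg$p), ]
n_deg <- sum(deg$fdr < 0.05)
```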
# DEG pipeline
res_deg <- treecor_deg(count,
                       hierarchy_structure,
                       cell_meta,
                       sample_meta,
                       response_variable = 'severity')
names(res_deg)

## [1] "dge.summary"   "dge.ls"        "pseudobulk.ls"

# 1st: summary of number of DEGs 
head(res_deg$dge.summary) # or head(res_deg[[1]])

##      label severity.num_deg     x y id  leaf
## 1      All                0  6.25 2  1 FALSE
## 2        B                0  0.00 1  2 FALSE
## 3        T              993  5.00 1  3 FALSE
## 4 Monocyte                2 12.50 1  4 FALSE
## 5     B_c7                0  0.00 0  5  TRUE
## 6    T_c15              456  1.00 0  6  TRUE

# 2nd: extract DEGs for a specific tree node using cell type name
# (1) check phenotypes
names(res_deg$dge.ls)

## [1] "severity"

# (2) choose a phenotype
severity.dge.ls <- res_deg$dge.ls[['severity']]
# (3) extract DEGs of T cell category
head(severity.dge.ls[['T']]) 

##          severitySe.logFC severitySe.t severitySe.p severitySe.fdr
## FADS1            2.221299    13.884431 2.351480e-07    0.004165177
## JCHAIN           3.036823    12.451043 5.964215e-07    0.005282207
## INPPL1           1.755148    10.250902 3.058248e-06    0.017470015
## APOBEC3H         2.250345     9.908317 4.052855e-06    0.017470015
## GPR68            1.693644     9.515158 5.657245e-06    0.017470015
## CORO1C           1.813891     9.220954 7.316652e-06    0.017470015

# 3rd: extract sample-level pseudobulk for a specific tree node using cell type name
names(res_deg$pseudobulk.ls) # or names(res_deg[[3]])

##  [1] "All"      "B"        "T"        "Monocyte" "B_c7"     "T_c15"   
##  [7] "T_c14"    "T_c10"    "T_c8"     "T_c6"     "T_c5"     "T_c4"    
## [13] "T_c3"     "T_c2"     "Mono_c16" "Mono_c13" "Mono_c12" "Mono_c11"
## [19] "Mono_c9"  "Mono_c1"

head(res_deg$pseudobulk.ls[['T']]) 

##          HD-17-Su HD-19-Su  HD-23-Su HD-30-Su Se-137.1-Su Se-178.1-Su
## A1BG     5.493462 5.506674 5.6802740 5.876471   4.7791551   5.3461643
## A1BG-AS1 2.652148 1.854222 2.3063366 2.420313   2.9822341   2.2278498
## A2M      0.000000 0.000000 0.1968054 0.000000   0.6555113   0.3599841
## A2M-AS1  1.814409 1.413839 1.2995822 1.852688   3.5050187   2.9638223
## A4GALT   0.000000 0.000000 0.3699580 0.000000   0.0000000   0.0000000
## AAAS     4.653329 4.235223 4.1063058 4.665398   5.1026623   4.7283047
##          Se-180.1-Su Se-181.1-Su
## A1BG        5.516765    4.809994
## A1BG-AS1    0.000000    2.002053
## A2M         0.000000    1.001369
## A2M-AS1     3.578594    3.809896
## A4GALT      0.000000    0.000000
## AAAS        4.315328    4.250520
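treecor_deg returns a list of three elements: the first summarizes the number of DEGs for each cell type; the second contains the specific DEGs identified in each cell cluster for each phenotype (response_variable); the third stores the sample-level pseudobulk gene expression matrices (for each cell type, each row is a gene and each column is a sample).
Depending on whether a multivariate phenotype is analyzed separately or jointly, the summary table (the first element, dge.summary) contains additional columns recording the number of DEGs for the different phenotypes, and the DEG list (the second element, dge.ls) contains additional elements.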
We can visualize the number of DEGs using TreeCorTreat plot:
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = res_deg$dge.summary,
                 response_variable = 'severity',
                 color_variable = 'num_deg',
                 size_variable = 'num_deg',
                 alpha_variable = NULL,
                 annotate_number = T,
                 annotate_number_column = 'num_deg',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2)
图片.png
In addition, treecor_degheatmap can be used to explore the expression pattern of the top n DEGs by specifying n and the direction/sign of the log fold change:
# heatmap for top 10 DEGs with positive log fold change (enriched in Severe)
treecor_degheatmap(sample_meta,
                   pseudobulk = res_deg$pseudobulk.ls[['T']],
                   deg_result = res_deg$dge.ls$severity[['T']],
                   top_n = 10,
                   deg_logFC = "positive",
                   annotation_col = c('severity','sex','age'),
                   annotation_colors = list(severity = c('HD' = 'green','Se' = 'red'),
                                            sex = c('F' = 'purple','M' = 'blue')))
图片.png
# heatmap for top 10 DEGs with negative log fold change (enriched in Healthy)
treecor_degheatmap(sample_meta,
                   pseudobulk = res_deg$pseudobulk.ls[['T']],
                   deg_result = res_deg$dge.ls$severity[['T']],
                   top_n = 10,
                   deg_logFC = "negative",
                   annotation_col = c('severity','sex','age'),
                   annotation_colors = list(severity = c('HD' = 'green','Se' = 'red'),
                                            sex = c('F' = 'purple','M' = 'blue')))
图片.png
# Combined two plots above
treecor_degheatmap(sample_meta,
                   pseudobulk = res_deg$pseudobulk.ls[['T']],
                   deg_result = res_deg$dge.ls$severity[['T']],
                   top_n = 10,
                   deg_logFC = "both",
                   annotation_col = c('severity','sex','age'),
                   annotation_colors = list(severity = c('HD' = 'green','Se' = 'red'),
                                            sex = c('F' = 'purple','M' = 'blue')))
图片.png
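
The selection behind the "positive", "negative" and "both" options can be sketched as a small helper. The function top_deg, its column names and the ranking by FDR are hypothetical choices for illustration; treecor_degheatmap may rank genes differently.

```r
# Made-up DEG table
deg <- data.frame(
  gene  = c("g1", "g2", "g3", "g4", "g5", "g6"),
  logFC = c(2.2, -1.8, 3.0, -0.9, 1.7, -2.5),
  fdr   = c(0.004, 0.010, 0.005, 0.040, 0.017, 0.002),
  stringsAsFactors = FALSE
)

# Hypothetical helper: top n genes by FDR, split by sign of log fold change
top_deg <- function(deg, top_n, direction = c("positive", "negative", "both")) {
  direction <- match.arg(direction)
  deg <- deg[order(deg$fdr), ]
  pos <- head(deg$gene[deg$logFC > 0], top_n)
  neg <- head(deg$gene[deg$logFC < 0], top_n)
  switch(direction, positive = pos, negative = neg, both = c(pos, neg))
}

top_pos  <- top_deg(deg, 2, "positive")
top_neg  <- top_deg(deg, 2, "negative")
top_both <- top_deg(deg, 2, "both")
```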

Module 6: Visualization

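TreeCorTreat provides a variety of functions to visualize intermediate and final results. For example, treecor_celldensityplot shows how cell type proportions differ across sample groups (Module 3). treecor_samplepcaplot projects samples onto a shared low-dimensional embedding and finds the optimal axis (Module 4). treecor_degheatmap shows the top differentially expressed genes for a given cell type (Module 5). In particular, to summarize the final results, a versatile function treecortreatplot displays multi-resolution cell type-phenotype associations in a format we call the "TreeCorTreat plot". The plot consists of a dendrogram showing the cell type hierarchy and information layers showing the analysis results at each tree node.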

Tree skeleton representation

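To display the tree, nodes can be connected with straight lines, the classical angle-bend representation, or quadratic Bezier curves. Line aesthetics such as line type (e.g. solid or dashed) and line color can be chosen according to preference.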
Edge type
  • Straight line
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: edge_path_type
                 edge_path_type = "link")
图片.png
  • Classical angle bend
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.6,
                 nonleaf_point_gap = 0.3,
                 ## parameter: edge_path_type
                 edge_path_type = "elbow")
图片.png
  • Quadratic Bezier curves
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: edge_path_type
                 edge_path_type = "diagonal")
图片.png

Line aesthetics: line type and color

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: line_type
                 line_type = 'dashed',
                 ## parameter: line_color
                 line_color = 'darkgreen')
image.png

Modify label or circle position

In some cases, when a label (cell type name) is long or the circles are large, the default label or circle position in the non-leaf part (left) may not work well. The nonleaf_label_pos and nonleaf_point_gap parameters therefore let users manually set the label position and the gap between points for multiple phenotypes. As a tip, if one has k phenotypes to be analyzed separately, one can first choose a suitable gap between points (e.g. nonleaf_point_gap = 0.1) and then set nonleaf_label_pos accordingly (e.g. for k = 3, nonleaf_label_pos = 0.1*3+0.2 = 0.5). In the examples above, the gap between points is 0.2 and the label position is 0.4.
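
The worked example in the tip (0.1*3+0.2 = 0.5) suggests a simple rule of thumb, sketched below. The helper suggest_label_pos and its margin argument are hypothetical and not part of TreeCorTreat; the general formula is inferred from the example and from the values 0.2 and 0.4 used in the plots above.

```r
# Hypothetical helper: label position = gap between points * k + a fixed margin
suggest_label_pos <- function(k, point_gap = 0.1, margin = 0.2) {
  stopifnot(k >= 1, point_gap > 0)
  point_gap * k + margin
}

pos_three_traits <- suggest_label_pos(k = 3, point_gap = 0.1)  # the tip's example
pos_one_trait    <- suggest_label_pos(k = 1, point_gap = 0.2)  # settings used above
```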

Leaf representation

For external node configuration, TreeCorTreat plot aligns cell types in the rows and phenotypes in the columns. There are three different ways to visualize the results: balloon plot, heatmap and bar plot. Users can use color (color_variable), size (size_variable) and transparency (alpha_variable) to encode different information.

Balloon plot

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: plot_type
                 plot_type = 'balloon')
图片.png
In the above balloon plot, the size of the balloon reflects the magnitude of canonical correlation between global gene expression profile and disease severity and color as well as transparency correspond to p-value significance.
Alternatively, we can use color to represent canonical correlation and size/transparency to represent p-value significance using the following code:
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'p.sign',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: plot_type
                 plot_type = 'balloon')
图片.png

Heatmap

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: plot_type
                 plot_type = 'heatmap')
图片.png

Barplot

treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'p.sign',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 ## parameter: plot_type
                 plot_type = 'bar')
图片.png

More advanced plotting figures

Annotate numbers

Users can specify which variable and color to use when annotating a TreeCorTreat plot via the annotate_number, annotate_number_column and annotate_number_color arguments. The following code overlays the canonical correlation on top of the TreeCorTreat plot:
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 annotate_number = T,
                 annotate_number_column = 'cancor',
                 annotate_number_color = 'black')
图片.png

Modify legend title

To modify a legend title, one option is to edit the column names of annotated_df directly before running treecortreatplot. The columns for the summary statistic and statistical significance are usually named in response_variable.colname format (e.g. severity.cancor, severity.direction (only exists in cell type proportion analysis), severity.p, severity.p.sign, etc.). It is therefore crucial to check that the column names are in the correct format before generating a TreeCorTreat plot.
Suppose we want to rename both the response variable ('severity') and the legends ('cancor' and 'p.sign') by capitalizing the first letter of 'severity' and 'cancor' and replacing 'p.sign' with 'P-value significance':
## modify column names
colnames(res_expr)

##  [1] "id"                 "severity.cancor"    "severity.p"        
##  [4] "severity.adjp"      "severity.direction" "severity.p.sign"   
##  [7] "severity.adjp.sign" "x"                  "y"                 
## [10] "label"              "leaf"

res_expr_new <- res_expr
colnames(res_expr_new) <- gsub('severity','Severity',colnames(res_expr_new))
colnames(res_expr_new) <- gsub('cancor','Cancor',colnames(res_expr_new))
colnames(res_expr_new) <- gsub('\\.p.sign','\\.P-value significance',colnames(res_expr_new))

## check modified column names
colnames(res_expr_new)

##  [1] "id"                            "Severity.Cancor"              
##  [3] "Severity.p"                    "Severity.adjp"                
##  [5] "Severity.direction"            "Severity.P-value significance"
##  [7] "Severity.adjp.sign"            "x"                            
##  [9] "y"                             "label"                        
## [11] "leaf"

## visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr_new,
                 response_variable = 'Severity',
                 color_variable = 'P-value significance',
                 size_variable = 'Cancor',
                 alpha_variable = 'P-value significance',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon')
图片.png
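
Since the plot relies on the response_variable.colname naming convention, a quick check before plotting can save a confusing error. The helper check_plot_columns below is hypothetical (not part of TreeCorTreat) and the toy data frame is made up.

```r
# Hypothetical helper: verify that <response_variable>.<variable> columns exist
check_plot_columns <- function(df, response_variable, variables) {
  expected <- paste(response_variable, variables, sep = ".")
  missing  <- setdiff(expected, colnames(df))
  if (length(missing) > 0) {
    stop("Missing columns: ", paste(missing, collapse = ", "))
  }
  invisible(expected)
}

# Toy summary table using the renamed columns
toy <- data.frame(id = 1:2, Severity.Cancor = c(0.9, 0.6), check.names = FALSE)
toy[["Severity.P-value significance"]] <- c("sig", "ns")

cols <- check_plot_columns(toy, "Severity", c("Cancor", "P-value significance"))
```

Calling the helper with the old lowercase names on the renamed table raises an error that lists the missing columns.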

Modify label colors

Users can also modify label colors to highlight cell types of interest. The advanced setting can be configured by advanced_list argument by providing label_info with fixed column names: ‘label’ (cell type label) and ‘label.color’ (a color vector). For example, we can highlight cell types that have significant global gene expression-disease severity correlation in red:
label_info <- data.frame(label = res_expr$label,
                         label.color = ifelse(res_expr$severity.p.sign == 'ns','black','red'))
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 advanced_list = list(label_info = label_info))
图片.png

Create your own mapping from levels to aesthetic values

Users can customize the mapping from levels in the data (e.g. categorical variables) to aesthetic values (e.g. color, size or transparency) via breaks and values (similar to the scale_*_manual() functions in ggplot2). This can also be achieved by providing color_info, size_info or alpha_info in the advanced_list argument. The following code illustrates changing the colors for statistical significance to blue (non-significant) and red (significant) and modifying the transparency:
advanced_list <- list(color_info = data.frame(breaks = c('ns','sig'), values = c('blue','red')),
                      alpha_info = data.frame(breaks = c('ns','sig'), values = c(0.5,1)))
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 advanced_list = advanced_list)
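
Conceptually, a breaks/values table is a lookup from data levels to aesthetic values. The base R sketch below illustrates that idea on toy significance labels. It is not TreeCorTreat's internal code, and the p_sign vector is made up for illustration.

```r
# Mapping table in the same breaks/values shape as color_info
color_info <- data.frame(breaks = c('ns', 'sig'),
                         values = c('blue', 'red'),
                         stringsAsFactors = FALSE)

# Toy significance labels for five nodes (made up for illustration)
p_sign <- c('sig', 'ns', 'sig', 'sig', 'ns')

# Look up each level's aesthetic value by matching it against breaks
node_colors <- color_info$values[match(p_sign, color_info$breaks)]
print(node_colors)
```

A level that is missing from breaks would come back as NA in this sketch, so it is worth making sure that every level in the data has a row in the table.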

Choose a different color palette

For a categorical variable, the color palette can be specified through color_info in the advanced options. For a continuous variable, palettes from the RColorBrewer package are available (the default is ‘Spectral’), and the choice can be changed through the palette element of advanced_list.
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 advanced_list = list(palette = 'PiYG'))

Configure layout

The layout of a TreeCorTreat plot can be modified via layout_widths in the advanced_list parameter, which specifies the relative widths of the left (non-leaf) and right (leaf) parts of the plot.
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon',
                 advanced_list = list(layout_widths = c(2,1)))

Save TreeCorTreat plots

TreeCorTreat plots can be saved as PDF, PNG, or other graphics formats.
png('TreeCorTreat_plot.png', height = 8, width = 15, res = 300, units = 'in')
treecortreatplot(hierarchy_structure,
                 annotated_df = res_expr,
                 response_variable = 'severity',
                 color_variable = 'cancor',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.4,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon')
dev.off()
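
The PDF variant follows the same open-device, plot, close-device pattern. The sketch below uses a base R placeholder plot in place of the treecortreatplot(...) call so that it runs on its own. The temporary output path is an assumption made for illustration.

```r
# Write to a temporary file so the sketch does not touch the working directory
out_file <- file.path(tempdir(), 'TreeCorTreat_plot.pdf')

# pdf() sizes are given in inches; unlike png(), it takes no res or units arguments
pdf(out_file, height = 8, width = 15)
plot(1:10, main = 'placeholder for treecortreatplot(...)')  # stand-in for the real plot call
invisible(dev.off())  # close the device so the file is written

print(file.exists(out_file))
```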

TreeCorTreat supports versatile multi-resolution analysis

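To analyze cell type-phenotype associations at multiple resolutions, fine-grained cell clusters (children) are progressively merged into larger cell clusters (parents). When a parent node is created by merging two or more child nodes, the information from the children can be combined in several ways, each with a different interpretation. The two basic approaches are ‘aggregation’ and ‘concatenation’. In the aggregation approach, all cells from the child nodes are pooled into a single pseudobulk sample that represents the parent node. In the concatenation approach, the cells within each child node are first pooled to form that node's pseudobulk sample. Each child node is then represented by a feature or feature vector extracted from its pseudobulk sample (e.g., highly variable genes). Finally, the parent node is represented by concatenating the feature vectors of all its child nodes into one longer vector.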
To support different needs of users, TreeCorTreat provides three different ways to combine child nodes: ‘aggregate’ (default), ‘concatenate leaf nodes’, and ‘concatenate immediate children’.
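
The difference between the two basic approaches can be seen on a toy example. The sketch below uses made-up pseudobulk counts for a single sample and two hypothetical child clusters. It only illustrates the shape of the resulting feature vectors, not TreeCorTreat's actual feature extraction.

```r
# Toy pseudobulk counts for one sample: 3 genes in two child clusters (made-up numbers)
child_a <- c(gene1 = 10, gene2 = 0, gene3 = 5)
child_b <- c(gene1 = 2,  gene2 = 8, gene3 = 1)

# Aggregate: pool the children, so the parent keeps one value per gene
parent_aggregate <- child_a + child_b

# Concatenate: keep each child's features side by side, so the parent vector is longer
parent_concat <- c(child_a, child_b)
names(parent_concat) <- paste(rep(c('a', 'b'), each = 3), names(parent_concat), sep = '_')

print(parent_aggregate)
print(parent_concat)
```

Aggregation keeps the feature dimension fixed but loses which child a count came from. Concatenation preserves the per-child signal at the cost of a feature vector that grows with the number of children.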

Aggregate

Aggregation is the default for both treecor_expr and treecor_ctprop (see the previous examples): for a non-leaf node, the raw read counts (gene expression) or the numbers of cells (cell composition) are pooled across the clusters under that node.

Concatenate leaf nodes

In the ‘concatenate leaf nodes’ approach, feature vectors of all terminal leaf nodes derived from a node are concatenated into a long vector to serve as the feature vector of the node. For global gene expression analysis, this will result in a concatenated vector consisting of pseudobulk expression of highly variable genes obtained from each leaf node. For cell type proportion analysis, this will result in a vector of cell type proportions of the leaf nodes.
# gene expression pipeline; concat_leaf
res_expr_concatLeaf <- treecor_expr(count,
                                    hierarchy_structure,
                                    cell_meta,
                                    sample_meta,
                                    response_variable = 'severity',
                                    method = 'concat_leaf',
                                    num_permutations = 100)$canonical_corr
head(res_expr_concatLeaf)

##   id severity.cancor severity.p severity.adjp severity.direction
## 1  1       0.7803956 0.03030303     0.4360897                  +
## 2  2       0.6428883 0.15151515     0.9911129                  +
## 3  3       0.8556124 0.03030303     0.4360897                  +
## 4  4       0.6743079 0.03030303     0.4360897                  +
## 5  5       0.6428883 0.15151515     0.9911129                  +
## 6  6       0.8807827 0.04040404     0.4845441                  +
##   severity.p.sign severity.adjp.sign     x y    label  leaf
## 1             sig                 ns  6.25 2      All FALSE
## 2              ns                 ns  0.00 1        B FALSE
## 3             sig                 ns  5.00 1        T FALSE
## 4             sig                 ns 12.50 1 Monocyte FALSE
## 5              ns                 ns  0.00 0     B_c7  TRUE
## 6             sig                 ns  1.00 0    T_c15  TRUE

Concatenate immediate children

In the ‘concatenate immediate children’ approach, a node’s immediate children are first identified. The feature vector of each immediate child is obtained using the ‘aggregate’ approach. The feature vector of the target node is then obtained by concatenating the feature vectors of its immediate children.

# gene expression pipeline; concat_immediate_children
res_expr_concatImmChild <- treecor_expr(count,
                                        hierarchy_structure,
                                        cell_meta,
                                        sample_meta,
                                        response_variable = 'severity',
                                        method = 'concat_immediate_children',
                                        num_permutations = 100)$canonical_corr
head(res_expr_concatImmChild)

##   id severity.cancor severity.p severity.adjp severity.direction
## 1  1       0.8504976 0.03030303     0.4360897                  +
## 2  2       0.6428883 0.15151515     0.9911129                  +
## 3  3       0.8556124 0.03030303     0.4360897                  +
## 4  4       0.6743079 0.03030303     0.4360897                  +
## 5  5       0.6428883 0.15151515     0.9911129                  +
## 6  6       0.8807827 0.04040404     0.4845441                  +
##   severity.p.sign severity.adjp.sign     x y    label  leaf
## 1             sig                 ns  6.25 2      All FALSE
## 2              ns                 ns  0.00 1        B FALSE
## 3             sig                 ns  5.00 1        T FALSE
## 4             sig                 ns 12.50 1 Monocyte FALSE
## 5              ns                 ns  0.00 0     B_c7  TRUE
## 6             sig                 ns  1.00 0    T_c15  TRUE

# compare 3 methods (non-leaf nodes)
summary_ls <- list(res_expr %>% filter(!leaf) %>% select(id,label,severity.cancor) %>% rename(aggregate = severity.cancor),
                   res_expr_concatLeaf %>% filter(!leaf) %>% select(id,label,severity.cancor) %>% rename(concatLeaf = severity.cancor),
                   res_expr_concatImmChild %>% filter(!leaf) %>% select(id,label,severity.cancor) %>% rename(concatImmChild = severity.cancor))
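# join on the shared id/label columns; the name summary_df avoids masking base::summary()
summary_df <- Reduce(function(x, y) inner_join(x, y, by = c('id','label')), summary_ls)
summary_df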

##   id    label aggregate concatLeaf concatImmChild
## 1  1      All 0.8942939  0.7803956      0.8504976
## 2  2        B 0.6428883  0.6428883      0.6428883
## 3  3        T 0.9297493  0.8556124      0.8556124
## 4  4 Monocyte 0.8324347  0.6743079      0.6743079
For broad cell categories such as B, T, and Monocyte, leaf concatenation and immediate-children concatenation give the same canonical correlation because each node's immediate children are exactly its leaf nodes. For other tree hierarchies, the three approaches can give different results with different interpretations. The comparison can also be visualized with a TreeCorTreat plot:
# aggregate
colnames(res_expr) <- gsub('severity','Severity\n(Agg)',colnames(res_expr))
# concat_leaf
colnames(res_expr_concatLeaf) <- gsub('severity','Severity\n(Leaf)',colnames(res_expr_concatLeaf))
# concat_immediate_children
colnames(res_expr_concatImmChild) <- gsub('severity','Severity\n(ImmChild)',colnames(res_expr_concatImmChild))

# combine three approaches
res_combined <- Reduce(inner_join,list(res_expr,res_expr_concatLeaf,res_expr_concatImmChild))

# plot
treecortreatplot(hierarchy_structure,
                 annotated_df = res_combined,
                 response_variable = c('Severity\n(Agg)','Severity\n(ImmChild)','Severity\n(Leaf)'),
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.8,
                 nonleaf_point_gap = 0.2,
                 plot_type = 'balloon')
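
The reason why the two concatenation approaches agree for B, T, and Monocyte but not for the root can be checked on a toy hierarchy. The sketch below uses a hypothetical child-to-parent edge list with made-up cluster labels. It is not the hierarchy_structure object used in this tutorial.

```r
# Toy hierarchy as a child -> parent edge list (labels are hypothetical)
edges <- data.frame(child  = c('B', 'T', 'Mono', 'B_1', 'B_2', 'T_1', 'T_2', 'Mono_1'),
                    parent = c('All', 'All', 'All', 'B', 'B', 'T', 'T', 'Mono'),
                    stringsAsFactors = FALSE)

# Direct children of a node
immediate_children <- function(node) edges$child[edges$parent == node]

# Recursively collect the terminal nodes below a node
leaves_below <- function(node) {
  kids <- immediate_children(node)
  if (length(kids) == 0) return(node)
  unlist(lapply(kids, leaves_below))
}

# For T, the immediate children are already leaves, so both sets coincide
print(setequal(immediate_children('T'), leaves_below('T')))
# For the root, the immediate children are internal nodes, so the sets differ
print(setequal(immediate_children('All'), leaves_below('All')))
```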

Analysis of multivariate outcomes

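When there are multiple phenotypic traits, each trait can be analyzed separately as a univariate phenotype, or all traits can be analyzed jointly as a multivariate phenotype. To analyze the association between a multivariate phenotype and cell type proportion or global gene expression, canonical correlation analysis (CCA) is used to compute the canonical correlation between the phenotype and the cell type features (i.e., cell type proportions or gene expression).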
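
As a minimal sketch of what a canonical correlation measures, the base R function stats::cancor can be applied to toy data. All values below are simulated for illustration, and the example does not claim to reproduce how TreeCorTreat computes its statistics internally.

```r
set.seed(1)
n <- 30  # number of toy samples

# Toy bivariate phenotype (simulated values, not from the dataset used here)
phenotype <- cbind(severity = rnorm(n), age = rnorm(n, mean = 50, sd = 10))

# Toy cell type features: the first column tracks severity, the others are noise
features <- cbind(f1 = phenotype[, 'severity'] + rnorm(n, sd = 0.5),
                  f2 = rnorm(n),
                  f3 = rnorm(n))

# stats::cancor returns the canonical correlations in decreasing order
cc <- cancor(features, phenotype)
print(cc$cor[1])  # the first (largest) canonical correlation
```

The first canonical correlation is the largest correlation achievable between a linear combination of the features and a linear combination of the traits, so it is never smaller than any single feature-trait correlation.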

Separate evaluation

# individually
multi_expr_sep <- treecor_expr(count,
                               hierarchy_structure,
                               cell_meta,
                               sample_meta,
                               response_variable = c('severity','age'),
                               num_permutations = 100)$canonical_corr
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = multi_expr_sep,
                 response_variable = c('severity','age'),
                 separate = TRUE,
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.5,
                 nonleaf_point_gap = 0.2)

Joint evaluation

# jointly
multi_expr_joint <- treecor_expr(count,
                                 hierarchy_structure,
                                 cell_meta,
                                 sample_meta,
                                 response_variable = c('severity','age'),
                                 separate = FALSE,
                                 num_permutations = 100)$canonical_corr
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = multi_expr_joint,
                 response_variable = c('severity','age'),
                 separate = FALSE,
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.2,
                 nonleaf_point_gap = 0.1)
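To analyze differential gene expression, the multivariate phenotype is first converted into a univariate phenotype, either by principal component analysis or by a linear combination of the traits with user-specified weights. The transformed univariate phenotype combines information from multiple traits and is then used to run the differential expression analysis.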
# jointly
multi_deg_joint <- treecor_deg(count,
                               hierarchy_structure,
                               cell_meta,
                               sample_meta %>% mutate(severity = ifelse(severity == "HD", 0,1)),
                               response_variable = c('severity','age'),
                               separate = FALSE,
                               save_as_csv = FALSE)$dge.summary
head(multi_deg_joint)

##      label combined_phenotype.num_deg     x y id  leaf
## 1      All                          1  6.25 2  1 FALSE
## 2        B                          0  0.00 1  2 FALSE
## 3        T                        700  5.00 1  3 FALSE
## 4 Monocyte                          0 12.50 1  4 FALSE
## 5     B_c7                          0  0.00 0  5  TRUE
## 6    T_c15                          0  1.00 0  6  TRUE
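
The two ways of collapsing a multivariate phenotype into a single trait, as described before the code above, can be sketched in base R. The toy metadata and the weights below are hypothetical. How TreeCorTreat scales the traits or exposes the weights is not shown here, so this should be read only as an illustration of the idea.

```r
# Toy sample-level phenotypes (made-up values)
toy_meta <- data.frame(severity = c(0, 0, 0, 1, 1, 1),
                       age      = c(25, 40, 33, 61, 58, 70))

# Option 1: first principal component of the standardized traits
pca <- prcomp(toy_meta, center = TRUE, scale. = TRUE)
combined_pc1 <- pca$x[, 1]

# Option 2: linear combination of standardized traits with user-chosen weights (hypothetical)
weights <- c(severity = 0.7, age = 0.3)
combined_weighted <- as.numeric(scale(toy_meta) %*% weights)

print(round(combined_pc1, 3))
print(round(combined_weighted, 3))
```

The sign of a principal component is arbitrary, so the direction of the combined phenotype from option 1 should be checked against the original traits before the results are interpreted.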

Adjusting for covariates in TreeCorTreat analysis

TreeCorTreat is capable of handling covariates in the analysis, allowing users to adjust for potential confounders or unwanted technical variation. For example, instead of viewing age as a phenotype, one can also view it as a covariate and ask which cell type features are associated with disease severity after accounting for age.
# adjust for age
expr_adjusted <- treecor_expr(count,
                              hierarchy_structure,
                              cell_meta,
                              sample_meta,
                              response_variable = 'severity',
                              formula = '~age',
                              num_permutations = 100)$canonical_corr
# visualize
treecortreatplot(hierarchy_structure,
                 annotated_df = expr_adjusted,
                 response_variable = 'severity',
                 color_variable = 'p.sign',
                 size_variable = 'cancor',
                 alpha_variable = 'p.sign',
                 font_size = 12,
                 nonleaf_label_pos = 0.2,
                 nonleaf_point_gap = 0.1)
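
One common way to think about covariate adjustment is to regress the covariate out of both the feature and the phenotype and then correlate the residuals. The base R sketch below does this on simulated data. It illustrates the general idea only and is not a description of how TreeCorTreat handles the formula argument internally.

```r
set.seed(2)
n <- 40
age      <- rnorm(n, mean = 50, sd = 12)
severity <- 0.05 * age + rnorm(n)   # toy phenotype that depends on age
feature  <- 0.05 * age + rnorm(n)   # toy cell type feature that also depends on age

# The raw correlation mixes any direct association with the shared dependence on age
raw_cor <- cor(feature, severity)

# Regress age out of both variables, then correlate the residuals
adj_cor <- cor(resid(lm(feature ~ age)), resid(lm(severity ~ age)))

print(c(raw = raw_cor, adjusted = adj_cor))
```

The adjusted value is the partial correlation between the feature and severity given age.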

The visualizations are quite good and well worth borrowing for your own work.

Life is good, and it's even better with you.
