10X Single-Cell (10X Spatial Transcriptomics) Analysis: Learning and Visualizing Abstract Cellular Features and Data Groupings (Multiscale PHATE)

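Hello, it's Wednesday, the golden midpoint of the week. Time passes quickly and is worth cherishing. There is a line in Kung Fu Panda: "Quit, don't quit; noodles, don't noodles. You are too concerned with what was and what will be. There is a saying: yesterday is history, tomorrow is a mystery, but today is a gift." Cherish today the way you would cherish a gift.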

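Today I'd like to share Multiscale PHATE, a method that learns and visualizes abstract cellular features and data groupings across all levels of granularity. The paper, "Multiscale PHATE identifies multimodal signatures of COVID-19", was published in Nature Biotechnology (IF 54). It is a very good method and I recommend it; you may also want to read my earlier post, "10X Single-Cell Dimensionality Reduction with PHATE".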

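Background

  • Current tools for dimensionality reduction and data exploration, including t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), and principal component analysis (PCA), show only a single level of granularity of the data. (In practice they have other shortcomings as well.)
  • Multiscale PHATE is a method that learns and visualizes abstract cellular features and data groupings at all levels of granularity. The algorithm is built on a dynamic topological process called diffusion condensation, which slowly condenses data points toward local centers of gravity to form natural, data-driven groupings across granularities. By letting cells come together naturally over successive condensation steps, this coarse-graining process continuously learns the topology of the underlying dataset and makes it possible to explore a more continuous range of granularities than other methods reveal.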

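I have previously shared a comparison of the strengths and weaknesses of the various dimensionality-reduction methods; the slides are attached below for reference.

[Slides: comparison of dimensionality-reduction methods]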

Multiscale PHATE algorithm

Multiscale PHATE combines a data coarse-graining method called diffusion condensation with a manifold-preserving dimensionality-reduction method called PHATE to produce multi-granular visualizations and clusters of high-dimensional biological data. The Multiscale PHATE algorithm can be broken down into four steps:

  • 1. Compute a manifold-intrinsic diffusion potential representation that learns the nonlinear biological manifold, as done in PHATE (see my earlier post, "10X Single-Cell Dimensionality Reduction with PHATE").
  • 2. Coarse-grain this diffusion potential using a fast diffusion condensation process (that is, reduce the resolution).
  • 3. Select meaningful resolutions for downstream analysis with a gradient-based approach (after enough iterations, fine-grained cells are merged into single units).
  • 4. Visualize the condensed diffusion potential coordinates at selected scales via metric multidimensional scaling (MMDS), and analyze coarser-grained resolutions to obtain multiscale clusters.
Multiscale PHATE first creates a diffusion potential representation U of the raw data:
(1) First, a distance matrix is calculated between all cells based on their ambient measurements. The distance matrix is converted into an affinity matrix using an adaptive-bandwidth Gaussian kernel function, so that the similarity between two cells decreases exponentially with their distance. (This should feel familiar; it resembles building a KNN graph.)
(2) Next, the affinity matrix is row-normalized to obtain the diffusion operator P, representing the probability distribution of transitioning from one cell to another in a single step. This diffusion operator is raised to the power tD, the PHATE optimal diffusion timescale as computed by von Neumann entropy, to simulate a tD-step random walk over the data graph. (This part of the algorithm is still somewhat hard to follow.)
(3) Finally, by taking the logarithm of P^tD, we calculate the diffusion potential of the data.
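The three steps above can be sketched in NumPy as follows. This is a minimal illustration, not the PHATE implementation: the function name diffusion_potential, the knn-based bandwidth, the fixed t (PHATE selects tD via von Neumann entropy), the small eps, and the negative sign on the logarithm are all assumptions made for the sketch.

```python
import numpy as np

def diffusion_potential(X, knn=5, t=10, eps=1e-7):
    """Toy diffusion potential: distances -> affinities -> diffusion operator -> log."""
    # (1) Pairwise Euclidean distances between all cells (rows of X).
    sq = np.sum(X ** 2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    # Adaptive bandwidth (assumption): distance to each cell's knn-th neighbour.
    sigma = np.sort(D, axis=1)[:, knn]
    # Gaussian kernel with a per-cell bandwidth, then symmetrised.
    K = np.exp(-(D / sigma[:, None]) ** 2)
    K = (K + K.T) / 2
    # (2) Row-normalise to get the diffusion operator P, then walk t steps.
    P = K / K.sum(axis=1, keepdims=True)
    P_t = np.linalg.matrix_power(P, t)
    # (3) Log-transform the t-step transition probabilities.
    U = -np.log(P_t + eps)
    return P, U

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))      # synthetic stand-in for 50 cells x 10 features
P, U = diffusion_potential(X)
print(P.sum(axis=1)[:3], U.shape)
```

Each row of P sums to one, so it can be read as the one-step transition distribution of a random walk starting from that cell.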

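Previous work has shown that this internal representation computed in PHATE effectively learns the nonlinear geometry of complex biological datasets and can be quickly visualized in two or three dimensions using MMDS. (PHATE really does have its strengths as a dimensionality-reduction method.) Multiscale PHATE uses this diffusion potential representation as the substrate for its diffusion condensation process. As in the diffusion potential calculation, at each iteration diffusion condensation computes a diffusion operator P_t from the positions of the cells in diffusion potential space, this time with a fixed-bandwidth Gaussian kernel function. The fixed bandwidth sets the locality at which cell-cell affinities are computed. Applying the diffusion operator P_t to the diffusion potential U_t acts as a diffusion filter, effectively replacing the coordinates of each point with the weighted average of its diffusion neighbors. When the distance between two cells falls below a distance threshold, the cells are merged, indicating that they belong to the same cluster. This process is repeated iteratively until all cells have collapsed into a single cluster.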
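The condensation loop described above can be sketched as follows. It is a toy version under stated assumptions, not the package's code: 2-D points stand in for the diffusion potential, merging is done greedily, and the bandwidth is grown by 5% whenever an iteration produces no merge so that the loop is guaranteed to finish. The names condense, merge_eps and the growth factor are hypothetical.

```python
import numpy as np

def pairwise(points):
    """Euclidean distance matrix between the rows of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def condense(U, sigma=1.0, merge_eps=1e-2, max_iter=500):
    """Toy diffusion condensation; returns per-cell labels after every iteration."""
    points = np.asarray(U, dtype=float).copy()
    labels = np.arange(len(points))        # current group of every original cell
    history = [labels.copy()]
    for _ in range(max_iter):
        if len(points) == 1:
            break
        # Fixed-bandwidth Gaussian kernel, row-normalised into an operator.
        K = np.exp(-(pairwise(points) / sigma) ** 2)
        P = K / K.sum(axis=1, keepdims=True)
        # Diffusion filter: each point becomes a weighted average of its neighbours.
        points = P @ points
        # Greedily merge points that are closer than the threshold.
        D = pairwise(points)
        group = np.empty(len(points), dtype=int)
        reps = []
        for i in range(len(points)):
            for g, r in enumerate(reps):
                if D[i, r] < merge_eps:
                    group[i] = g
                    break
            else:
                group[i] = len(reps)
                reps.append(i)
        if len(reps) == len(points):
            sigma *= 1.05                  # assumption: widen the kernel on a stall
        points = np.array([points[group == g].mean(axis=0) for g in range(len(reps))])
        labels = group[labels]
        history.append(labels.copy())
    return history

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=(5.0, 0.0), scale=0.3, size=(20, 2))
history = condense(np.vstack([blob_a, blob_b]))
counts = [len(np.unique(h)) for h in history]
print(counts[0], counts[-1])
```

The list of labels per iteration is the condensation hierarchy: early entries are fine-grained, and the last entry places every cell in one cluster.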

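By denoising the diffusion potential, Multiscale PHATE addresses two shortcomings of the original diffusion condensation.
  • Diffusion condensation in its original form is not effective at learning or visualizing the nonlinear geometry of biological datasets, and it is prone to condensing points off the data manifold.
  • By first learning the nonlinear data manifold through the diffusion potential calculation and then feeding it into diffusion condensation, Multiscale PHATE not only learns the nonlinear geometry of complex datasets effectively but can also quickly visualize the data and learn clusters at the resolution of interest.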
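To identify meaningful scales, a gradient-based approach is applied to determine stable resolutions of the condensation process for downstream analysis. These resolutions are visualized by computing a potential distance matrix D_Ut from the distances between pairs of rows of U_t (that is, the distances between cells). Finally, MMDS is performed to preserve the distances in D_Ut in two or three dimensions, yielding the Multiscale PHATE visualization. In Multiscale PHATE we can therefore not only compute a coherent data topology along the data manifold but also quickly visualize intermediate layers of the condensation process. Using a stochastic block model with known clusters, the authors show that, as more and more noise is added to the model, diffusion condensation initialized with the diffusion potential outperforms diffusion condensation on the ambient measurement space.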
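To make the idea of "stable resolutions" concrete, here is a small sketch that looks for plateaus in the number of clusters per iteration. The selection rule (zero gradient, first iteration of each plateau) and the name stable_levels are assumptions chosen for illustration; the package's own gradient criterion may differ.

```python
import numpy as np

def stable_levels(n_clusters_per_iter, tol=0.0):
    """Return the first iteration of every plateau in the cluster-count curve."""
    counts = np.asarray(n_clusters_per_iter, dtype=float)
    grad = np.abs(np.gradient(counts))     # change in cluster count per iteration
    flat = np.flatnonzero(grad <= tol)     # iterations where nothing is merging
    if flat.size == 0:
        return flat
    # Keep only the first index of each run of consecutive flat iterations.
    is_start = np.insert(np.diff(flat) > 1, 0, True)
    return flat[is_start]

# Synthetic cluster counts from a hypothetical condensation run.
counts = [100, 100, 100, 60, 40, 40, 40, 40, 10, 5, 5, 5, 1]
levels = stable_levels(counts)
print(levels, [counts[i] for i in levels])
```

In this toy curve the plateaus correspond to 100, 40 and 5 clusters, which are the kinds of resolutions one would carry into downstream analysis.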

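More details on the generality, scalability, and reproducibility of Multiscale PHATE are shown in the figures below.

[Figures: generality, scalability, and reproducibility of Multiscale PHATE]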

Comparison between methods

  • Evaluation metrics: the adjusted Rand index (ARI) and F1 scores. (For ARI, see the article on computing the adjusted Rand index; for F1 scores, see the article on the F1-score in machine learning.)
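For readers who want to see what the two metrics compute, here is a small NumPy sketch of the standard definitions. The function names are my own, the F1 version is for binary labels only, and in practice one would typically use an established library implementation instead.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI from the contingency table of two labelings."""
    _, ti = np.unique(np.asarray(labels_true), return_inverse=True)
    _, pi = np.unique(np.asarray(labels_pred), return_inverse=True)
    table = np.zeros((ti.max() + 1, pi.max() + 1), dtype=int)
    np.add.at(table, (ti, pi), 1)
    # Count pairs of cells that share a cell of the table, a row, or a column.
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    expected = sum_rows * sum_cols / comb(len(ti), 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

def f1_binary(y_true, y_pred):
    """F1 score (harmonic mean of precision and recall) for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])   # same partition, renamed
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
f1 = f1_binary([1, 1, 0, 0], [1, 0, 1, 0])
print(same, crossed, f1)
```

ARI ignores the names of the clusters and only compares the partitions, which is why relabeled but identical clusterings score 1.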

Advantage 1 of Multiscale PHATE: it preserves both local and global distances. (For single-cell data, this sounds much like what UMAP is known for.)

Multiscale PHATE outperforms the other methods across almost the entire range of biological noise, and it has a clear advantage when visualizing highly noisy data. Across the authors' comparisons, Multiscale PHATE performed as well as or better than other visualization modalities, especially as noise increased within the dataset.


Multiscale clusters accurately captured established groupings of data.

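(1) Noisy synthetic data and two- and three-layer hierarchical stochastic block models

The paper also presents practical applications of the method, notably on SARS-CoV-2 single-cell data, where it worked very well; I won't go into those here.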

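Discussion (this algorithm really is quite hard)

The paper presents a multiscale data exploration technique for visualizing, clustering, and comparing large-scale datasets, filling a key gap in biological data exploration. Multiscale PHATE finds data groupings at different scales that are predictive of clinical outcome. Biological data naturally contain structure at multiple granularities. However, most analysis methods, whether clustering or dimensionality-reduction algorithms, focus on a single level of resolution and offer no systematic way to explore different scales. Hierarchical clustering is one approach that offers some range of resolution. However, because of the successive merging that occurs in hierarchical clustering methods (for example, Louvain), many levels of resolution are missed and biologically relevant levels of granularity are not recapitulated. By contrast, Multiscale PHATE offers a fast, manifold-learning-based technique that reveals a continuum of resolutions of structure and features by learning the data topology. The analysis shows that Multiscale PHATE can be combined with other techniques, such as MELD and mutual information (DREMI), to provide deep and detailed insight into biological processes. Together with Multiscale PHATE, these tools let users find resolutions that naturally capture significant differences between patients, separate pathogenic and protective cell subsets across scales, and identify key disease-associated markers.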

Methods (the painful math)

[Figures: mathematical details from the Methods section of the paper]

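Finally, let's look at the example code; it is available from the Multiscale PHATE repository (https://github.com/KrishnaswamyLab/Multiscale_PHATE).

The algorithm combines the dimensionality-reduction technique PHATE with the multi-granular analysis tool diffusion condensation. First, PHATE is used to compute the nonlinear diffusion manifold. Diffusion condensation then uses this manifold-intrinsic diffusion space to slowly condense data points toward local centers of gravity, forming natural, data-driven groupings across multiple granularities. These granularities can then be inspected.

Using gradient analysis, which tracks the change in data density across successive iterations of the diffusion condensation process, we can identify stable resolutions of the hierarchical tree for downstream analysis. With this stability information, we can cut the hierarchical tree at multiple resolutions to produce visualizations and clusters across granularities for downstream analysis.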
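Cutting the tree at a given resolution amounts to composing the merges that happened up to that iteration. The sketch below shows this with hand-written merge maps; the data structure (one array per iteration mapping each current group to its group in the next iteration) and the name labels_at_level are assumptions for illustration, not the package's internal representation.

```python
import numpy as np

def labels_at_level(merge_maps, n_cells, level):
    """Cluster label of every original cell after the first `level` merges."""
    labels = np.arange(n_cells)            # level 0: every cell is its own group
    for merge in merge_maps[:level]:
        labels = np.asarray(merge)[labels] # follow each cell to its new group
    return labels

# Hypothetical condensation history for six cells.
merge_maps = [
    [0, 0, 1, 2, 2, 3],   # iteration 1: 6 cells  -> 4 groups
    [0, 1, 1, 2],         # iteration 2: 4 groups -> 3 groups
    [0, 0, 0],            # iteration 3: everything merges
]
fine = labels_at_level(merge_maps, 6, 1)
coarse = labels_at_level(merge_maps, 6, 2)
root = labels_at_level(merge_maps, 6, 3)
print(fine, coarse, root)
```

Because every coarser labeling is derived from the finer one, clusters at different resolutions are always nested, which is what makes zooming in and out consistent.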


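By identifying multiple resolutions, Multiscale PHATE lets users interact with their data and zoom in on cell subsets of interest to reveal increasingly fine-grained information about cell types and subtypes. When combined with other computational algorithms for high-dimensional data analysis, such as MELD and DREMI, Multiscale PHATE can provide deep and detailed insight into biological processes.

Installation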

pip install --user git+https://github.com/KrishnaswamyLab/Multiscale_PHATE

Quick Start

import multiscale_phate
import scprep

# X is a cells x features data matrix (for example, normalized expression values)
mp_op = multiscale_phate.Multiscale_PHATE()
mp_embedding, mp_clusters, mp_sizes = mp_op.fit_transform(X)

# Plot optimal visualization
scprep.plot.scatter2d(mp_embedding, s = mp_sizes, c = mp_clusters,
                      fontsize=16, ticks=False,label_prefix="Multiscale PHATE", figsize=(16,12))

Step by step

Loading packages
import multiscale_phate as mp
import numpy as np
import pandas as pd

import scprep
import os
Example data: 10X PBMC data


## Save data directory
data_dir = os.path.expanduser("~/multiscale_phate_data") # enter path to data directory here (this is where you want to save 10X data)
if not os.path.isdir(data_dir):
    os.mkdir(data_dir)

file_name = '10X_pbmc_data.h5'
file_path = os.path.join(data_dir, file_name)

URL = 'https://cf.10xgenomics.com/samples/cell-exp/2.1.0/pbmc4k/pbmc4k_raw_gene_bc_matrices_h5.h5'

scprep.io.download.download_url(URL, file_path)

data = scprep.io.load_10X_HDF5(file_path, gene_labels='both')

data.head()
The result is a barcode × gene matrix.

Quality control and preprocessing
data = scprep.filter.filter_library_size(data, cutoff=1000,  keep_cells='above')
data = scprep.filter.filter_rare_genes(data)
data_norm, libsize = scprep.normalize.library_size_normalize(data, return_library_size=True)
data_sqrt = np.sqrt(data_norm)
data_sqrt.head()
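To show what these preprocessing calls do conceptually, here is a NumPy-only sketch on a synthetic count matrix. The cutoff, min_cells and rescale values are illustrative assumptions and are not guaranteed to match the scprep defaults; the toy Poisson data is not real 10X data.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.poisson(0.5, size=(200, 300)).astype(float)  # toy cells x genes counts

# Keep cells whose library size (total counts per cell) is above a cutoff.
libsize = counts.sum(axis=1)
cutoff = 140
kept = counts[libsize > cutoff]

# Drop rare genes: keep genes detected in at least min_cells cells.
min_cells = 5
kept = kept[:, (kept > 0).sum(axis=0) >= min_cells]

# Library-size normalisation: rescale every cell to the same total count.
rescale = 10_000
norm = kept / kept.sum(axis=1, keepdims=True) * rescale

# Square-root transform to damp the influence of highly expressed genes.
sqrt_data = np.sqrt(norm)
print(kept.shape, norm.sum(axis=1)[:3])
```

After normalisation every cell has the same total, so differences between cells reflect relative expression and not sequencing depth.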

Creating multi-resolution embeddings and clusters with Multiscale PHATE

Computing Multiscale PHATE tree involves two successive steps:

  • Building the Multiscale PHATE operator
  • Fitting your data with the operator to construct a diffusion condensation tree and running gradient analysis to identify stable resolutions for downstream analysis

Here we set the random_state to enhance reproducibility.

mp_op = mp.Multiscale_PHATE(random_state=1)
levels = mp_op.fit(data_sqrt)

In order to identify salient levels of the diffusion condensation tree, we can visualize the output of our gradient analysis and highlight stable resolutions for downstream analysis:

import matplotlib.pyplot as plt
ax = plt.plot(mp_op.gradient)
ax = plt.scatter(levels, mp_op.gradient[levels], c = 'r', s=100)

Visualizing full Diffusion Condensation tree

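Because diffusion condensation creates a hierarchy of cells and clusters, it is useful to visualize this tree and map iterations or cluster labels onto it. We first build the tree with the build_tree() function: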

### building tree

tree = mp_op.build_tree()

scprep.plot.scatter3d(tree, s= 50,
                      fontsize=16, ticks=False, figsize=(10,10))

It can also be useful to color the tree with various labels, such as diffusion condensation iteration and by a particular layer of the tree. Since the tree is effectively a series of stacked 2D condensed points, coloring the tree by the third column will color each point by its corresponding iteration:

scprep.plot.scatter3d(tree, c = tree[:,2], s= 50,
                      fontsize=16, ticks=False, figsize=(10,10))
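The "stacked 2D points" layout can be pictured with a small synthetic array. This sketch only mimics the layout described above (two embedding coordinates plus an iteration index in the third column); the random layers are made up and are not the output of build_tree().

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 2-D positions of the condensed points at three iterations.
layers = [rng.normal(size=(n, 2)) for n in (8, 4, 1)]

# Stack the layers and append the iteration index as a third column.
tree_like = np.vstack([
    np.column_stack([xy, np.full(len(xy), it)])
    for it, xy in enumerate(layers)
])
print(tree_like.shape)
print(tree_like[:, 2])
```

Coloring by the third column therefore paints each layer of the tree with its own iteration number.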

In order to color the tree by clusters found at a particular granularity of the Diffusion Condensation tree, we simply pass a resolution identified by the gradient analysis to the .get_tree_clusters() function and color our tree embedding with the result. Play around with the clustering level you pass to the .get_tree_clusters() function and see what happens:

tree_clusters = mp_op.get_tree_clusters(levels[9])

scprep.plot.scatter3d(tree, c = tree_clusters, s= 50,
                      fontsize=16, ticks=False, figsize=(10,10))

Visualizing Coarse Granularity

Now we are ready to produce an initial coarse embedding of the dataset. When running the .transform() function, we select a coarse resolution for our clusters (levels[9], which is iteration 136 in this example) and a finer resolution for our embedding (levels[2], which is iteration 53 in this example). By modifying the resolutions passed to the .transform() function, we can change the granularity of the visualization and clusters, producing coarser or finer embeddings and groupings of the data. We recommend playing around with these resolutions.

coarse_embedding, coarse_clusters, coarse_sizes = mp_op.transform(visualization_level = levels[2],cluster_level = levels[9])
scprep.plot.scatter2d(coarse_embedding, s = 100*np.sqrt(coarse_sizes), c = coarse_clusters,
                      fontsize=16, ticks=False,label_prefix="Multiscale PHATE", figsize=(10,8))

Next, we can assign clusters to cell types by mapping and visualizing the expression of key marker genes for T cells (CD3E), B cells (CD19) and monocytes (CD14) on our coarse embedding. To run this mapping we call the .get_expression() function, passing the full expression vector from single cells as well as the resolution of the visualization.

We would like to note that you can perform MELD (Burkhardt et al. 2021) at this resolution as well by running get_expression() on a binarized perturbation signal [0,1] that denotes the perturbation of origin for a given cell.

coarse_expression = pd.DataFrame()
coarse_expression['CD3E'] = mp_op.get_expression(data_sqrt['CD3E (ENSG00000198851)'].values,
                                                 visualization_level =  levels[2])
coarse_expression['CD19'] = mp_op.get_expression(data_sqrt['CD19 (ENSG00000177455)'].values,
                                                 visualization_level =  levels[2])
coarse_expression['CD14'] = mp_op.get_expression(data_sqrt['CD14 (ENSG00000170458)'].values,
                                                 visualization_level =  levels[2])
fig, axes = plt.subplots(1,3, figsize=(14, 4))

genes = ['CD3E', 'CD19', 'CD14']

for i, ax in enumerate(axes.flatten()):
    scprep.plot.scatter2d(coarse_embedding, s = 25*np.sqrt(coarse_sizes),
                          c=coarse_expression[genes[i]], legend_anchor=(1,1), ax=ax, title=genes[i],
                          xticks=False, yticks=False, label_prefix="PHATE", fontsize=16, cmap = 'RdBu_r')

fig.tight_layout()
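Conceptually, mapping a per-cell signal onto a coarse embedding means summarizing that signal over the cells inside each condensed point. The sketch below uses a plain mean per group; this is an assumption about the idea, not a description of how get_expression() is implemented, and aggregate_by_cluster is a hypothetical helper.

```python
import numpy as np

def aggregate_by_cluster(values, cluster_ids):
    """Mean of a per-cell signal within each cluster, plus the cluster sizes."""
    uniq, inverse = np.unique(np.asarray(cluster_ids), return_inverse=True)
    sizes = np.bincount(inverse)
    sums = np.bincount(inverse, weights=np.asarray(values, dtype=float))
    return uniq, sums / sizes, sizes

cluster_ids = [7, 7, 2, 2, 5, 5]            # toy assignment of six cells
expression = [1, 3, 0, 0, 4, 2]             # toy expression of one gene
uniq, mean_expr, sizes = aggregate_by_cluster(expression, cluster_ids)

# A binarized perturbation signal averages to the fraction of perturbed cells.
perturbed = [1, 1, 0, 0, 1, 0]
_, frac_perturbed, _ = aggregate_by_cluster(perturbed, cluster_ids)
print(uniq, mean_expr, sizes, frac_perturbed)
```

The same aggregation explains why a [0,1] perturbation label becomes a continuous value per condensed point: it is the share of cells in that point that came from the perturbed condition.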

Visualizing Fine Granularity

Next, multiscale PHATE allows users to 'zoom in' on populations of interest and perform finer grained analysis using the .transform() and .get_expression() functions.

Using the .transform() function for this can get a little confusing. Essentially, we select a coarse resolution of clusters (coarse_cluster_level) and then a cluster of interest to zoom in on at this resolution (coarse_cluster). Then, we can embed this population at a finer resolution (visualization_level as before) and with a finer resolution of clusters (cluster_level). Again, please play around with each of these parameters to embed different clusters across granularities:

zoom_embedding, zoom_clusters, zoom_sizes =  mp_op.transform(visualization_level=levels[1],
                                                             cluster_level=levels[2],
                                                             coarse_cluster_level=levels[9],
                                                             coarse_cluster=8)


scprep.plot.scatter2d(zoom_embedding, s = 500*np.sqrt(zoom_sizes), c = zoom_clusters,
                      fontsize=16, ticks=False,label_prefix="Multiscale PHATE", figsize=(10,8))

Next, we can identify the identities of subpopulations of interest by mapping the expression of known markers. This is done using the get_expression() function but, as with the .transform() function, we also have to pass coarse_cluster_level and coarse_cluster to indicate which population we intend to zoom in on.

In this case, we zoom into B cells and map the expression of key genes to identify B cell subpopulations - CD19 for Naive B cells, CD20 (gene name MS4A1) for Activated B cells and CD27 for Memory B cells:

fine_expression = pd.DataFrame()
fine_expression['CD19'] = mp_op.get_expression(data_sqrt['CD19 (ENSG00000177455)'].values,
                                                 visualization_level =  levels[1],
                                                 coarse_cluster_level=levels[9],
                                                 coarse_cluster=8)
fine_expression['CD27'] = mp_op.get_expression(data_sqrt['CD27 (ENSG00000139193)'].values,
                                                 visualization_level =  levels[1],
                                                 coarse_cluster_level=levels[9],
                                                 coarse_cluster=8)
fine_expression['CD20'] = mp_op.get_expression(data_sqrt['MS4A1 (ENSG00000156738)'].values,
                                                 visualization_level =  levels[1],
                                                 coarse_cluster_level=levels[9],
                                                 coarse_cluster=8)


fig, axes = plt.subplots(1,3, figsize=(14, 4))

genes = ['CD19','CD27','CD20']

for i, ax in enumerate(axes.flatten()):
    scprep.plot.scatter2d(zoom_embedding, s = 50*np.sqrt(zoom_sizes),
                          c=fine_expression[genes[i]], legend_anchor=(1,1), ax=ax, title=genes[i],
                          xticks=False, yticks=False, label_prefix="PHATE", fontsize=16, cmap = 'RdBu_r')

fig.tight_layout()

Life is good, and it's even better with you.
