hello ,童鞋们,今天我们来分享一个有关拟时分析的软件-----VIA,文章在VIA: Generalized and scalable trajectory inference in single-cell omics data,这个方法我认为很经典,值得大家学习,我们还是先分享文献,最后示例参考代码。
Abstract
Inferring cellular trajectories using a variety of omic data is a critical task in single-cell data science.However, accurate prediction of cell fates, and thereby biologically meaningful discovery, is challenged by the sheer size of single-cell data, the diversity of omic data types, and t he complexity of their topologies(这个地方大家应该深有感触,我们通常用的拟时分析软件都无法准确的得到轨迹发生的真相,参数一调,结果完全不一样了). We present VIA, a scalable trajectory inference algorithm that overcomes these limitations by using lazy-teleporting random walks(这个方法也是第一次听说) to accurately reconstruct complex cellular trajectories beyond tree-like pathways (e.g. cyclic or disconnected structures). We show that VIA robustly and efficiently unravels the fine-grained sub-trajectories in a 1.3-million-cell(通量确实很高啊) transcriptomic mouse atlas without losing the global connectivity at such a high cell count(能够保留整体分布结构). We further apply VIA to discovering elusive lineages and less populous cell fates missed by other methods across a variety of data types(少量稀有细胞的轨迹发生,这个很重要), including single-cell proteomic, epigenomic, multi-omics datasets, and a new in-house single-cell morphological dataset。
Background
Single-cell omics data captures snapshots of cells t hat catalog cell types and molecular states with high precision.但是目前的软件算法面临4个挑战:
(1)it remains difficult to accurately reconstruct high-resolution cell trajectories and detect the pertinent cell fates and lineages without relying on prior knowledge of input parameter settings(先验知识和参数调整确实是目前拟时分析的痛点,一般做轨迹分析都需要先定义才可以,而且参数的影响非常大),This is a foundational but unmet attribute of trajectory inference (TI) that could make lineage prediction less biased towards input parameters, and thus minimize the confounding factors that impact the underlying hypothesis testing ,However, even the few algorithms which automate cell fate detection (e.g., SlingShot , Palantir and Monocle3) exhibit low sensitivity to cell fates and are highly susceptible to changes in input parameters(这个也是通病,整体结构保持的前提下,稀有细胞类型就被“合并”了,做过的都知道)。
(2)Second, current trajectory inference (TI) methods predominantly work well o n tree-like trajectories (e.g. Slingshot), but lack the generalisability to infer disconnected, cyclic or hybrid topologies without imposing restrictions on transitions and causality (这个也是痛点,听过我对于细胞轨迹分析课程的同学来说应该都知道这一点,可参考文章10X单细胞轨迹分析之回顾).This attribute is crucial in enabling unbiased discovery of complex trajectories which are commonly not well known a priori, especially given the increasing diversity of single-cell omic datasets.(说白了就是对于复杂轨迹的判断仍然是一个大问题)。
(3)Third, the growing scale of single-cell data,exceeds the existing T I capacity, not just in runtime and memory, but in preserving both the fine-grain resolution of the embedded trajectories and the global connectivity among them(这个也是问题,典型如monocle2,过多的细胞无法运算)。such global information is lost in current TI methods after extensive dimension reduction or subsampling.(无法很好的运算当然就是结构的丢失)。
(4)fueling the advance in single-cell technologies is the ongoing pursuit to understand cellular heterogeneity from a broader perspective beyond transcriptomics(技术上的需求)。数据的多样化确实带来了这样的问题。However, the applicability of TI to a broader spectrum of single-cell data has yet to be fully exploited。
于是作者新开发了一个方法,VIA,reconstruct cell lineages based on lazy-teleporting random walks integrated with Markov chain Monte Carlo (MCMC) refinement,当然,后面说了很多软件的有点。注意一下这里allows capture of cellular trajectories not only in multi-furcations and trees, but also in disconnected and cyclic topologies.可以说是巨大的进步。
有点除了上面这个以外,还有robust to a wide r ange of pre-processing and input algorithmic parameters,allow VIA to consistently identify pertinent lineages that remain elusive。后面有一些数据上的运用。resilience in handling the wide disparity in single-cell data size structure a nd dimensionality a cross modalities.(通量很大,这个很可以),VIA is highly scalable with respect to number of cells (102 to 106 cells) and f eatures, without requiring extensive d imensionality reduction or subsampling which compromise global information.(这个通量真的是相当可以),当然还有一些新技术上的运用。
Result
Algorithm
我们来看看算法过程:
(1)Step 1 : Single-cell l evel g raph i s c lustered s uch t hat e ach node represents a c luster of single cells ( computed by our clustering algorithm PARC) . The resulting cluster graph forms the basis for subsequent random walks(随机游走这个概念不知道大家听过多少,这里就是为了聚类)。
(2)Step 2 : 2-stage pseudotime computation: (i) The pseudotime ( relative to a user defined start cell) is first computed by the expected hitting time for a lazy-teleporting random walk a long an undirected graph. At each step, the walk ( with small probability) can remain ( orange arrows) or teleport ( red arrows) to any other state. (ii) Edges are then forward biased based on the expected hitting time ( See forward biased edges illustrated as the imbalance of double-arrowhead size). The pseudotime is further refined on the directed graph by running Markov chain Monte Carlo ( MCMC) simulations ( See 3 highlighted paths starting at root).(这个地方其实运用到了一些算法,比较复杂,但是直观来看更加科学)。
(3)Step 3 : Consensus vote on terminal states based on vertex connectivity properties of the directed graph(准确检测末端细胞,这个大部分软件都无法做到)。
(4)Step 4 : lineage likelihoods computed as the visitation frequency under lazy-teleporting MCMC simulations.(构建轨迹)。
(5)可视化和下游分析。
我们来详细分析一下
VIA first represents t he single-cell data as a cluster graph(这个地方的聚类信息需要我们用其他软件来做),computed by our recently developed data-driven community-detection algorithm,PARC, which allows scalable clustering whilst preserving global properties of the t opology needed for accurate TI(当然不必非要用到这个软件推荐的方法)。
The cell fates and their lineage pathways are then c omputed by a two-stage probabilistic method(概率方法),which is the key algorithmic contribution of this work In the first stage o f Step 2, VIA models the cellular process as a modified random walk that allows degrees of laziness (remaining at a node/state) and teleportation (jumping to a ny other node/state) with pre-defined probabilities.(这个地方其实会构建更加复杂的轨迹结构,有局部,有跳跃,有循环)。The pseudotime, and thus the graph directionality, can be computed based on the theoretical hitting times of nodes。(轨迹图标方向性的构建)。The lazy-teleporting behavior prevents the expected hitting time from converging(汇聚) to a local distribution in the graph as otherwise occurs in regular random walks, especially when the sample size grows。(这个措施其实还是很需要的)。More specifically, the laziness and teleportation factors regulate the weights given to each eigenvector-value pair in the expected hitting time formulation(配方,公式,表达方式) such that the stationary distribution (given by the local-node degree-properties in regular walks) does not overwhelm the global information provided by other ‘eigen-pairs’.(不知道大家懂这里在说什么么???需要一些数学的基础知识)。Moreover, the computation does not require subsetting the first k eigenvectors (bypassing the need for the user to select a suitable threshold or subset of eigenvectors) since the dimensionality is not on the order of number of cells, but is equal to the number of clusters(做过单细胞的同学这个地方应该很熟悉吧)。Hence all eigenvalue-eigenvector pairs can be incorporated without causing a bottleneck in runtime.Consequently in VIA, the modified walk on a cluster-graph not only enables scalable pseudotime computation for large datasets in terms of runtime, but also preserves information about t he global neighborhood relationships within the graph.(看来能处理大规模的细胞量是有原因的,能够保持全局结构也是采用了特征向量)。In the second stage of Step 2, VIA infers the directionality of the graph by biasing the edge-weights with the initial pseudotime computations, and refines the pseudotime through lazy-teleporting MCMC simulations on the forward biased graph.(这是结果了)。
the MCMC-refined graph-edges of the lazy-teleporting r andom walk enable accurate predictions of terminal cell fates through a consensus vote of various vertex connectivity properties derived from the directed graph.(稀有细胞的检测)。The cell fate predictions obtained using this approach are more accurate and robust to changes in input d ata and parameters compared to other TI methods(稳定性和准确性更高)。Trajectories towards identified terminal states are then r esolved using lazy-teleporting MCMC simulations。Together, these four steps facilitate holistic(整体的) topological visualization of TI on the single-cell level (e.g. using UMAP or PHATE(注意这里,个人推荐使用PHATE,可以参考文章10X单细胞降维分析之PHATE))以及一些下游的分析。
理论部分全是精华,需要很深的基础才可以理解,大家继续加油。
VIA accurately captures complex topologies obscured in other TI methods
我们来看看方法之间的比较
In a 4 -leaf multifurcation topology,当然这个数据是模拟的(下图)
VIA accurately captures the two cascading bifurcations which lead to 4 leaf nodes。
In particular, VIA detects the elusive ‘M2’ terminal s tate whereas other methods (Palantir, PAGA, Slingshot a nd Monocle3) merge it with the ‘M8’ lineage (末端的检测更为准确,但这里是作者自己测的,可靠性性如何需要我们自己的检验)。Monocle3 typically only captures a single bifurcation and thus merges the pairs of leaves that otherwise arise from t he second layer of bifurcation (monocle3这个软件局限性特别大,不太建议大家使用)。
Even for the fairly simple cyclic topology,(下图):
other methods tend to fragment the structure to varying degrees depending on the parameter choice whereas VIA consistently preserves the global cyclic structure. (其实环状结构的发育轨迹我们能遇到的非常少,能够检测的软件就更少了)。This is not to say VIA is invariant to parameter choice, but rather that VIA p redictably modulates the graph resolution across a wide range of K without disrupting the underlying global topology(see the increase in the number of nodes in K=30 versus K=5) (全局的判断更为准确)This characteristic is important for robustly analyzing multiple levels of resolution in complex graph topologies, as also shown in our later investigation of the 1.3-million-cell mouse atlas.(这个是作者的观点,个人非常赞同)。We quantify graph-edge accuracy in the cyclic and disconnected datasets by identifying false/true positive/negative edges relative to the reference truth in order to compute an F1-score The performance comparison for the disconnected hybrid topologies shows that VIA disentangles the cyclic and b ifurcating lineages and captures the key leaf-states in the bifurcation as well as the ‘tail’ extending from the cyclic topology Palantir overly fragments the two trajectories, whereas Monocle3 and Slingshot merge them.(其实这种情况才是我们通常会遇到的情况,然而我们一般采用monocle2的方法把轨迹拟合到了一起,得到的结果其实是错误的)。
上面采用的数据是模拟数据,不能表示在真正运用的过程中就会非常好。
VIA reveals rare lineages in epigenomic and transcriptomic l andscapes of human hematopoiesis.(第一个实例)。
检测到了典型的分化结构,The automated detection of these terminal states in VIA, as quantified by F1-scores on the annotated cells, remains robust to varying the number of neighbors in the KNN graph, and the number of principal components (PCs)(上面图C)。Specifically, VIA’s sustained sensitivity to rarer cell types can be attributed t o a better underlying graph structure w here nodes are well delineated by PARC (as rare cell types are well separated by graph pruning in the clustering stage) and edges are not prematurely removed due to restrictions on causality(作者把轨迹结构的准确性归功于PRPC的降维)。
后面和其它软件的比较,不用想肯定VIA最好,不然发不出来。
后面的实际运用,我们简单看看吧
VIA detects small endocrine Delta lineages and Beta subtypes
这部分的结果主要是对稀有细胞群检测的准确性。
To show that this is not a co-incidence of parameter choice, we verify that these populations can be identified for a wide range of chosen highly variable genes (HVGs) and number of PCs。
VIA recovers Isl1+ cardiac progenitor bifurcation in multi-omic data(多组学数据,这里主要就是scRNA和scATAC)。
VIA preserves global connectivity when scaling to millions of cells(这部分讲扩展性示例)
当然,软件采用的主成分进行推断轨迹,这一点是应该的。
VIA’s lazy-teleporting MCMCs delineate mesoderm differentiation i n m ass cytometry data
结果依旧良好。
VIA captures morphological trends of live cells in cell cycle progression(细胞周期的分化)
这部分的个性化分析很高。
最后,我们来看看代码
import pyVIA.core as via
import numpy as np
import pandas as pd
import scanpy as sc
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# 5780 cells x 14651 genes Human Replicate 1. Male african american, 38 years
ad = sc.read( '/home/shobi/Trajectory/Datasets/HumanCD34/human_cd34_bm_rep1.h5ad')
# we use annotations given by SingleR which uses Novershtern Hematopoietic cell data as the reference. These are in line
# with the annotations given by Setty et al., 2019 but offer a slightly more granular annotation.
nover_labels = pd.read_csv('/home/shobi/Trajectory/Datasets/HumanCD34/Nover_Cor_PredFine_notLogNorm.csv')['x'].values.tolist()
dict_abb = {'Basophils': 'BASO1', 'CD4+ Effector Memory': 'TCEL7', 'Colony Forming Unit-Granulocytes': 'GRAN1',
'Colony Forming Unit-Megakaryocytic': 'MEGA1', 'Colony Forming Unit-Monocytes': 'MONO1',
'Common myeloid progenitors': "CMP", 'Early B cells': "PRE_B2", 'Eosinophils': "EOS2",
'Erythroid_CD34- CD71+ GlyA-': "ERY2", 'Erythroid_CD34- CD71+ GlyA+': "ERY3",
'Erythroid_CD34+ CD71+ GlyA-': "ERY1", 'Erythroid_CD34- CD71lo GlyA+': 'ERY4',
'Granulocyte/monocyte progenitors': "GMP", 'Hematopoietic stem cells_CD133+ CD34dim': "HSC1",
'Hematopoietic stem cells_CD38- CD34+': "HSC2",
'Mature B cells class able to switch': "B_a2", 'Mature B cells class switched': "B_a4",
'Mature NK cells_CD56- CD16- CD3-': "Nka3", 'Monocytes': "MONO2",
'Megakaryocyte/erythroid progenitors': "MEP", 'Myeloid Dendritic Cells': 'mDC (cDC)', 'Naïve B cells': "B_a1",
'Plasmacytoid Dendritic Cells': "pDC", 'Pro B cells': 'PRE_B3'}
#NOTE: Myeloid DCs are now called Conventional Dendritic Cells cDCs
nover_labels = [dict_abb[i] for i in nover_labels]
for i in list(set(nover_labels)):
print('the population of ', i, 'is ', nover_labels.count(i))
tsnem = ad.obsm['tsne']
adata_counts = sc.AnnData(ad.X) # ad.X is filtered, lognormalized,scaled// ad.raw.X is the filtered but not pre-processed
adata_counts.obs_names = ad.obs_names
adata_counts.var_names = ad.var_names
sc.tl.pca(adata_counts, svd_solver='arpack', n_comps=200)
true_label = nover_labels
ncomps=100#80
knn=30
v0_random_seed=4
v0 = via.VIA(adata_counts.obsm['X_pca'][:, 0:ncomps], true_label, jac_std_global=0.15, dist_std_local=1, knn=knn,
too_big_factor=0.3,
root_user=4823, dataset='humanCD34', preserve_disconnected=True, random_seed=v0_random_seed,
do_magic_bool=True, is_coarse=True,pseudotime_threshold_TS=20, neighboring_terminal_states_threshold=3) # *.4 root=1,
v0.run_VIA()
super_labels = v0.labels
df_ = pd.DataFrame(ad.X)
df_.columns = [i for i in ad.var_names]
print('start magic')
gene_list_magic = ['IL3RA', 'IRF8', 'GATA1', 'GATA2', 'ITGA2B', 'MPO', 'CD79B', 'SPI1', 'CD34', 'CSF1R', 'ITGAX']
df_magic = v0.do_magic(df_, magic_steps=3, gene_list=gene_list_magic)
df_magic_cluster = df_magic.copy()
df_magic_cluster['parc'] = v0.labels
df_magic_cluster = df_magic_cluster.groupby('parc', as_index=True).mean()
print('end magic', df_magic.shape)
f, ((ax, ax1)) = plt.subplots(1, 2, sharey=True, figsize = [20,10])
v0.draw_piechart_graph(ax, ax1, type_pt='gene', gene_exp=df_magic_cluster['GATA1'].values, title='GATA1')
tsi_list = via.get_loc_terminal_states(v0, adata_counts.obsm['X_pca'][:, 0:ncomps])
v1 = via.VIA(adata_counts.obsm['X_pca'][:, 0:ncomps], true_label, jac_std_global=0.15, dist_std_local=1, knn=knn,
too_big_factor=0.05, super_cluster_labels=super_labels, super_node_degree_list=v0.node_degree_list,
super_terminal_cells=tsi_list, root_user=4823,
x_lazy=0.95, alpha_teleport=0.99, dataset='humanCD34', preserve_disconnected=True,
super_terminal_clusters=v0.terminal_clusters, is_coarse=False, full_neighbor_array=v0.full_neighbor_array,
ig_full_graph=v0.ig_full_graph, full_distance_array=v0.full_distance_array,
csr_array_locally_pruned=v0.csr_array_locally_pruned,
random_seed=v0_random_seed,pseudotime_threshold_TS=20, neighboring_terminal_states_threshold=3) # *.4super_terminal_cells = tsi_list #3root=1,
v1.run_VIA()
#suppose you had a very large dataset, then you can use this to subsample before visualization
labels = v1.labels
idx = np.random.randint(len(labels), size=len(labels))
super_clus_ds_PCA_loc = via.sc_loc_ofsuperCluster_PCAspace(v0, v1, idx)
embedding = tsnem
via.draw_trajectory_gams(embedding[idx], super_clus_ds_PCA_loc, cluster_labels = np.asarray(labels)[idx], super_cluster_labels=np.asarray(v0.labels)[idx], super_edgelist= v0.edgelist_maxout,
x_lazy=v1.x_lazy, alpha_teleport=v1.alpha_teleport, projected_sc_pt=np.asarray(v1.single_cell_pt_markov)[idx], true_label=np.asarray(true_label)[idx], knn=v0.knn,
final_super_terminal=v1.revised_super_terminal_clusters,
sub_terminal_clusters=v1.terminal_clusters,
title_str='Hitting times: Markov Simulation on biased edges', ncomp=ncomps)
# DRAW EVOLUTION PATHS
knn_hnsw = via.make_knn_embeddedspace(embedding[idx])
via.draw_sc_evolution_trajectory_dijkstra(v1, embedding[idx], knn_hnsw, v0.full_graph_shortpath, idx)
plt.show()
gene_name_dict = {'GATA1': 'GATA1', 'GATA2': 'GATA2', 'ITGA2B': 'CD41 (Mega)', 'MPO': 'MPO (Mono)',
'CD79B': 'CD79B (B)', 'IRF8': 'IRF8 (DC)', 'SPI1': 'PU.1', 'CD34': 'CD34',
'CSF1R': 'CSF1R (cDC Up. Up then Down in pDC)', 'IL3RA': 'CD123 (pDC)', 'IRF4': 'IRF4 (pDC)',
'ITGAX': 'ITGAX (cDCs)', 'CSF2RA': 'CSF2RA (cDC)'}
for gene_name in ['ITGA2B', 'IL3RA', 'IRF8', 'MPO', 'CSF1R', 'GATA2', 'CD79B',
'CD34']: # ['GATA1', 'GATA2', 'ITGA2B', 'MPO', 'CD79B','IRF8','SPI1', 'CD34','CSF1R','IL3RA','IRF4', 'CSF2RA','ITGAX']:
print('gene name', gene_name)
# DC markers https://www.cell.com/pb-assets/products/nucleus/nucleus-phagocytes/rnd-systems-dendritic-cells-br.pdf
loc_gata = np.where(np.asarray(ad.var_names) == gene_name)[0][0]
subset_ = df_magic[gene_name].values
# print('shapes of magic_ad 1 and 2', magic_ad.shape,subset_.shape)
# v1.get_gene_expression(magic_ad,title_gene = gene_name_dict[gene_name])
v1.get_gene_expression(subset_, title_gene=gene_name_dict[gene_name])
主要的资料都在这里VIA.
生活很好,有你更好