动力学模型
在这里,我们使用通用动力学模型来解释完整的转录动态。
这产生了一些额外的见解,如潜在时间和假定驱动基因的识别。
与以前的教程一样,应用胰腺内分泌发育数据集来展示。
[ ]:
# update to the latest version, if not done yet.
!pip install scvelo --upgrade --quiet
[1]:
import scvelo as scv
scv.logging.print_version()
Running scvelo 0.2.0 (python 3.8.2) on 2020-05-15 00:27.
[2]:
scv.settings.verbosity = 3 # show errors(0), warnings(1), info(2), hints(3)
scv.settings.presenter_view = True # set max width size for presenter view
scv.settings.set_figure_params('scvelo') # for beautified visualization
准备数据
处理包括基因选择、按总大小标准化、log X 和计算速率估计时刻。有关进一步解释,请参阅以前的教程。
[3]:
adata = scv.datasets.pancreas()
[4]:
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
Filtered out 20801 genes that are detected in less than 20 counts (shared).
Normalized count data: X, spliced, unspliced.
Logarithmized X.
computing neighbors
finished (0:00:03) --> added
'distances' and 'connectivities', weighted adjacency matrices (adata.obsp)
computing moments based on connectivities
finished (0:00:00) --> added
'Ms' and 'Mu', moments of spliced/unspliced abundances (adata.layers)
动力学模型
我们运行动力学模型来学习剪切动力的完整转录动力学。
它基于可能性的期望最大化框架,通过反复估计反应速率和潜在细胞特异变量的参数,即转录状态和细胞内部潜在时间,旨在了解每个基因的未剪切/剪切相轨迹。
[5]:
scv.tl.recover_dynamics(adata)
recovering dynamics
finished (0:13:31) --> added
'fit_pars', fitted parameters for splicing dynamics (adata.var)
[6]:
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
computing velocities
finished (0:00:04) --> added
'velocity', velocity vectors for each individual cell (adata.layers)
computing velocity graph
finished (0:00:08) --> added
'velocity_graph', sparse matrix with cosine correlations (adata.uns)
运行动力学模型可能需要一段时间。因此,可以存储结果以供重复使用。
[7]:
#adata.write('data/pancreas.h5ad', compression='gzip')
#adata = scv.read('data/pancreas.h5ad')
[8]:
scv.pl.velocity_embedding_stream(adata, basis='umap')
computing velocity embedding finished (0:00:00) --> added 'velocity_umap', embedded velocity vectors (adata.obsm)
动力率参数
RNA转录、拼接和降解的速率在不需要任何实验数据的情况下进行估计。
它们有助于更好地了解细胞身份和表型异质性。
[9]:df = adata.vardf = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]kwargs = dict(xscale='log', fontsize=16)with scv.GridSpec(ncols=3) as pl: pl.hist(df['fit_alpha'], xlabel='transcription rate', **kwargs) pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs) pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)scv.get_df(adata, 'fit*', dropna=True).head()
[9]:
fit_r2 | fit_alpha | fit_beta | fit_gamma | fit_t_ | fit_scaling | fit_std_u | fit_std_s | fit_likelihood | fit_u0 | fit_s0 | fit_pval_steady | fit_steady_u | fit_steady_s | fit_variance | fit_alignment_scaling | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
index | ||||||||||||||||
Sntg1 | 0.401981 | 0.015726 | 0.005592 | 0.088792 | 23.404254 | 42.849447 | 1.029644 | 0.030838 | 0.406523 | 0.0 | 0.0 | 0.159472 | 2.470675 | 0.094304 | 0.149138 | 5.355590 |
Sbspon | 0.624803 | 0.464865 | 2.437113 | 0.379645 | 3.785993 | 0.154771 | 0.058587 | 0.178859 | 0.252135 | 0.0 | 0.0 | 0.182088 | 0.164805 | 0.430623 | 0.674312 | 1.193015 |
Mcm3 | 0.292389 | 3.096367 | 39.995796 | 0.638543 | 2.049463 | 0.013943 | 0.016253 | 0.673142 | 0.228207 | 0.0 | 0.0 | 0.467683 | 0.051432 | 1.927742 | 0.687468 | 0.887607 |
Fam135a | 0.384662 | 0.172335 | 0.118088 | 0.204538 | 11.239574 | 1.124040 | 0.356525 | 0.149868 | 0.283343 | 0.0 | 0.0 | 0.387921 | 1.345830 | 0.393197 | 0.671096 | 3.390194 |
Adgrb3 | 0.384552 | 0.046828 | 0.006750 | 0.196856 | 6.992542 | 71.850736 | 2.153206 | 0.030417 | 0.250195 | 0.0 | 0.0 | 0.068851 | 5.214500 | 0.093570 | 0.556878 | 1.893389 |
估计的基因特定参数包括转录比率(fit_alpha)、拼接率(fit_beta)、降解率(fit_gamma)、切换时间点(fit_t_)、用于校正代表性不足的未剪切读数(fit_scaling)、未剪切和拼接读数的标准偏差(fit_std_u ,fit_std_s),基因可能性(fit_likelihood),推断稳定状态水平(fit_steady_u)与其相应的p值(fit_pval_steady_s),整体模型方差(fit_variance),和一个缩放系数,以比对基因的潜在时间到普遍的基因共享潜在时间(fit_alignment_scaling)。
潜在时间
动力学模型可恢复细胞过程的潜在时间。这个潜伏时间代表细胞的内部时钟,并接近细胞在分化时所经历的实时,分析仅基于其转录动力学。
[10]:scv.tl.latent_time(adata)scv.pl.scatter(adata, color='latent_time', color_map='gnuplot', size=80)
computing terminal states identified 2 regions of root cells and 1 region of end points finished (0:00:00) --> added 'root_cells', root cells of Markov diffusion process (adata.obs) 'end_points', end points of Markov diffusion process (adata.obs)computing latent time finished (0:00:02) --> added 'latent_time', shared time (adata.obs)
[11]:top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]scv.pl.heatmap(adata, var_names=top_genes, sortby='latent_time', col_color='clusters', n_convolve=100)
top基因
驱动基因显示明显的动力学行为,并可通过动力学模型中特征系统地被检测到。
[12]:top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).indexscv.pl.scatter(adata, basis=top_genes[:15], ncols=5, frameon=False)
[13]:var_names = ['Actn4', 'Ppp3ca', 'Cpe', 'Nnat']scv.pl.scatter(adata, var_names, frameon=False)scv.pl.scatter(adata, x='latent_time', y=var_names, frameon=False)
cluster特异top基因
此外,可以计算每个细胞群的部分基因可能性,以便对潜在驱动因素进行特定的cluster识别。
[14]:scv.tl.rank_dynamical_genes(adata, groupby='clusters')df = scv.get_df(adata, 'rank_dynamical_genes/names')df.head(5)
ranking genes by cluster-specific likelihoods finished (0:00:03) --> added 'rank_dynamical_genes', sorted scores by group ids (adata.uns)[14]:
[15]:for cluster in ['Ductal', 'Ngn3 high EP', 'Pre-endocrine', 'Beta']: scv.pl.scatter(adata, df[cluster][:5], ylabel=cluster, frameon=False)