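Hello everyone. Today I'm sharing a new piece of analysis: a problem that comes up in joint single-cell and spatial analysis. Most current tools for joint single-cell/spatial analysis are harder to apply in settings where there is no clear way to stratify cells into discrete types or subtypes. This is especially important when cells that belong to the same overall type (e.g., T helper cells) may carry different functions and span a continuum of states. As one way of dealing with this basic difficulty of single-cell data analysis, current algorithms let you choose the resolution of the data to be analyzed (i.e., the number of clusters per broad cell type). There is an inherent trade-off, however: deeper clustering of the scRNA-seq data provides more granular transcriptomic resolution, but it makes the deconvolution problem harder and the results potentially less accurate. That is today's question: does partitioning the single-cell data more finely really give better joint single-cell/spatial results? And can the continuous change in state of a single cell type be characterized in space?
The paper we follow today is "Multi-resolution deconvolution of spatial transcriptomics data reveals continuous patterns of inflammation". Its main question is how to balance the classification resolution of the single-cell data against the accuracy of the joint single-cell/spatial analysis. What we really want is to characterize the different states of a cell type accurately in space. The authors also provide an algorithm worth borrowing from, so let's go through it step by step.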
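One way to see why finer clusters can make deconvolution harder, assuming the deconvolution is solved as a linear system over cell type signatures: as two subtypes become more alike, their signature rows become nearly collinear and the system becomes ill-conditioned. The sketch below only illustrates that intuition. The profiles, the `condition_number` helper, and the `delta` values are made up for the example and do not come from the paper.

```python
import numpy as np

# Hypothetical normalized expression profiles over five genes.
base = np.array([0.40, 0.30, 0.15, 0.10, 0.05])   # one broad cell type
other = np.array([0.05, 0.10, 0.15, 0.30, 0.40])  # a clearly different type

def condition_number(delta):
    """Split `base` into two subtypes that differ by `delta` on two genes
    and return the condition number of the resulting signature matrix."""
    shift = np.array([delta, -delta, 0.0, 0.0, 0.0])
    signatures = np.stack([base + shift, base - shift, other])
    return np.linalg.cond(signatures)

coarse = condition_number(0.10)  # well-separated subtypes
fine = condition_number(0.01)    # nearly identical subtypes
print(f"coarse: {coarse:.1f}, fine: {fine:.1f}")
```

A larger condition number means that small noise in a spot's counts can translate into large changes in the estimated proportions of the two similar subtypes.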
Abstract
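We all know that joint single-cell/spatial analysis is a very useful approach that can answer many biological questions. But even within a single cell type there is a continuum of cell states that cannot be clearly partitioned, yet reflects important differences in cell function and in how cells interact with their surroundings (that is, differences between subpopulations and their environments). Analyzing the different states of cells at multiple resolutions is therefore especially important.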
Introduction
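Current methods for joint single-cell/spatial analysis include NMFReg, RCTD, SPOTLight, Stereoscope, DSTG, and cell2location (I have covered all of these before, so feel free to look back; each has its strengths and weaknesses). These methods mostly work in two steps: first, infer cell type transcriptional signatures from the scRNA-seq data; then, use a linear model to estimate the proportion of each cell type within each spot. This approach has given good results in some settings, particularly for brain tissue sections, where the diversity of cellular composition is well captured by a discrete view of cell types. The drawback is just as clear: when cell types are partitioned too finely (so that cell states are separated out), the computational cost is high and accuracy also drops (deeper clustering of the scRNA-seq data provides more granular transcriptomic resolution but makes the deconvolution problem more difficult, and the results potentially less accurate).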
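The two-step recipe described above can be sketched in a few lines of NumPy. This is a minimal illustration of the general idea, not the implementation of any of the tools named. It assumes signatures are simple per-type mean profiles and solves the non-negative least squares problem with plain projected gradient descent. The function names and the toy count matrix are hypothetical.

```python
import numpy as np

def estimate_signatures(counts, labels):
    """Step 1: average the normalized expression profile of each cell type."""
    types = sorted(set(labels))
    labels = np.asarray(labels)
    profiles = counts / counts.sum(axis=1, keepdims=True)
    signatures = np.stack([profiles[labels == t].mean(axis=0) for t in types])
    return types, signatures  # signatures has shape (n_types, n_genes)

def nnls_projected_gradient(A, b, n_iter=2000):
    """Minimize 0.5 * ||A w - b||^2 subject to w >= 0."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        # Gradient step followed by projection onto the non-negative orthant.
        w = np.maximum(w - step * A.T @ (A @ w - b), 0.0)
    return w

def deconvolve_spot(spot_counts, signatures):
    """Step 2: fit non-negative weights and normalize them to proportions."""
    profile = spot_counts / spot_counts.sum()
    weights = nnls_projected_gradient(signatures.T, profile)
    return weights / weights.sum()

# Toy reference: one cell per type, six genes (made-up counts).
reference = np.array([
    [90, 80, 5, 5, 10, 10],   # B cell
    [5, 5, 90, 80, 10, 10],   # T cell
    [10, 10, 5, 5, 90, 80],   # monocyte
], dtype=float)
types, signatures = estimate_signatures(reference, ["B", "T", "Mono"])

# A noise-free spot made of 5 B cells, 3 T cells and 2 monocytes.
spot = 5 * reference[0] + 3 * reference[1] + 2 * reference[2]
estimated = deconvolve_spot(spot, signatures)
print(dict(zip(types, estimated.round(3))))
```

Each cell type is represented by a single fixed profile here, which is exactly the discrete view that the rest of this post argues is too coarse when a type spans a continuum of states.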
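Here the authors take a new angle: a conditional deep generative model learns both cell-type-specific profiles and continuous sub-cell-type variation (that is, both the specificity and the continuity of each cell type are modeled), and it recovers cell type frequencies together with a cell-type-specific snapshot of the average transcriptional state at every spot. The authors also apply the method to concrete case studies. Let's go through the algorithm and the results.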
Result
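How the algorithm works
DestVI (the authors' software) uses two different latent variable models (LVMs) to delineate cell type proportions as well as cell-type-specific continuous sub-states. The input to DestVI is a pair of transcriptomic datasets: the query spatial transcriptomics data and a reference scRNA-seq dataset from the same tissue.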
- Note: (A) A spatial transcriptomics analysis workflow relies on two data modalities that produce unpaired transcriptomic measurements in the form of count matrices. The spatial transcriptomics (ST) data measure gene expression in each spot, together with the spot's location λ. However, each spot may contain multiple cells. The single-cell RNA-sequencing data measure gene expression in each cell, but the spatial information is lost because of tissue dissociation. After annotation, each cell can be associated with a cell type. These matrices are the input to DestVI, which is composed of two latent variable models: the single-cell latent variable model (scLVM) and the spatial transcriptomics latent variable model (stLVM). DestVI outputs a joint representation of the single-cell data and the spatial data by estimating the proportion of every cell type in every spot, and by projecting the expression of each spot onto cell-type-specific latent spaces. These inferred values may be used for downstream analysis such as cell-type-specific differential expression and comparative analyses of conditions.
(B) Schematic of the scLVM. RNA counts and cell type information from the single-cell RNA-sequencing data are jointly transformed by an encoder neural network into the parameters of the approximate posterior of γ, a low-dimensional representation of cell-type-specific cell state. Next, a decoder neural network maps samples from the approximate posterior of γ, along with the cell type information, to the parameters of a negative binomial distribution for every gene. A superscript on the network denotes its output for a single gene, that is, the corresponding entry of the vector ρ.
(C) Schematic of the stLVM. RNA counts from the spatial transcriptomics data are transformed by an encoder neural network into the parameters of the cell-type-specific embeddings γ. Free parameters β encode the abundance of each cell type in each spot, and may be normalized into cell-type proportions π. Next, the decoder from the scLVM maps the cell-type-specific embeddings γ to estimates of cell-type-specific gene expression. These parameters are averaged across all cell types, weighted by the abundance parameters β, to approximate the gene expression of the spot with a negative binomial distribution. After training, the decoder may be used to perform cell-type-specific imputation of gene expression across all spots.
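DestVI assumes that every cell in the reference dataset is annotated with a discrete cell type label. For every spot it then infers cell type proportions and cell-type-specific states. This spot-level information may then be used for downstream analysis and formulation of biological hypotheses.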
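To make the stLVM description more concrete, here is a toy NumPy sketch of how a spot's expected expression is assembled: each cell type contributes a profile decoded from its own embedding γ, the profiles are summed with the abundance weights β, and β is normalized into proportions π. Everything here is an assumption for illustration: `toy_decoder` is a random linear map with a softmax standing in for the trained neural network decoder, `library_size` and `theta` are made-up values, and any further terms the real model may include are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_types, n_genes, n_latent = 3, 5, 2

# Hypothetical stand-in for the shared decoder: one linear map per cell type
# followed by a softmax, so each output is a normalized expression profile.
W = rng.normal(size=(n_types, n_latent, n_genes))
b = rng.normal(size=(n_types, n_genes))

def toy_decoder(gamma, cell_type):
    logits = gamma @ W[cell_type] + b[cell_type]
    e = np.exp(logits - logits.max())
    return e / e.sum()

def spot_mean(beta, gammas, library_size):
    """Expected counts of one spot: abundance-weighted sum of the
    cell-type-specific profiles, scaled by a library size."""
    rho = np.stack([toy_decoder(gammas[c], c) for c in range(n_types)])
    return library_size * beta @ rho

beta = np.array([2.0, 1.0, 1.0])               # abundances for one spot
gammas = rng.normal(size=(n_types, n_latent))  # one embedding per cell type
proportions = beta / beta.sum()                # beta normalized into pi
mean = spot_mean(beta, gammas, library_size=1000.0)

# Draw counts from a negative binomial with mean `mean` and dispersion `theta`.
theta = 10.0
counts = rng.negative_binomial(theta, theta / (theta + mean))
print(proportions, counts)
```

Because γ is continuous, the profile of a given cell type can differ from spot to spot, which is what lets this kind of model represent a continuum of states instead of one fixed signature per type.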
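Writing this algorithm out really is a bit of a hassle.
Comparison across tools (naturally, in the authors' own paper the authors' tool performs best)
First, simulated "spot" data are synthesized by randomly combining single cells, so the cell type proportions of every spot are known in advance. Linear cell states are simulated as well: "To model the continuum of cell states, we construct a linear model for every cell type, with a negative binomial likelihood."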
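The general idea of building pseudo-spots with known composition can be sketched as follows. This is not the paper's simulation procedure (which additionally models continuous cell states with a per-type linear model and a negative binomial likelihood). It is a simplified version that assumes a Dirichlet draw for the target proportions and sums the counts of randomly chosen reference cells. `simulate_spots`, `cells_per_spot`, and the toy data are hypothetical.

```python
import numpy as np

def simulate_spots(counts, labels, n_spots, cells_per_spot=10, seed=0):
    """Sum randomly chosen single cells into pseudo-spots and keep the
    true cell type proportions of every spot."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    types = np.unique(labels)
    spots = np.zeros((n_spots, counts.shape[1]), dtype=counts.dtype)
    truth = np.zeros((n_spots, len(types)))
    for s in range(n_spots):
        # Draw target proportions, then an integer number of cells per type.
        target = rng.dirichlet(np.ones(len(types)))
        n_per_type = rng.multinomial(cells_per_spot, target)
        for k, t in enumerate(types):
            pool = np.flatnonzero(labels == t)
            chosen = rng.choice(pool, size=n_per_type[k], replace=True)
            spots[s] += counts[chosen].sum(axis=0)
        truth[s] = n_per_type / cells_per_spot
    return spots, truth, types

# Toy reference: 30 cells, 4 genes, three annotated types (made-up data).
toy_counts = np.random.default_rng(1).poisson(5, size=(30, 4))
toy_labels = ["B"] * 10 + ["T"] * 10 + ["Mono"] * 10
spots, truth, types = simulate_spots(toy_counts, toy_labels, n_spots=5)
print(types, truth[0])
```

The `truth` matrix is what a deconvolution method's estimated proportions would be scored against.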
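The results of the method comparison are as follows
Taken together, these results suggest that DestVI offers a good alternative to discrete deconvolution algorithms, especially when there are rich patterns of continuous transcriptional variation within cell types, as is the case in most biological models. Specifically, DestVI shows robust performance on gene expression imputation while still estimating cell type proportions adequately. Note that the analysis is restricted to spots in which the cell type in question is sufficiently abundant. As expected, DestVI's ability to predict cell-type-specific gene expression decreases in the low-frequency regime; the effect on the accuracy of the cell type proportion estimates is much smaller, however. DestVI can therefore provide an internal control for which spots can be taken into account when conducting a cell-type-specific analysis of gene expression or cell state.
Let's take a look at the example code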
Multi-resolution deconvolution of spatial transcriptomics
import sys
# If True, install via PyPI; otherwise install from source
stable = False
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB and stable:
    !pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and not stable:
    !pip install --quiet --upgrade jsonschema
    !pip install --quiet git+https://github.com/yoseflab/scvi-tools@master#egg=scvi-tools[tutorials]
#!wget --quiet https://github.com/romain-lopez/DestVI-reproducibility/blob/master/lymph_node/deconvolution/ST-LN-compressed.h5ad?raw=true -O ST-LN-compressed.h5ad
#!wget --quiet https://github.com/romain-lopez/DestVI-reproducibility/blob/master/lymph_node/deconvolution/scRNA-LN-compressed.h5ad?raw=true -O scRNA-LN-compressed.h5ad
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from matplotlib.lines import Line2D
import umap
import scvi
from scvi.model import CondSCVI, DestVI
import torch
%matplotlib inline
Data preprocessing
sc_adata = sc.read_h5ad("scRNA-LN-compressed.h5ad")
sc.pl.umap(sc_adata, color="broad_cell_types")
# let us filter some genes
G = 2000
sc.pp.filter_genes(sc_adata, min_counts=10)
sc_adata.layers["counts"] = sc_adata.X.copy()
sc.pp.highly_variable_genes(
    sc_adata,
    n_top_genes=G,
    subset=True,
    layer="counts",
    flavor="seurat_v3"
)
sc.pp.normalize_total(sc_adata, target_sum=10e4)
sc.pp.log1p(sc_adata)
sc_adata.raw = sc_adata
Now, let’s load the spatial data and choose a common gene subset
st_adata = sc.read_h5ad("ST-LN-compressed.h5ad")
st_adata.layers["counts"] = st_adata.X.copy()
sc.pp.normalize_total(st_adata, target_sum=10e4)
sc.pp.log1p(st_adata)
st_adata.raw = st_adata
# filter genes to be the same on the spatial data
intersect = np.intersect1d(sc_adata.var_names, st_adata.var_names)
st_adata = st_adata[:, intersect].copy()
sc_adata = sc_adata[:, intersect].copy()
G = len(intersect)
sc.pl.embedding(st_adata, basis="location", color="lymph_node")
Fit the scLVM
CondSCVI.setup_anndata(sc_adata, layer="counts", labels_key="broad_cell_types")
sc_model = CondSCVI(sc_adata, weight_obs=True)
sc_model.train(max_epochs=250)
sc_model.history["elbo_train"].plot()
Deconvolution with stLVM
DestVI.setup_anndata(st_adata, layer="counts")
st_model = DestVI.from_rna_model(st_adata, sc_model)
st_model.train(max_epochs=2500)
st_model.history["elbo_train"].plot()
Output
Cell type proportions
st_adata.obsm["proportions"] = st_model.get_proportions()
st_adata.obsm["proportions"]
ct_list = ["B cells", "CD8 T cells", "Monocytes"]
for ct in ct_list:
    data = st_adata.obsm["proportions"][ct].values
    # clip at the 99th percentile so a few extreme spots do not dominate the color scale
    st_adata.obs[ct] = np.clip(data, 0, np.quantile(data, 0.99))
sc.pl.embedding(st_adata, basis="location", color=ct_list)
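As expected, we observe strong compartmentalization of cell types (B cells / T cells) in the lymph node. We also observe differential localization of the monocytes.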
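The plotting code above clips each cell type's proportions at its own 99th percentile before drawing the map. A small stand-alone example with made-up numbers shows what that does: a handful of extreme spots no longer stretch the color scale, while all other values are left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 0.1, size=1000)  # made-up "typical" proportions
data[:3] = 0.9                           # a few extreme spots

upper = np.quantile(data, 0.99)          # 99th percentile of the values
clipped = np.clip(data, 0, upper)        # cap everything above it

print(f"max before: {data.max():.2f}, max after: {clipped.max():.3f}")
```

The clipping only affects what is stored in `obs` for plotting in the tutorial; the unclipped proportions remain available in `obsm["proportions"]`.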
Intra cell type information (the key part)
# more globally, the values of the gamma are all summarized in this dictionary of data frames
for ct, g in st_model.get_gamma().items():
    st_adata.obsm["{}_gamma".format(ct)] = g
st_adata.obsm["B cells_gamma"].head(5)
Because those values may be hard to examine for end-users, we presented several methods for prioritizing the study of different cell types (based on PCA and Hotspot). If you’d like to use those methods, please refer to our DestVI reproducibility repository. If you have suggestions to improve those, and would like to see them in the main codebase, reach out to us.
In this tutorial, we assume that the user has identified key gene modules that vary within one cell type in the single-cell RNA sequencing data (e.g., using Hotspot). We provide here a code snippet for imputing the spatial pattern of cell-type-specific gene expression, using the example of the IFN-I inflammation signal.
plt.figure(figsize=(8, 8))
ct_name = "Monocytes"
gene_name = ["Cxcl9", "Cxcl10", "Fcgr1"]
# we must filter spots with low abundance (consult the paper for an automatic procedure)
indices = np.where(st_adata.obsm["proportions"][ct_name].values > 0.03)[0]
# impute genes and combine them
specific_expression = np.sum(st_model.get_scale_for_ct(ct_name, indices=indices)[gene_name], 1)
specific_expression = np.log(1 + 1e4 * specific_expression)
# plot (i) background (ii) g
plt.scatter(st_adata.obsm["location"][:, 0], st_adata.obsm["location"][:, 1], alpha=0.05)
plt.scatter(st_adata.obsm["location"][indices][:, 0], st_adata.obsm["location"][indices][:, 1],
            c=specific_expression, s=10, cmap="Reds")
plt.colorbar()
plt.title(f"Imputation of {gene_name} in {ct_name}")
plt.show()
The method is worth a look, but Seurat is still the mainstream.
Life is good, and even better with you.