A detailed walkthrough of cell2location for joint analysis of 10X single-cell and spatial data

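It's Friday again and New Year's Day is almost here. How is everyone planning to ring in the new year? Still, today is a working day, so let's keep learning and make the most of our time.

This post is a detailed walkthrough of cell2location, covering both the algorithm and the way the code is written; there is a lot to learn from it.

It demonstrates how to use the cell2location model to map reference cell types from single-cell data onto a spatial transcriptomics dataset. Here we use 10X single-nucleus RNA sequencing (snRNAseq) data and Visium spatial transcriptomics data generated from adjacent tissue sections of the mouse brain (single nucleus + spatial).

Cell2location is a Bayesian model that integrates single-cell RNA-seq (scRNA-seq) with multi-cell spatial transcriptomics to efficiently map large, comprehensive cell type references.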


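Part 1: Estimating expression signatures of reference cell types from single-cell data

The first step is to estimate reference cell type signatures from scRNA-seq profiles, for example by using conventional clustering to identify cell types and subpopulations and then estimating the average gene expression profile of each cluster. Cell2location implements this estimation step with Negative Binomial regression, which allows data to be combined robustly across technologies and batches (multiple samples also bring batch effects into play).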

Loading packages

import sys
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import os

data_type = 'float32'

import cell2location

import matplotlib as mpl
from matplotlib import rcParams
import matplotlib.pyplot as plt
import seaborn as sns

# silence scanpy that prints a lot of warnings
import warnings
warnings.filterwarnings('ignore')

Loading single cell reference data

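We use a paired Visium and snRNAseq reference dataset from the mouse brain (i.e. generated from adjacent tissue sections). The dataset consists of cells from 3 sections from 2 mice. 59 neuronal and glial cell populations across multiple brain regions have been annotated, including 10 regional astrocyte subtypes (the cell type annotation of the single-cell data must be done beforehand).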

sc_data_folder = './data/'
results_folder = './results/mouse_brain_snrna/'
os.makedirs(sc_data_folder, exist_ok=True)
os.system(f'cd {sc_data_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/all_cells_20200625.h5ad')
os.system(f'cd {sc_data_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/snRNA_annotation_astro_subtypes_refined59_20200823.csv')


# makedirs creates parent folders as needed and does not fail if they already exist
os.makedirs(results_folder, exist_ok=True)

Read the single-cell data and the cell annotations

## snRNA reference (raw counts)
adata_snrna_raw = anndata.read_h5ad(sc_data_folder + "all_cells_20200625.h5ad")

## Cell type annotations
labels = pd.read_csv(sc_data_folder + 'snRNA_annotation_astro_subtypes_refined59_20200823.csv', index_col=0)

Add cell type labels as columns in adata.obs

# reindex is the first function worth noting: it aligns the annotation rows to the cell order in adata
labels = labels.reindex(index=adata_snrna_raw.obs_names)
adata_snrna_raw.obs[labels.columns] = labels
adata_snrna_raw = adata_snrna_raw[~adata_snrna_raw.obs['annotation_1'].isna(), :]
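The alignment step above can be illustrated on a toy table. This is a minimal sketch with made-up cell names and labels (all values here are hypothetical); it only shows how `reindex` reorders rows, fills missing cells with NaN, and how the NaN rows are then filtered out.

```python
import pandas as pd

# Hypothetical cell barcodes, in the order used by the AnnData object.
obs_names = pd.Index(["cell_a", "cell_b", "cell_c", "cell_d"])

# Annotation table in a different order, missing "cell_d" and with one extra cell.
labels = pd.DataFrame(
    {"annotation_1": ["Astro", "Oligo", "Ext", "Inh"]},
    index=["cell_c", "cell_a", "cell_x", "cell_b"],
)

# reindex returns rows in the requested order; labels it cannot find become NaN,
# and rows that were not requested ("cell_x") are dropped.
aligned = labels.reindex(index=obs_names)

# Keep only the cells that actually received an annotation.
keep = ~aligned["annotation_1"].isna()
annotated = aligned[keep]
```

Because the rows of `aligned` follow `obs_names` exactly, they can be assigned to `adata.obs` column by column without any risk of mismatched cells.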

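Reduce the number of genes by discarding lowly expressed genes

This is done using 2 cut-offs, to remove as many lowly expressed genes as possible while avoiding highly variable gene (HVG) selection, which is prone to deleting markers of rare populations:

  • include all genes expressed by at least 3% of cells (cell_count_cutoff2)
  • include genes expressed by at least 0.05% of cells (cell_count_cutoff) when they have high counts in non-zero cells (nonz_mean_cutoff)

We prefer this way of selecting genes because the second criterion retains genes that are expressed by rare cell populations but at high levels, whereas standard HVG selection can filter these genes out because of their low global mean and variance.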
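To see how the two cut-offs interact, here is a minimal sketch on a hypothetical toy count matrix. It uses the same thresholds as the text but applies them directly rather than on the log10 scale used for plotting.

```python
import numpy as np

# Hypothetical toy count matrix: 1000 cells x 4 genes.
n_cells_total = 1000
counts = np.zeros((n_cells_total, 4))
counts[:500, 0] = 1      # gene 0: broadly expressed (50% of cells), low counts
counts[:5, 1] = 10       # gene 1: rare (0.5% of cells) but highly expressed
counts[:5, 2] = 1        # gene 2: rare and lowly expressed
# gene 3: never detected

n_cells = (counts > 0).sum(axis=0)
# Mean count among expressing cells; left at 0 where a gene is never detected.
nonz_mean = np.divide(counts.sum(axis=0), n_cells,
                      out=np.zeros(4), where=n_cells > 0)

nonz_mean_cutoff = 1.12
cell_count_cutoff = n_cells_total * 0.0005   # 0.05% of cells
cell_count_cutoff2 = n_cells_total * 0.03    # 3% of cells

selected = ((nonz_mean > nonz_mean_cutoff) | (n_cells > cell_count_cutoff2)) \
           & (n_cells > cell_count_cutoff)
```

Gene 1 is the case the text cares about: it is expressed in too few cells to pass the 3% cut-off, yet it survives because its mean count in expressing cells is high.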

# remove cells and genes with 0 counts everywhere
sc.pp.filter_cells(adata_snrna_raw, min_genes=1)
sc.pp.filter_genes(adata_snrna_raw, min_cells=1)

# calculate the mean of each gene across non-zero cells
adata_snrna_raw.var['n_cells'] = (adata_snrna_raw.X.toarray() > 0).sum(0)
adata_snrna_raw.var['nonz_mean'] = adata_snrna_raw.X.toarray().sum(0) / adata_snrna_raw.var['n_cells']

plt.hist2d(np.log10(adata_snrna_raw.var['nonz_mean']),
           np.log10(adata_snrna_raw.var['n_cells']), bins=100,
           norm=mpl.colors.LogNorm(),
           range=[[0,0.5], [1,4.5]]);

nonz_mean_cutoff = np.log10(1.12) # cut off for expression in non-zero cells
cell_count_cutoff = np.log10(adata_snrna_raw.shape[0] * 0.0005) # cut off percentage for cells with higher expression
cell_count_cutoff2 = np.log10(adata_snrna_raw.shape[0] * 0.03)# cut off percentage for cells with small expression

plt.vlines(nonz_mean_cutoff, cell_count_cutoff, cell_count_cutoff2, color = 'orange');
plt.hlines(cell_count_cutoff, nonz_mean_cutoff, 1, color = 'orange');
plt.hlines(cell_count_cutoff2, 0, nonz_mean_cutoff, color = 'orange');
plt.xlabel('Mean count in cells with mRNA count > 0 (log10)');
plt.ylabel('Count of cells with mRNA count > 0 (log10)');
Show the number of selected cells and genes:
adata_snrna_raw[:,(np.array(np.log10(adata_snrna_raw.var['nonz_mean']) > nonz_mean_cutoff)
         | np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff2))
      & np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff)].shape
###(40532, 12844)
Filter the object
# select genes based on mean expression in non-zero cells
adata_snrna_raw = adata_snrna_raw[:,(np.array(np.log10(adata_snrna_raw.var['nonz_mean']) > nonz_mean_cutoff)
         | np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff2))
      & np.array(np.log10(adata_snrna_raw.var['n_cells']) > cell_count_cutoff)
              & np.array(~adata_snrna_raw.var['SYMBOL'].isna())]
Add counts matrix as adata.raw
adata_snrna_raw.raw = adata_snrna_raw

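Show UMAP of cells

We can examine the cell composition of the data by using the standard scanpy workflow to generate a UMAP representation of the single cell data.

Note two things here: the first PC, which explains the most variance, is removed, and batch correction is done with bbknn.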
#########################
adata_snrna_raw.X = adata_snrna_raw.raw.X.copy()
sc.pp.log1p(adata_snrna_raw)

sc.pp.scale(adata_snrna_raw, max_value=10)
sc.tl.pca(adata_snrna_raw, svd_solver='arpack', n_comps=80, use_highly_variable=False)

# Plot total counts over PC to check whether PC is indeed associated with total counts
#sc.pl.pca_variance_ratio(adata_snrna_raw, log=True)
#sc.pl.pca(adata_snrna_raw, color=['total_counts'],
#          components=['0,1', '2,3', '4,5', '6,7', '8,9', '10,11', '12,13'],
#          color_map = 'RdPu', ncols = 3, legend_loc='on data',
#          legend_fontsize=10, gene_symbols='SYMBOL')

# remove the first PC which explains large amount of variance in total UMI count (likely technical variation)
adata_snrna_raw.obsm['X_pca'] = adata_snrna_raw.obsm['X_pca'][:, 1:]
adata_snrna_raw.varm['PCs'] = adata_snrna_raw.varm['PCs'][:, 1:]
#########################

# Here BBKNN (https://github.com/Teichlab/bbknn) is used to align batches (10X experiments)
import bbknn
bbknn.bbknn(adata_snrna_raw, neighbors_within_batch = 3, batch_key = 'sample', n_pcs = 79)
sc.tl.umap(adata_snrna_raw, min_dist = 0.8, spread = 1.5)

#########################

adata_snrna_raw = adata_snrna_raw[adata_snrna_raw.obs['annotation_1'].argsort(),:]
with mpl.rc_context({'figure.figsize': [10, 10],
                     'axes.facecolor': 'white'}):
    sc.pl.umap(adata_snrna_raw, color=['annotation_1'], size=15,
               color_map = 'RdPu', ncols = 1, legend_loc='on data',
               legend_fontsize=10)
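The "drop the first PC" step is just column slicing on the embedding and loading matrices. The following sketch uses simulated data (not the tutorial dataset) in which a per-cell depth factor dominates the variance, so that PC1 tracks depth; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 cells x 30 genes, where a per-cell "depth" factor
# scales every gene and therefore dominates the variance.
depth = rng.uniform(1.0, 10.0, size=(200, 1))
X = depth * rng.uniform(0.5, 1.5, size=(1, 30)) + rng.normal(0, 0.1, size=(200, 30))

# PCA via SVD of the centred matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
n_comps = 10
X_pca = U[:, :n_comps] * S[:n_comps]   # cell embeddings (like adata.obsm['X_pca'])
PCs = Vt[:n_comps].T                   # gene loadings (like adata.varm['PCs'])

# How strongly does PC1 track the depth factor?
pc1_depth_corr = abs(np.corrcoef(X_pca[:, 0], depth[:, 0])[0, 1])

# Drop the first component from both the embeddings and the loadings.
X_pca_trimmed = X_pca[:, 1:]
PCs_trimmed = PCs[:, 1:]
```

This also explains why the tutorial computes 80 components but passes `n_pcs = 79` downstream: one column has been removed.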

Estimating expression signatures

A brief introduction to the model

Model-based estimation of reference expression signatures of cell types (g_{f,g}) using a regularised Negative Binomial regression. This model robustly derives reference expression signatures of cell types (g_{f,g}) using data composed of multiple batches (e={1..E}) and technologies (t={1..T}). Adapting the assumptions of a range of computational methods for scRNA-seq, we assume that the expression count matrix follows a Negative Binomial distribution with unobserved expression levels (rates) (\mu_{c,g}) and a gene-specific over-dispersion (\alpha_g). We model (\mu_{c,g}) as a linear function of reference cell type signatures and technical effects:

  • (e_e) denotes a multiplicative global scaling parameter between experiments/batches (e) (e.g. differences in sequencing depth);
  • (t_{t,g}) accounts for multiplicative gene-specific differences in sensitivity between technologies;
  • (b_{e,g}) accounts for an additive background shift of each gene in each experiment (e) (a proxy for free-floating RNA).

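Training the model

Here we show how to train this model with a single pipeline function call, how to assess the quality of the model, and how to extract reference signatures of cell types for use with cell2location (the parameters are well worth a closer look):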

# Run the pipeline:
from cell2location import run_regression
r, adata_snrna_raw = run_regression(adata_snrna_raw, # input data object

                   verbose=True, return_all=True,

                   train_args={
                    'covariate_col_names': ['annotation_1'], # column listing cell type annotation
                    'sample_name_col': 'sample', # column listing sample ID for each cell

                    # column listing technology, e.g. 3' vs 5',
                    # when integrating multiple single cell technologies corresponding
                    # model is automatically selected
                    'tech_name_col': None,

                    'stratify_cv': 'annotation_1', # stratify cross-validation by cell type annotation

                    'n_epochs': 100, 'minibatch_size': 1024, 'learning_rate': 0.01,

                    'use_cuda': True, # use GPU?

                    'train_proportion': 0.9, # proportion of cells in the training set (for cross-validation)
                    'l2_weight': True,  # uses defaults for the model

                    'readable_var_name_col': 'SYMBOL', 'use_raw': True},

                   model_kwargs={}, # keep defaults
                   posterior_args={}, # keep defaults

                   export_args={'path': results_folder + 'regression_model/', # where to save results
                                'save_model': True, # save pytorch model?
                                'run_name_suffix': ''})

reg_mod = r['mod']

Saved anndata object and the trained model object can be read later using

reg_mod_name = 'RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes'
reg_path = f'{results_folder}regression_model/{reg_mod_name}/'

## snRNAseq reference (raw counts)
adata_snrna_raw = sc.read(f'{reg_path}sc.h5ad')
## model
import pickle
with open(f'{reg_path}model_.p', 'rb') as f:
    r = pickle.load(f)
reg_mod = r['mod']

Export reference expression signatures of cell types (the use of re here is also worth studying)

# Export cell type expression signatures:
covariate_col_names = 'annotation_1'

inf_aver = adata_snrna_raw.raw.var.copy()
inf_aver = inf_aver.loc[:, [f'means_cov_effect_{covariate_col_names}_{i}' for i in adata_snrna_raw.obs[covariate_col_names].unique()]]
from re import sub
# strip the common prefix so that each column is named by its cell type
inf_aver.columns = [sub(f'means_cov_effect_{covariate_col_names}_', '', c) for c in inf_aver.columns]
inf_aver = inf_aver.iloc[:, inf_aver.columns.argsort()]

# scale up by average sample scaling factor
inf_aver = inf_aver * adata_snrna_raw.uns['regression_mod']['post_sample_means']['sample_scaling'].mean()
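Stripping a fixed prefix from column names with `re.sub` can be sketched as follows. The cell type names are hypothetical; the `re.escape` call and the `^` anchor are additions of this sketch, useful when names may contain regex metacharacters.

```python
import re

covariate_col_names = "annotation_1"
prefix = f"means_cov_effect_{covariate_col_names}_"

# Hypothetical column names in the style of the exported regression results.
columns = [
    "means_cov_effect_annotation_1_Astro_HPC",
    "means_cov_effect_annotation_1_Ext_L2/3",
    "means_cov_effect_annotation_1_Oligo+",
]

# re.escape protects against regex metacharacters in the prefix;
# "^" anchors the match to the start of the name.
pattern = "^" + re.escape(prefix)
cell_types = [re.sub(pattern, "", c) for c in columns]
```

Only the pattern is treated as a regular expression, so characters such as `/` or `+` in the remaining cell type name pass through untouched.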

Compare the estimated signatures (y-axis) with the analytically computed mean expression (x-axis):

# compute mean expression of each gene in each cluster
aver = cell2location.cluster_averages.cluster_averages.get_cluster_averages(adata_snrna_raw, covariate_col_names)
aver = aver.loc[adata_snrna_raw.var_names, inf_aver.columns]

# compare estimated signatures (y-axis) to analytically computed mean expression (x-axis)
with mpl.rc_context({'figure.figsize': [5, 5]}):
    plt.hist2d(np.log10(aver.values.flatten()+1), np.log10(inf_aver.values.flatten()+1),
               bins=50, norm=mpl.colors.LogNorm());
    plt.xlabel('Mean expression in each cluster');
    plt.ylabel('Inferred expression in each cluster');

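Check whether the estimated signatures are less correlated with each other now that the confounding sample background has been removed: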

# Look at how correlated are the signatures obtained by computing mean expression
with mpl.rc_context({'figure.figsize': [5, 5]}):
    reg_mod.align_plot_stability(aver, aver, 'cluster_average', 'cluster_average', align=False)

# Look at how correlated are the signatures inferred by regression model - they should be less correlated than above
with mpl.rc_context({'figure.figsize': [5, 5]}):
    reg_mod.align_plot_stability(inf_aver, inf_aver, 'inferred_signature', 'inferred_signature', align=False)

Compare the cell count of each experiment with the estimated background (soup, i.e. free-floating RNA):

# Examine how many mRNA per cell on average are background
sample_name_col = 'sample'
cell_count = adata_snrna_raw.obs[sample_name_col].value_counts()
cell_count.index = [f'means_sample_effect{sample_name_col}_{i}' for i in cell_count.index]
soup_amount = reg_mod.sample_effects.sum(0)

with mpl.rc_context({'figure.figsize': [5, 5]}):
    plt.scatter(cell_count[soup_amount.index].values.flatten(),
                soup_amount.values.flatten());
    plt.xlabel('Cell count per sample'); # fraction of reads in cells
    plt.ylabel('Inferred sum of sample effects');

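Additional quality control: removing technical effects and performing standard scanpy single cell analysis workflow

This lets us determine whether the model has successfully accounted for technical factors, by checking whether removing these factors from every single cell merges samples/batches while keeping cell types well separated in UMAP space. Note the use of del here: del removes a name or, as in this case, a dictionary entry (the 'log1p' key of adata.uns); it does not by itself destroy data that is still referenced elsewhere. As before, the first PC, which explains a large amount of variance in total UMI count, is removed.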
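The behaviour of `del` can be checked with plain Python objects. In this small sketch a dict stands in for `adata.uns`; its contents are hypothetical.

```python
# A dict standing in for adata.uns (hypothetical contents).
uns = {"log1p": {"base": None}, "umap": {"params": {}}}
log1p_info = uns["log1p"]      # a second reference to the same object

del uns["log1p"]               # removes the key from the dict ...

key_removed = "log1p" not in uns
object_survives = log1p_info == {"base": None}   # ... but the object is still reachable

x = [1, 2, 3]
y = x
del x                          # unbinds the name x only; the list lives on through y
```

In the tutorial the entry is presumably deleted so that the later `sc.pp.log1p` call does not find a leftover marker from the earlier log transform.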

adata_snrna_raw_cor = adata_snrna_raw.copy()
del adata_snrna_raw_cor.uns['log1p']

adata_snrna_raw_cor.X = np.array(reg_mod.normalise(adata_snrna_raw_cor.raw.X.copy()))

sc.pp.log1p(adata_snrna_raw_cor)
sc.pp.scale(adata_snrna_raw_cor, max_value=10)
# when all RNA of a given gene are additive background this results in NaN after scaling
adata_snrna_raw_cor.X[np.isnan(adata_snrna_raw_cor.X)] = 0
sc.tl.pca(adata_snrna_raw_cor, svd_solver='arpack', n_comps=80, use_highly_variable=False)

adata_snrna_raw.obs['total_counts'] = np.array(adata_snrna_raw.raw.X.sum(1)).flatten()
adata_snrna_raw_cor.obs['total_counts'] = adata_snrna_raw.obs['total_counts'].values.copy()

sc.pl.pca(adata_snrna_raw_cor, color=['total_counts'],
         components=['1,2'],
         color_map = 'RdPu', ncols = 2, legend_loc='on data', vmax='p99.9',
         legend_fontsize=10)

# remove the first PC which explains large amount of variance in total UMI count (likely technical variation)
adata_snrna_raw_cor.obsm['X_pca'] = adata_snrna_raw_cor.obsm['X_pca'][:, 1:]
adata_snrna_raw_cor.varm['PCs'] = adata_snrna_raw_cor.varm['PCs'][:, 1:]

# here we use standard neighbors function rather than bbknn
# to show that the regression model can merge batches / experiments
sc.pp.neighbors(adata_snrna_raw_cor, n_neighbors = 15, n_pcs = 79, metric='cosine')
sc.tl.umap(adata_snrna_raw_cor, min_dist = 0.8, spread = 1)
with mpl.rc_context({'figure.figsize': [7, 7],
                     'axes.facecolor': 'white'}):
    sc.pl.umap(adata_snrna_raw_cor, color=['annotation_1', 'sample', 'total_counts'],
               color_map = 'RdPu', ncols = 1, size=13, #legend_loc='on data',
               legend_fontsize=10, palette=sc.pl.palettes.default_102)

Part 2: Spatial mapping of cell types across the mouse brain (2/3) - cell2location

Loading packages and setting up GPU

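First, we need to load the relevant packages and tell cell2location to use the GPU. cell2location is written with pymc3, a library for probabilistic modelling, which uses a deep learning library called theano for heavy computations. While the package works on both GPU and CPU, using a GPU significantly shortens computation time for 10X Visium datasets. For smaller datasets with fewer spatial locations (e.g. Nanostring WTA technology), using the CPU is more feasible.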

import sys
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import os
import gc

# this line forces theano to use the GPU and should go before importing cell2location
os.environ["THEANO_FLAGS"] = 'device=cuda0,floatX=float32,force_device=True'
# if using the CPU uncomment this:
#os.environ["THEANO_FLAGS"] = 'device=cpu,floatX=float32,openmp=True,force_device=True'

import cell2location

import matplotlib as mpl
from matplotlib import rcParams
import matplotlib.pyplot as plt
import seaborn as sns

# silence scanpy that prints a lot of warnings
import warnings
warnings.filterwarnings('ignore')

Loading Visium data

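First, we need to download and unzip the spatial data from the data portal, and download the annotation file of the reference cell types: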

# Set paths to data and results used through the document:
sp_data_folder = './data/mouse_brain_visium_wo_cloupe_data/'
results_folder = './results/mouse_brain_snrna/'

regression_model_output = 'RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes'
reg_path = f'{results_folder}regression_model/{regression_model_output}/'

# Download and unzip spatial data
# check for the spatial data folder itself: './data' may already exist from Part 1
if not os.path.exists(sp_data_folder):
    os.makedirs('./data', exist_ok=True)
    os.system('cd ./data && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_wo_cloupe_data.zip')
    os.system('cd ./data && unzip mouse_brain_visium_wo_cloupe_data.zip')

# Download and unzip snRNA-seq data with signatures of reference cell types
# (if the output folder was not created by tutorial 1/3)
if not os.path.exists(reg_path):
    # creates all intermediate folders and does not fail if some already exist
    os.makedirs(reg_path, exist_ok=True)
    os.system(f'cd {reg_path} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_snrna/regression_model/RegressionGeneBackgroundCoverageTorch_65covariates_40532cells_12819genes/sc.h5ad')
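Now, let's read the spatial Visium data from the 10X Space Ranger output and examine several QC plots. Here, we load the Visium mouse brain experiments (i.e. sections) and the corresponding histology images into a single anndata object adata.

It is worth studying how these functions are defined, so that your own functions end up just as tidy.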

def read_and_qc(sample_name, path=sp_data_folder + 'rawdata/'):
    r""" This function reads the data for one 10X spatial experiment into the anndata object.
    It also calculates QC metrics. Modify this function if required by your workflow.

    :param sample_name: Name of the sample
    :param path: path to data
    """

    adata = sc.read_visium(path + str(sample_name),
                           count_file='filtered_feature_bc_matrix.h5', load_images=True)
    adata.obs['sample'] = sample_name
    adata.var['SYMBOL'] = adata.var_names
    adata.var.rename(columns={'gene_ids': 'ENSEMBL'}, inplace=True)
    adata.var_names = adata.var['ENSEMBL']
    adata.var.drop(columns='ENSEMBL', inplace=True)

    # Calculate QC metrics
    from scipy.sparse import csr_matrix
    adata.X = adata.X.toarray()
    sc.pp.calculate_qc_metrics(adata, inplace=True)
    adata.X = csr_matrix(adata.X)
    adata.var['mt'] = [gene.startswith('mt-') for gene in adata.var['SYMBOL']]
    adata.obs['mt_frac'] = adata[:, adata.var['mt'].tolist()].X.sum(1).A.squeeze()/adata.obs['total_counts']

    # add sample name to obs names
    adata.obs["sample"] = [str(i) for i in adata.obs['sample']]
    adata.obs_names = adata.obs["sample"] \
                          + '_' + adata.obs_names
    adata.obs.index.name = 'spot_id'

    return adata

def select_slide(adata, s, s_col='sample'):
    r""" This function selects the data for one slide from the spatial anndata object.

    :param adata: Anndata object with multiple spatial experiments
    :param s: name of selected experiment
    :param s_col: column in adata.obs listing experiment name for each location
    """

    slide = adata[adata.obs[s_col].isin([s]), :]
    s_keys = list(slide.uns['spatial'].keys())
    s_spatial = np.array(s_keys)[[s in k for k in s_keys]][0]

    slide.uns['spatial'] = {s_spatial: slide.uns['spatial'][s_spatial]}

    return slide

#######################
# Read the list of spatial experiments
sample_data = pd.read_csv(sp_data_folder + 'Visium_mouse.csv')

# Read the data into anndata objects
slides = []
for i in sample_data['sample_name']:
    slides.append(read_and_qc(i, path=sp_data_folder + 'rawdata/'))

# Combine anndata objects together
adata = slides[0].concatenate(
    slides[1:],
    batch_key="sample",
    uns_merge="unique",
    batch_categories=sample_data['sample_name'],
    index_unique=None
)
#######################

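Note! Mitochondria-encoded genes (gene names starting with the prefix mt- or MT-) are irrelevant for spatial mapping because their expression represents technical artifacts in the single cell and nucleus data rather than the biological abundance of mitochondria. Yet these genes make up 15-40% of the mRNA in each location. Hence, to avoid mapping artifacts, we strongly recommend removing mitochondrial genes.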

# mitochondria-encoded (MT) genes should be removed for spatial mapping
adata.obsm['mt'] = adata[:, adata.var['mt'].values].X.toarray()
adata = adata[:, ~adata.var['mt'].values]  # boolean-mask indexing is a neat way to drop genes
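The same mask-based pattern can be tried on a plain NumPy array. This is a sketch with a hypothetical 3 x 5 count matrix; it computes the mitochondrial fraction per location, sets the mitochondrial columns aside, and drops them with the inverted mask.

```python
import numpy as np

# Hypothetical toy data: 3 locations x 5 genes.
symbols = np.array(["Rorb", "mt-Co1", "Vip", "mt-Nd1", "Gad1"])
X = np.array([[5, 20, 1, 10, 4],
              [2, 15, 0,  5, 8],
              [9, 30, 3, 12, 6]])

# Boolean mask over genes (columns).
is_mt = np.array([s.startswith("mt-") for s in symbols])

# Fraction of each location's counts that comes from mitochondrial genes.
mt_frac = X[:, is_mt].sum(axis=1) / X.sum(axis=1)

# Keep the removed columns aside, then drop them with the inverted mask.
X_mt = X[:, is_mt]
X_kept = X[:, ~is_mt]
kept_symbols = symbols[~is_mt]
```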

Look at QC metrics

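Now let's look at QC: the total number of counts and the total number of genes per location across the Visium experiments.
Python's enumerate function, used in the loop below, is also worth learning.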
# PLOT QC FOR EACH SAMPLE
fig, axs = plt.subplots(len(slides), 4, figsize=(15, 4*len(slides)-4))
for i, s in enumerate(adata.obs['sample'].unique()):
    #fig.suptitle('Covariates for filtering')

    slide = select_slide(adata, s)
    sns.distplot(slide.obs['total_counts'],
                 kde=False, ax = axs[i, 0])
    axs[i, 0].set_xlim(0, adata.obs['total_counts'].max())
    axs[i, 0].set_xlabel(f'total_counts | {s}')

    sns.distplot(slide.obs['total_counts']\
                 [slide.obs['total_counts']<20000],
                 kde=False, bins=40, ax = axs[i, 1])
    axs[i, 1].set_xlim(0, 20000)
    axs[i, 1].set_xlabel(f'total_counts | {s}')

    sns.distplot(slide.obs['n_genes_by_counts'],
                 kde=False, bins=60, ax = axs[i, 2])
    axs[i, 2].set_xlim(0, adata.obs['n_genes_by_counts'].max())
    axs[i, 2].set_xlabel(f'n_genes_by_counts | {s}')

    sns.distplot(slide.obs['n_genes_by_counts']\
                 [slide.obs['n_genes_by_counts']<6000],
                 kde=False, bins=60, ax = axs[i, 3])
    axs[i, 3].set_xlim(0, 6000)
    axs[i, 3].set_xlabel(f'n_genes_by_counts | {s}')

plt.tight_layout()
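The role of `enumerate` in the plotting loop, pairing each sample with the row index of its subplots, can be seen without any plotting. The sketch below only builds the panel labels; the sample name `ST8059049` is a hypothetical placeholder.

```python
# Hypothetical sample names; in the tutorial these come from adata.obs['sample'].unique().
samples = ["ST8059048", "ST8059049", "ST8059052"]
metrics = ["total_counts", "total_counts (<20000)",
           "n_genes_by_counts", "n_genes_by_counts (<6000)"]

# enumerate yields (index, item); the indices play the role of axs[i, j]
# in the plotting loop: i selects the row (sample), j the column (metric).
panel_titles = {}
for i, s in enumerate(samples):
    for j, metric in enumerate(metrics):
        panel_titles[(i, j)] = f"{metric} | {s}"
```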

Visualise Visium data in spatial 2D and UMAP coordinates

Visualising data in spatial coordinates with scanpy

Next, we show how to plot these QC values over the histology image using standard scanpy tools
slide = select_slide(adata, 'ST8059048')

with mpl.rc_context({'figure.figsize': [6,7],
                     'axes.facecolor': 'white'}):
    sc.pl.spatial(slide, img_key = "hires", cmap='magma',
                  library_id=list(slide.uns['spatial'].keys())[0],
                  color=['total_counts', 'n_genes_by_counts'], size=1,
                  gene_symbols='SYMBOL', show=False, return_fig=True)
Here we show how to use scanpy to plot the expression of individual genes without the histology image.
with mpl.rc_context({'figure.figsize': [6,7],
                     'axes.facecolor': 'black'}):
    sc.pl.spatial(slide,
                  color=["Rorb", "Vip"], img_key=None, size=1,
                  vmin=0, cmap='magma', vmax='p99.0',
                  gene_symbols='SYMBOL'
                 )

Add counts matrix as adata.raw

adata_vis = adata.copy()
adata_vis.raw = adata_vis

######## Select two Visium sections to speed up the analysis

Select two Visium sections, also called experiments/batches, to speed up the analysis: one from each biological replicate.

s = ['ST8059048', 'ST8059052']
adata_vis = adata_vis[adata_vis.obs['sample'].isin(s),:]

Construct and examine UMAP of locations

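Now we apply the standard scanpy processing pipeline to the spatial Visium data to show the variability in the experimental data. Importantly, this workflow shows the extent of batch differences in the data.

In this mouse brain dataset only a few regions should differ between sections, because we use 2 samples from biological replicates that sit at slightly different positions along the anterior-posterior axis of the mouse brain. We see a general alignment of locations from the two experiments and some mismatches; most of the differences between experiments here come from batch effects, which cell2location can account for.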

adata_vis_plt = adata_vis.copy()

# Log-transform (log(data + 1))
sc.pp.log1p(adata_vis_plt)

# Find highly variable genes within each sample
adata_vis_plt.var['highly_variable'] = False
for s in adata_vis_plt.obs['sample'].unique():

    adata_vis_plt_1 = adata_vis_plt[adata_vis_plt.obs['sample'].isin([s]), :]
    sc.pp.highly_variable_genes(adata_vis_plt_1, min_mean=0.0125, max_mean=5, min_disp=0.5, n_top_genes=1000)

    hvg_list = list(adata_vis_plt_1.var_names[adata_vis_plt_1.var['highly_variable']])
    adata_vis_plt.var.loc[hvg_list, 'highly_variable'] = True

# Scale the data ( (data - mean) / sd )
sc.pp.scale(adata_vis_plt, max_value=10)
# PCA, KNN construction, UMAP
sc.tl.pca(adata_vis_plt, svd_solver='arpack', n_comps=40, use_highly_variable=True)
sc.pp.neighbors(adata_vis_plt, n_neighbors = 20, n_pcs = 40, metric='cosine')
sc.tl.umap(adata_vis_plt, min_dist = 0.3, spread = 1)

with mpl.rc_context({'figure.figsize': [8, 8],
                     'axes.facecolor': 'white'}):
    sc.pl.umap(adata_vis_plt, color=['sample'], size=30,
               color_map = 'RdPu', ncols = 1, #legend_loc='on data',
               legend_fontsize=10)
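The per-sample HVG loop above takes the union of genes flagged within each sample. Here is a minimal sketch of that idea on simulated data, using plain variance as a stand-in for scanpy's dispersion-based selection (an assumption made purely for illustration).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
genes = [f"gene_{k}" for k in range(6)]

# Hypothetical log-expression for two samples (rows: locations, columns: genes).
expr = pd.DataFrame(rng.normal(0, 0.1, size=(40, 6)), columns=genes)
sample = np.repeat(["s1", "s2"], 20)
expr.loc[sample == "s1", "gene_0"] += rng.normal(0, 3, size=20)  # variable only in s1
expr.loc[sample == "s2", "gene_4"] += rng.normal(0, 3, size=20)  # variable only in s2

# Flag the single most variable gene within each sample, then take the union.
highly_variable = pd.Series(False, index=genes)
for s in np.unique(sample):
    top = expr[sample == s].var(axis=0).nlargest(1).index
    highly_variable.loc[top] = True

hvg_union = set(highly_variable[highly_variable].index)
```

Selecting within each sample and merging afterwards keeps genes that vary inside a section, rather than genes that merely differ between sections because of batch effects.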

Load reference cell type signature from snRNA-seq data and show UMAP of cells

Next, we load the pre-processed snRNAseq reference anndata object that contains the estimated expression signatures of reference cell types.

## snRNAseq reference (raw counts)
adata_snrna_raw = sc.read(f'{reg_path}sc.h5ad')
Export reference expression signatures of cell types
# Column name containing cell type annotations
covariate_col_names = 'annotation_1'

# Extract a pd.DataFrame with signatures from anndata object
inf_aver = adata_snrna_raw.raw.var.copy()
inf_aver = inf_aver.loc[:, [f'means_cov_effect_{covariate_col_names}_{i}' for i in adata_snrna_raw.obs[covariate_col_names].unique()]]
from re import sub
# strip the common prefix so that each column is named by its cell type
inf_aver.columns = [sub(f'means_cov_effect_{covariate_col_names}_', '', c) for c in inf_aver.columns]
inf_aver = inf_aver.iloc[:, inf_aver.columns.argsort()]

# normalise by average experiment scaling factor (corrects for sequencing depth)
inf_aver = inf_aver * adata_snrna_raw.uns['regression_mod']['post_sample_means']['sample_scaling'].mean()

Quick look at the cell type composition in our reference data in UMAP coordinates
with mpl.rc_context({'figure.figsize': [10, 10],
                     'axes.facecolor': 'white'}):
    sc.pl.umap(adata_snrna_raw, color=['annotation_1'], size=15,
               color_map = 'RdPu', ncols = 1, legend_loc='on data',
               legend_fontsize=10)

Cell2location model description and analysis pipeline

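Cell2location is implemented as an interpretable hierarchical Bayesian model, thereby (1) providing a principled way to account for model uncertainty; (2) accounting for linear dependencies in cell type abundances; (3) modelling differences in measurement sensitivity across technologies; and (4) accounting for unexplained/residual variation by employing a flexible count-based error model. Finally, (5) cell2location supports joint modelling of multiple experiments/batches.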

Brief description of the model

Briefly, cell2location is a Bayesian model, which estimates absolute cell density of cell types by decomposing mRNA counts (d_{s,g}) of each gene (g={1, .., G}) at locations (s={1, .., S}) into a set of predefined reference signatures of cell types (g_{f,g}). Joint modelling mode works across experiments (e={1,..,E}), such as 10X Visium chips (i.e. square capture areas) and Slide-Seq V2 pucks (i.e. beads).

Cell2location models the elements of (d_{s,g}) as Negative Binomial distributed, given an unobserved rate (\mu_{s,g}) and a gene-specific over-dispersion parameter (\alpha_{eg}):
\[ d_{s,g} \sim \mathrm{NB}(\mu_{s,g}, \alpha_{eg}) \]

The expression level of genes (\mu_{s,g}) in the mRNA count space is modelled as a linear function of expression signatures of reference cell types:
\[ \mu_{s,g} = \underbrace{m_{g}}_{\text{technology sensitivity}} \cdot \underbrace{\left(\sum_{f} w_{s,f} \, g_{f,g}\right)}_{\text{cell type contributions}} + \underbrace{l_s + s_{eg}}_{\text{additive shift}} \]

where, (w_{s,f}) denotes regression weight of each reference signature (f) at location (s), which can be interpreted as the number of cells at location (s) that express reference signature (f); (m_{g}) a gene-specific scaling parameter, which adjusts for global differences in sensitivity between technologies; (l_s) and (s_{eg}) are additive variables that account for gene- and location-specific shift, such as due to contaminating or free-floating RNA.
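The rate equation can be written out in a few lines of NumPy. This is a sketch for a single experiment with toy sizes and arbitrary, hypothetical parameter values. It is not the package's implementation, and the Negative Binomial parameterisation (mean mu, variance mu + mu^2 / alpha) is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
S, F, G = 5, 3, 4          # locations, cell types, genes (toy sizes)

w_sf = rng.gamma(2.0, 1.0, size=(S, F))     # cell abundance per location and cell type
g_fg = rng.gamma(2.0, 1.0, size=(F, G))     # reference signatures
m_g = rng.gamma(4.0, 1 / 8, size=G)         # gene-specific technology sensitivity
l_s = rng.gamma(1.0, 0.1, size=(S, 1))      # location-specific additive shift
s_eg = rng.gamma(1.0, 0.1, size=(1, G))     # gene-specific additive shift (one experiment)
alpha_eg = np.full(G, 5.0)                  # over-dispersion (one experiment)

# mu_{s,g} = m_g * sum_f(w_{s,f} g_{f,g}) + l_s + s_{eg}
mu_sg = m_g * (w_sf @ g_fg) + l_s + s_eg

# Assumed NB parameterisation: mean mu, variance mu + mu^2 / alpha.
# NumPy's negative_binomial(n, p) has mean n(1-p)/p, so n = alpha and p = alpha / (alpha + mu).
p = alpha_eg / (alpha_eg + mu_sg)
d_sg = rng.negative_binomial(alpha_eg, p)
```

The matrix product `w_sf @ g_fg` is the "cell type contributions" term; inference in cell2location runs in the opposite direction, recovering `w_sf` from observed counts and known signatures.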

To account for the similarity of location patterns across cell types, (w_{s,f}) is modelled using another layer of decomposition (factorization) using (r={1, .., R}) groups of cell types, that can be interpreted as cellular compartments or tissue zones (Suppl. Methods). Unless stated otherwise, (R) is set to 50.

Selecting hyper-parameters

Note! While the scaling parameter (m_{g}) facilitates the integration across technologies, it leads to non-identifiability between (m_{g}) and (w_{s,f}), unless informative priors on both variables are used. To address this, we employ informative prior distributions on (w_{s,f}) and (m_{g}), which are controlled by 4 user-provided hyper-parameters. For guidance on selecting these hyper-parameters see below and Suppl. Methods (Section 1.3).

For the mouse brain we suggest using the following values for the 4 user-provided hyper-parameters:

1. (\hat{N} = 8), the expected number of cells per location, estimated based on comparison to the histology image;
2. (\hat{A} = 9), the expected number of cell types per location, assuming that most cells in a given location belong to a different type and that many locations contain cell processes rather than complete cells;
3. (\hat{Y} = 5), the expected number of co-located cell type groups per location, assuming that very few cell types have linearly dependent abundance patterns, except for the regional astrocytes and corresponding neurons, such that on average about 2 cell types per group are expected (\hat{A}/\hat{Y}=1.8);
4. the mean and variance that define the hyperprior on the gene-specific scaling parameter (m_{g}), allowing the user to define prior beliefs on the sensitivity of the spatial technology compared to the scRNA-seq reference.

Joint modelling of multiple experiments

Joint modelling of spatial data sets from multiple experiments provides several benefits due to sharing information between experiments (such as 10X Visium chips (i.e. square capture areas) and Slide-Seq V2 pucks (i.e. beads)):

  • Increasing accuracy by improving the ability of the model to distinguish low sensitivity (m_{g}) from zero cell abundance (w_{s,f}), which is achieved by sharing the change in sensitivity between technologies (m_{g}) across experiments. Similarly to common practice in single cell data analysis, this is equivalent to regressing out the effect of technology but not the effect of individual experiment.

  • Increasing sensitivity by sharing information on cell types with co-varying abundances during decomposition of (w_{s,f}) into groups of cell types (r={1, .., R}).

Training cell2location: specifying data input and hyper-parameters

Here we show how to train the cell2location model to estimate cell abundances in each location. This workflow is wrapped into a single pipeline:

sc.settings.set_figure_params(dpi = 100, color_map = 'viridis', dpi_save = 100,
                              vector_friendly = True, format = 'pdf',
                              facecolor='white')

r = cell2location.run_cell2location(

      # Single cell reference signatures as pd.DataFrame
      # (could also be data as anndata object for estimating signatures
      #  as cluster average expression - `sc_data=adata_snrna_raw`)
      sc_data=inf_aver,
      # Spatial data as anndata object
      sp_data=adata_vis,

      # the column in sc_data.obs that gives cluster identity of each cell
      summ_sc_data_args={'cluster_col': "annotation_1",
                        },

      train_args={'use_raw': True, # By default uses raw slots in both of the input datasets.
                  'n_iter': 40000, # Increase the number of iterations if needed (see QC below)

                  # When analysing data that contains multiple experiments,
                  # cell2location automatically enters the mode which pools information across experiments
                  'sample_name_col': 'sample'}, # Column in sp_data.obs with experiment ID (see above)


      export_args={'path': results_folder, # path where to save results
                   'run_name_suffix': '' # optional suffix to modify the name of the run
                  },

      model_kwargs={ # Prior on the number of cells, cell types and co-located groups

                    'cell_number_prior': {
                        # - N - the expected number of cells per location:
                        'cells_per_spot': 8, # < - change this
                        # - A - the expected number of cell types per location (use default):
                        'factors_per_spot': 7,
                        # - Y - the expected number of co-located cell type groups per location (use default):
                        'combs_per_spot': 7
                    },

                     # Prior beliefs on the sensitivity of spatial technology:
                    'gene_level_prior':{
                        # Prior on the mean
                        'mean': 1/2,
                        # Prior on standard deviation,
                        # a good choice of this value should be at least 2 times lower than the mean
                        'sd': 1/4
                    }
      }
)
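To get a feel for what `'mean': 1/2` and `'sd': 1/4` imply, one can convert them into the shape and rate of a Gamma distribution by moment matching. That the prior is Gamma-shaped is an assumption of this sketch (a common choice for positive scaling factors); the text above only states that a mean and a variance define the hyperprior.

```python
import numpy as np

mean, sd = 1 / 2, 1 / 4

# Moment matching for a Gamma distribution: mean = shape / rate, var = shape / rate^2.
shape = mean**2 / sd**2
rate = mean / sd**2

rng = np.random.default_rng(0)
samples = rng.gamma(shape, 1 / rate, size=200_000)  # NumPy uses scale = 1 / rate

# Central 95% interval of the implied sensitivity values.
low, high = np.quantile(samples, [0.025, 0.975])
```

With these values the shape is 4 and the rate is 8, so most of the prior mass sits well below 1, expressing the belief that the spatial technology is less sensitive than the single-cell reference.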

####### Cell2location model output
The results are saved to:

results_folder + r['run_name']

The absolute abundances of cell types are added to sp_data as columns of sp_data.obs. The estimates of all parameters in the model are exported to sp_data.uns['mod']

List of output files:

  • sp.h5ad - Anndata object with all results and spatial data.
  • W_cell_density.csv - absolute abundances of cell types, mean of the posterior distribution.
  • (default) - W_cell_density_q05.csv - absolute abundances of cell types, 5% quantile of the posterior distribution representing confident cell abundance level.
  • W_mRNA_count.csv - absolute mRNA abundance for each cell types, mean of the posterior distribution.
  • (useful for QC, selecting mapped cell types) - W_mRNA_count_q05.csv - absolute mRNA abundance for each cell types, 5% quantile of the posterior distribution representing confident cell abundance level.
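The difference between the posterior mean and the 5% quantile can be illustrated with simulated posterior draws. The numbers below are hypothetical and only meant to show why the q05 files represent a "confident" abundance level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior samples of cell abundance: 1000 draws x 3 locations.
samples = np.stack([
    rng.gamma(50.0, 0.1, size=1000),   # location 0: mean ~5, tightly estimated
    rng.gamma(1.0, 5.0, size=1000),    # location 1: mean ~5, very uncertain
    rng.gamma(1.0, 0.05, size=1000),   # location 2: mean ~0.05
], axis=1)

post_mean = samples.mean(axis=0)
post_q05 = np.quantile(samples, 0.05, axis=0)
```

Locations 0 and 1 have similar posterior means, but only location 0 keeps a high value at the 5% quantile, so the q05 estimate downweights abundances the model is unsure about.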

Evaluating training

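We need to check that our model trained successfully by examining a few diagnostic plots.
First, we look at the ELBO loss / cost function over training iterations. The plot omits the first 20% of training iterations, during which the loss changes by many orders of magnitude. Here we see that the model has converged by the end of training; some noise in the ELBO loss is acceptable. If there are large changes during the last few thousand iterations, we recommend increasing the 'n_iter' parameter. (This part takes a fair amount of maths.)

Large fluctuations in the ELBO loss between training iterations indicate training problems, possibly caused by an incomplete or insufficiently detailed reference of cell types.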

from IPython.display import Image
Image(filename=results_folder +r['run_name']+'/plots/training_history_without_first_20perc.png',
      width=400)
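
The visual check of the ELBO curve can also be approximated numerically. The sketch below is one possible rule of thumb, not something cell2location provides: the function elbo_tail_change, the window size and the simulated loss history are all assumptions made for illustration.

```python
import numpy as np

def elbo_tail_change(loss, skip_frac=0.2, tail=1000):
    """Relative change of the mean loss between the last two windows of `tail` iterations.

    The first `skip_frac` of the iterations is dropped, as in the plot above.
    """
    loss = np.asarray(loss, dtype=float)
    kept = loss[int(len(loss) * skip_frac):]
    if len(kept) < 2 * tail:
        raise ValueError('not enough iterations for two windows')
    last = kept[-tail:].mean()
    prev = kept[-2 * tail:-tail].mean()
    return abs(last - prev) / abs(prev)

# simulated loss history: fast decay to a plateau plus some noise
rng = np.random.default_rng(0)
it = np.arange(30_000)
loss = 1e7 * np.exp(-it / 2000) + 5e5 + rng.normal(0, 500, size=it.size)

change_full = elbo_tail_change(loss)          # trained long enough
change_short = elbo_tail_change(loss[:5000])  # stopped too early
print(change_full, change_short)
```

A tiny relative change at the end suggests a plateau, while a change of tens of percent (the shortened run) is the situation in which increasing 'n_iter' would be worth trying.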

We also need to evaluate the reconstruction accuracy: how well reference cell type signatures explain spatial data by comparing expected value of the model (\mu_{s,g}) (Negative Binomial mean) to observed count of each gene across locations. The ideal case is a perfect diagonal 2D histogram plot (across genes and locations).

A very fuzzy diagonal or large deviations of some genes and locations from the diagonal plot indicate that the reference signatures are incomplete. The reference could be missing certain cell types entirely (e.g. FACS-sorting one cell lineage) or clustering could be not sufficiently granular (e.g. mapping 5-10 broad cell types to a complex tissue). Below is an example of good performance:

Image(filename=results_folder +r['run_name']+'/plots/data_vs_posterior_mean.png',
      width=400)
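
To make the idea of this plot concrete, here is a minimal numpy sketch that compares simulated observed counts with the expected value of a Negative Binomial model in a 2D histogram. All numbers (matrix sizes, the overdispersion alpha) are hypothetical and only illustrate the comparison; they do not come from the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spots, n_genes = 200, 50

# hypothetical expected counts mu_{s,g} (Negative Binomial mean)
mu = rng.gamma(shape=2.0, scale=5.0, size=(n_spots, n_genes))
alpha = 20.0  # hypothetical overdispersion; larger values are closer to Poisson

# Negative Binomial counts simulated as a Gamma-Poisson mixture
rate = rng.gamma(shape=alpha, scale=mu / alpha)
observed = rng.poisson(rate)

# 2D histogram of observed vs expected values on the log10 scale
x = np.log10(observed.ravel() + 1)
y = np.log10(mu.ravel() + 1)
hist, xedges, yedges = np.histogram2d(x, y, bins=30)
r = np.corrcoef(x, y)[0, 1]
print('correlation on log scale:', round(r, 3))
```

When the expected values explain the data, the mass of the histogram sits along the diagonal and the correlation is high; genes or locations far from the diagonal are the ones worth inspecting.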

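Finally, we need to evaluate the robustness of the identified locations by comparing the consistency of estimated cell abundances between two independent training restarts (X-axis and Y-axis). The plot below shows the correlation (colour) between cell abundance profiles from the 2 training restarts. Some cell types can be correlated with each other, but excessive deviation from the diagonal would indicate instability of the solution.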

Image(filename=results_folder +r['run_name']+'/plots/evaluate_stability.png',
      width=400)
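
The comparison between two restarts boils down to a cross-correlation matrix between cell abundance columns. A minimal sketch with simulated abundances (restart_1 and restart_2 are hypothetical stand-ins, not real model output):

```python
import numpy as np

rng = np.random.default_rng(0)
n_spots, n_types = 300, 5

# hypothetical "true" abundances and two noisy estimates from independent restarts
truth = rng.gamma(2.0, 2.0, size=(n_spots, n_types))
restart_1 = truth * rng.lognormal(0, 0.1, size=truth.shape)
restart_2 = truth * rng.lognormal(0, 0.1, size=truth.shape)

def cross_correlation(a, b):
    """Pearson correlation between every column of a and every column of b."""
    a = (a - a.mean(axis=0)) / a.std(axis=0)
    b = (b - b.mean(axis=0)) / b.std(axis=0)
    return a.T @ b / a.shape[0]

corr = cross_correlation(restart_1, restart_2)
print(np.round(corr, 2))
```

A stable solution gives values close to 1 on the diagonal (same cell type in both restarts) and much lower values elsewhere.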

Part 3: Joint analysis of single-cell and spatial data

Loading cell2location model output

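First, let's load the cell2location results. In the export step of the cell2location pipeline, cell type abundances across locations are added to sp_data as columns of sp_data.obs, and all model parameters are exported to sp_data.uns['mod']. This anndata object and a csv file (W.csv / W_q05.csv) with the per-spot cell locations are saved to the results directory.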

Normally, you would have the output on your system (e.g. by running tutorial 2/3), however, you could also start with the output from our data portal:

results_folder = './results/mouse_brain_snrna/'
r = {'run_name': 'LocationModelLinearDependentWMultiExperiment_2experiments_59clusters_5563locations_12809genes'}

# defining useful function
def select_slide(adata, s, s_col='sample'):
    r""" Select data for one slide from the spatial anndata object.

    :param adata: Anndata object with multiple spatial samples
    :param s: name of selected sample
    :param s_col: column in adata.obs listing sample name for each location
    """

    slide = adata[adata.obs[s_col].isin([s]), :]
    s_keys = list(slide.uns['spatial'].keys())
    s_spatial = np.array(s_keys)[[s in k for k in s_keys]][0]

    slide.uns['spatial'] = {s_spatial: slide.uns['spatial'][s_spatial]}

    return slide

# download and unpack the example results if they are not on disk yet
if not os.path.exists(f'{results_folder}{r["run_name"]}'):
    os.makedirs(results_folder, exist_ok=True)
    os.system(f'cd {results_folder} && wget https://cell2location.cog.sanger.ac.uk/tutorial/mouse_brain_visium_results/{r["run_name"]}.zip')
    os.system(f'cd {results_folder} && unzip {r["run_name"]}.zip')

We load the results of the model saved into the adata_vis Anndata object:

sp_data_file = results_folder +r['run_name']+'/sp.h5ad'

adata_vis = anndata.read_h5ad(sp_data_file)

Visualisation of cell type locations

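First, we learn how to visualise cell type locations with the standard scanpy plotting tool sc.pl.spatial and with our custom tool cell2location.plt.mapping_video.plot_spatial, which uses colour interpolation to show several cell types in one plot.
Cell2location estimates the absolute cell abundance and mRNA abundance of the reference cell types. For both measures, the 5% quantile of the posterior distribution is used to show the results; it represents the level of cell abundance and mRNA count that the model is confident in.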

For completeness, for each visium section, sc.pl.spatial was used to produce 4 figure panels showing the locations of all cell types (cell and mRNA abundance, 5% and the mean of the posterior distribution), saved to r['run_name']/plots/spatial/.

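Here, we use the absolute cell density (5% quantile) to visualise the locations of multiple cell types in one plot; the model parameter that represents this is called q05_spot_factors. Shown are 6 neuronal and glial cell types that map to 6 different regions of the mouse brain.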

from cell2location.plt.mapping_video import plot_spatial

# select up to 6 clusters
sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]

slide = select_slide(adata_vis, 'ST8059048')

with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
                  coords=slide.obsm['spatial'] \
                          * list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
                  show_img=True, img_alpha=0.8,
                  style='fast', # fast or dark_background
                  img=list(slide.uns['spatial'].values())[0]['images']['hires'],
                  circle_diameter=6, colorbar_position='right')
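
Two small details of the call above are worth spelling out: how the column names in slide.obs are built, and why the coordinates are multiplied by tissue_hires_scalef (spot coordinates are stored in full-resolution pixels, while the 'hires' image is a downscaled version). The coordinates and the scale factor in this sketch are made-up values for illustration.

```python
import numpy as np

# column names = parameter prefix + cell type name
sel_clust = ['Oligo_2', 'Ext_L23']
prefix = 'q05_spot_factors'
sel_clust_col = [prefix + name for name in sel_clust]
print(sel_clust_col)

# hypothetical spot centres in full-resolution pixel coordinates
spot_xy_fullres = np.array([[1000.0, 2000.0],
                            [1500.0, 2500.0]])
tissue_hires_scalef = 0.17  # hypothetical value of the scale factor

# coordinates that line up with the 'hires' image
spot_xy_hires = spot_xy_fullres * tissue_hires_scalef
print(spot_xy_hires)
```

If the spots appear shifted or far too spread out relative to the histology image, a missing or mismatched scale factor is the first thing to check.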

We can produce this visualisation on a dark background by setting style='dark_background' and hiding the image with img_alpha=0.

with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
                  coords=slide.obsm['spatial'] \
                          * list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
                  show_img=True, img_alpha=0,
                  style='dark_background', # fast or dark_background
                  img=list(slide.uns['spatial'].values())[0]['images']['hires'],
                  circle_diameter=6, colorbar_position='right')

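Now we compare the cell abundance estimates (plot above) with the estimated mRNA abundance of each cell type. This is often useful for identifying cell types that do not map to a particular tissue (mRNA count < 50; note the maximum value on the colour bar). The model parameter that represents this is called q05_nUMI_factors.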

# select up to 6 clusters
sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_nUMI_factors' + str(i) for i in sel_clust]

slide = select_slide(adata_vis, 'ST8059048')

with mpl.rc_context({'figure.figsize': (15, 15)}):
    fig = plot_spatial(slide.obs[sel_clust_col], labels=sel_clust,
                  coords=slide.obsm['spatial'] \
                          * list(slide.uns['spatial'].values())[0]['scalefactors']['tissue_hires_scalef'],
                  show_img=True, img_alpha=0.8, max_color_quantile=0.98,
                  img=list(slide.uns['spatial'].values())[0]['images']['hires'],
                  circle_diameter=6,  colorbar_position='right')
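
The "mRNA count < 50" rule of thumb can also be applied programmatically. In this sketch the cell type names and the q05 mRNA abundances are simulated, and using the 98% quantile across spots as the "maximum" (mirroring max_color_quantile=0.98 in the plot) is an assumption of this example rather than a rule from cell2location.

```python
import numpy as np

cell_types = ['type_A', 'type_B', 'type_C']  # hypothetical names
rng = np.random.default_rng(0)

# hypothetical q05 mRNA abundance, rows = spots, columns = cell types
q05_numi = np.column_stack([
    rng.gamma(2.0, 200.0, size=500),  # clearly mapped
    rng.gamma(2.0, 2.0, size=500),    # almost no mRNA assigned
    rng.gamma(2.0, 100.0, size=500),  # clearly mapped
])

threshold = 50  # mRNA count threshold mentioned in the text
upper = np.quantile(q05_numi, 0.98, axis=0)  # robust "maximum" per cell type
unmapped = [ct for ct, u in zip(cell_types, upper) if u < threshold]
print('cell types that probably did not map:', unmapped)
```

Cell types flagged this way are candidates for exclusion from downstream plots, after a look at their maps.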
#sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
#sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]

# select one section correctly subsetting histology image data
slide = select_slide(adata_vis, 'ST8059048')

# plot with nice names
with mpl.rc_context({'figure.figsize': (10, 10), "font.size": 18}):
    # add slide.obs with nice names
    slide.obs[sel_clust] = (slide.obs[sel_clust_col])

    sc.pl.spatial(slide, cmap='magma',
                  color=sel_clust[0:6], # limit size in this notebook
                  ncols=3,
                  size=0.8, img_key='hires',
                  alpha_img=0.9,
                  vmin=0, vmax='p98'
                 )

Next, we show how to use the standard scanpy pipeline to plot cell locations over histology images (for more extensive information refer to scanpy):

sel_clust = ['Oligo_2', 'Inh_Meis2_3', 'Inh_4', 'Ext_Thal_1', 'Ext_L23', 'Ext_L56']
sel_clust_col = ['q05_spot_factors' + str(i) for i in sel_clust]

# select one section correctly subsetting histology image data
slide = select_slide(adata_vis, 'ST8059048')

# plot with nice names
with mpl.rc_context({'figure.figsize': (10, 10), "font.size": 18}):
    # add slide.obs with nice names
    slide.obs[sel_clust] = (slide.obs[sel_clust_col])

    sc.pl.spatial(slide, cmap='magma',
                  color=sel_clust[0:6], # limit size in this notebook
                  ncols=3,
                  size=0.8, img_key='hires',
                  alpha_img=0.9,
                  vmin=0, vmax='p99.2'
                 )
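
In scanpy, a string such as vmax='p99.2' means that the upper colour limit is set to the 99.2th percentile of the plotted values. The numpy sketch below shows the effect on simulated values with a few extreme spots; the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.gamma(2.0, 2.0, size=1000)  # hypothetical abundance of one cell type
values[:3] = [200.0, 300.0, 400.0]       # a few extreme spots

# upper colour limit at the 99.2th percentile instead of the maximum
vmax = np.percentile(values, 99.2)
clipped = np.clip(values, 0, vmax)
print('max:', values.max(), 'vmax:', round(vmax, 2))
```

Without the percentile limit, the three extreme spots would stretch the colour scale and make the rest of the section look uniformly dark.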

Identifying tissue regions by clustering

We identify tissue regions that differ in their cell composition by clustering locations using cell abundance estimated by cell2location.

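We find tissue regions by clustering Visium spots using the estimated abundance of each cell type. We construct a K-nearest neighbour (KNN) graph representing the similarity of locations in estimated cell abundance and then apply Leiden clustering. The number of KNN neighbours should be adapted to the size of the dataset and the size of anatomically defined regions (e.g. the hippocampal regions are rather small and could be masked by a large n_neighbors). This can be repeated for a range of KNN neighbours and Leiden clustering resolutions until a clustering that matches the anatomical structure of the tissue is obtained.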

The clustering is done jointly across all Visium sections / batches, hence the region identities are directly comparable. When there are strong technical effects between multiple batches (not the case here) sc.external.pp.bbknn can be used to account for those effects during the KNN construction.

The resulting clusters are saved in adata_vis.obs['region_cluster'].

sample_type = 'q05_nUMI_factors'
col_ind = [sample_type in i for i in adata_vis.obs.columns.tolist()]
adata_vis.obsm[sample_type] = adata_vis.obs.loc[:,col_ind].values

# compute KNN using the cell2location output
sc.pp.neighbors(adata_vis, use_rep=sample_type,
                n_neighbors = 20)

# Cluster spots into regions using scanpy
sc.tl.leiden(adata_vis, resolution=1)

# add region as categorical variable
adata_vis.obs["region_cluster"] = adata_vis.obs["leiden"]
adata_vis.obs["region_cluster"] =  adata_vis.obs["region_cluster"].astype("category")
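
What "KNN graph on cell abundance" means can be shown with a brute-force numpy sketch. This is only an illustration of the idea: sc.pp.neighbors uses faster, approximate methods and also computes graph weights, and the two simulated "regions" below are hypothetical.

```python
import numpy as np

def knn_indices(X, n_neighbors):
    """Indices of the n_neighbors nearest rows (Euclidean distance) for every row of X."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                  # a spot is not its own neighbour
    return np.argsort(d2, axis=1)[:, :n_neighbors]

# hypothetical abundances of 3 cell types in two regions with different composition
rng = np.random.default_rng(0)
region_a = rng.normal([10, 0, 0], 0.5, size=(30, 3))
region_b = rng.normal([0, 10, 0], 0.5, size=(30, 3))
X = np.vstack([region_a, region_b])

nbrs = knn_indices(X, n_neighbors=5)
print(nbrs[:3])
```

Spots end up connected to spots with a similar cell composition regardless of where they sit on the slide, which is why the resulting clusters can be read as tissue regions. With a small region of only a few dozen spots, a large n_neighbors would force connections into other regions.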

Visualise the regions in UMAP based on cell abundances and in 2D

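Here we use the same KNN graph, which represents the similarity of locations in terms of cell abundance, to compute a UMAP projection of all locations. We can see that cell2location successfully integrated the 2 sections. Regions with similar positions in the cortex (2D plots below) are composed of spots from both samples (e.g. region clusters 14, 16 and 0, corresponding to cortical layers L4, L5 and L6).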

sc.tl.umap(adata_vis, min_dist = 0.3, spread = 1)

with mpl.rc_context({'figure.figsize': (8, 8)}):
    sc.pl.umap(adata_vis, color=['sample', 'region_cluster'], size=30,
               color_map = 'RdPu', ncols = 2, legend_loc='on data',
               legend_fontsize=10)
# Plot the region identity of each location in 2D space
# Plotting UMAP of integrated datasets before 2D plots of separate sections ensures
# consistency of colour scheme via `adata_vis.uns['region_cluster_colors']`.
with mpl.rc_context({'figure.figsize': (5, 6)}):
    sc.pl.spatial(select_slide(adata_vis, 'ST8059048'),
                  color=["region_cluster"], img_key=None
                );
    sc.pl.spatial(select_slide(adata_vis, 'ST8059052'),
                  color=["region_cluster"], img_key=None
                )

Export regions for import to 10X Loupe Browser

Our region map can be visualised on the histology image and explored interactively using the 10X Loupe Browser (see the 10X website for instructions).

# save maps for each sample separately
sam = np.array(adata_vis.obs['sample'])
for i in np.unique(sam):

    s1 = adata_vis.obs[['region_cluster']]
    s1 = s1.loc[sam == i]
    s1.index = [x[10:] for x in s1.index]
    s1.index.name = 'Barcode'

    s1.to_csv(results_folder +r['run_name']+'/region_cluster29_' + i + '.csv')
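
The slice x[10:] in the loop above removes a fixed number of characters from each observation name. Assuming the names look like 'ST8059048_AAACAAGTATCTCCCA-1' (sample name, underscore, barcode; this format is an assumption, the post does not show the names), a slightly more defensive variant strips the prefix by sample name instead of by length. The helper strip_sample_prefix is hypothetical.

```python
def strip_sample_prefix(obs_name, sample):
    """Remove '<sample>_' from the start of an observation name, leaving the barcode."""
    prefix = sample + '_'
    if not obs_name.startswith(prefix):
        raise ValueError(f'{obs_name!r} does not start with {prefix!r}')
    return obs_name[len(prefix):]

obs_name = 'ST8059048_AAACAAGTATCTCCCA-1'  # hypothetical observation name
barcode = strip_sample_prefix(obs_name, 'ST8059048')
print(barcode)
```

For a 9-character sample name this gives the same result as x[10:], but it fails loudly if a sample name has a different length instead of silently cutting into the barcode.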

Identify groups of co-located cell types using matrix factorisation

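Here, we use the estimated cell abundances as input to non-negative matrix factorisation (NMF) to identify groups of co-located cell types (R), which can be interpreted as cellular compartments or tissue zones. Intuitively, we assume that cell interactions can drive linear dependencies in cell type abundance. In addition, we observe that tissues with highly spatially interlaced cell localisation, such as human lymph node, are better described by NMF than by discrete clustering.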

Tip: If you want to find a few of the most distinct cellular compartments, use a small number of factors. If you want to find very strong co-location signal and assume that most cell types don't co-locate, use a lot of factors (around 30 or more; 30 is used here). In practice, it is better to train NMF for a range of factors (R={5, .., 30}) and select (R) as a balance between capturing fine tissue zones and splitting known compartments.

# number of cell type combinations - educated guess assuming that most cell types don't co-locate
n_fact = int(30)

# extract cell abundance from cell2location
X_data = adata_vis.uns['mod']['post_sample_q05']['spot_factors']

import cell2location.models as c2l
# create model class
mod_sk = c2l.CoLocatedGroupsSklearnNMF(n_fact, X_data,
        n_iter = 10000,
        verbose = True,
        var_names=adata_vis.uns['mod']['fact_names'],
        obs_names=adata_vis.obs_names,
        fact_names=['fact_' + str(i) for i in range(n_fact)],
        sample_id=adata_vis.obs['sample'],
        init='random', random_state=0,
        nmf_kwd_args={'tol':0.0001})

# train 5 times to evaluate stability
mod_sk.fit(n=5, n_type='restart')
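
For readers who want to see what an NMF of a locations × cell types matrix does, here is a minimal numpy sketch using multiplicative updates. It is not the implementation behind CoLocatedGroupsSklearnNMF (which, as the name suggests, wraps scikit-learn); the function nmf_multiplicative and the simulated data are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(X, n_fact, n_iter=500, seed=0, eps=1e-9):
    """Factorise non-negative X (locations x cell types) as W @ H with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_loc, n_types = X.shape
    W = rng.random((n_loc, n_fact)) + eps    # abundance of each group per location
    H = rng.random((n_fact, n_types)) + eps  # cell type composition of each group
    errors = []
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(X - W @ H))
    return W, H, errors

# simulated abundances generated from 3 co-located groups of 8 cell types
rng = np.random.default_rng(1)
true_W = rng.gamma(1.0, 1.0, size=(200, 3))
true_H = rng.gamma(1.0, 1.0, size=(3, 8))
X = true_W @ true_H

W, H, errors = nmf_multiplicative(X, n_fact=3)
rel_error = errors[-1] / np.linalg.norm(X)
print('relative reconstruction error:', round(rel_error, 4))
```

Each row of H is one group of co-located cell types, and each column of W is the map of that group across locations. Different random seeds return the groups in a different order, which is why several restarts are compared afterwards.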

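Now, let's check some diagnostic plots. First, you can see that most cell type combinations are consistent between training restarts of this model (the diagonal with high correlation). The first restart (Y-axis) is the one used, so we can note that factors 21, 23 and 25 (0-based) are not very robust.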

## Do some diagnostics
# evaluate stability by comparing training restarts

with mpl.rc_context({'figure.figsize': (10, 8)}):
    mod_sk.evaluate_stability('cell_type_factors', align=True)
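
Because NMF factors come out in arbitrary order, comparing restarts requires matching the factors first. What align=True does internally is not shown in this post; the sketch below uses a simple greedy matching by correlation as an assumed stand-in, with simulated loadings.

```python
import numpy as np

def align_factors(ref, other):
    """Greedily match columns of `other` to columns of `ref` by Pearson correlation."""
    n = ref.shape[1]
    corr = np.corrcoef(ref.T, other.T)[:n, n:]  # ref factors x other factors
    remaining = corr.copy()
    order = np.empty(n, dtype=int)
    for _ in range(n):
        # take the best remaining pair and remove its row and column
        i, j = np.unravel_index(np.argmax(remaining), remaining.shape)
        order[i] = j
        remaining[i, :] = -np.inf
        remaining[:, j] = -np.inf
    return order, corr[np.arange(n), order]

# hypothetical loadings (cell types x factors) from two restarts:
# the second restart finds the same factors in a different order, plus noise
rng = np.random.default_rng(0)
ref = rng.gamma(1.0, 1.0, size=(59, 4))
perm = np.array([2, 0, 3, 1])
other = ref[:, perm] + rng.normal(0, 0.05, size=ref.shape)

order, matched_corr = align_factors(ref, other)
print(order, matched_corr.round(3))
```

After reordering with other[:, order], a robust factor shows a correlation near 1 with its counterpart; factors with a low matched correlation are the ones to treat with caution.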

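Next, we evaluate how accurately the NMF cell type groups explain the abundance of individual cell types. You should see a diagonal 2D histogram comparing the input cell density data (X-axis) with the model's estimated values (Y-axis). Here we see some small deviations for low-abundance cell types.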

# evaluate accuracy of the model
mod_sk.compute_expected()
mod_sk.plot_posterior_mu_vs_data()

Finally, let’s investigate the composition of each NMF cell type group. We use our model to compute the relative contribution of NMF groups to each cell type ('cell_type_fractions' e.g. 45% of cell abundance of Astro_THAL_hab can be explained by fact_10). Note: factors are exchangeable so while you find consistent factors, each model training restart will output those factors in a different order.

Here we export these parameters from the model into adata_vis.uns['mod_sklearn'] in the spatial anndata object, and print the cell types most specific to each NMF group:

# extract parameters into DataFrames
mod_sk.sample2df(node_name='nUMI_factors', ct_node_name = 'cell_type_factors')

# export results to scanpy object
adata_vis = mod_sk.annotate_adata(adata_vis) # as columns to .obs
adata_vis = mod_sk.export2adata(adata_vis, slot_name='mod_sklearn') # as a slot in .uns

# print the fraction of cells of each type located to each combination
mod_sk.print_gene_loadings(loadings_attr='cell_type_fractions',
                         gene_fact_name='cell_type_fractions')
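
One way to read 'cell_type_fractions' is as NMF loadings normalised so that, for each cell type, the contributions of all groups sum to 1. Whether the model computes them exactly like this is an assumption; the loadings below are made-up numbers.

```python
import numpy as np

cell_types = ['ct_A', 'ct_B', 'ct_C']  # hypothetical cell type names

# hypothetical loadings: rows = cell types, columns = NMF groups
loadings = np.array([[9.0, 1.0, 0.0],
                     [2.0, 2.0, 4.0],
                     [0.0, 0.5, 4.5]])

# fraction of each cell type's abundance explained by each group
fractions = loadings / loadings.sum(axis=1, keepdims=True)
top_factor = fractions.argmax(axis=1)

for ct, row, top in zip(cell_types, fractions, top_factor):
    print(ct, row.round(2), '-> mostly fact_' + str(top))
```

In this toy table, 90% of ct_A is explained by fact_0, which is the kind of statement made for Astro_THAL_hab and fact_10 above.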
# make nice names
from re import sub
mod_sk.cell_type_fractions.columns = [sub('mean_cell_type_factors', '', i)
                                      for i in mod_sk.cell_type_fractions.columns]

# plot co-occurring cell type combinations
mod_sk.plot_gene_loadings(mod_sk.var_names_read, mod_sk.var_names_read,
                        fact_filt=mod_sk.fact_filt,
                        loadings_attr='cell_type_fractions',
                        gene_fact_name='cell_type_fractions',
                        cmap='RdPu', figsize=[10, 15])

Finally, we need to examine the abundance of each cell type group across locations:

# plot cell density in each combination
with mpl.rc_context({'figure.figsize': (5, 6), 'axes.facecolor': 'black'}):

    # select one section correctly subsetting histology image data
    slide = select_slide(adata_vis, 'ST8059048')

    sc.pl.spatial(slide,
                  cmap='magma',
                  color=mod_sk.location_factors_df.columns,
                  ncols=6,
                  size=1, img_key='hires',
                  alpha_img=0,
                  vmin=0, vmax='p99.2'
                 )

Now we save the NMF model object to work with later (remember, every time you train the model, factors with the same composition will have a different order):

# save co-location models object
import pickle

def pickle_model(mod, path, file_suffix=''):
    file = path + 'model_' + str(mod.__class__.__name__) + '_' + str(mod.n_fact) + '_' + file_suffix + ".p"
    # close the file handle deterministically after writing
    with open(file, "wb") as f:
        pickle.dump({'mod': mod, 'fact_names': mod.fact_names}, f)
    print(file)

pickle_model(mod_sk, results_folder +r['run_name'] + '/', file_suffix='')
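
Reading the model back later is the mirror image of saving it. The sketch below round-trips a stand-in object (a SimpleNamespace instead of a trained NMF model) through a temporary file; load_pickled_model is a hypothetical helper, and loading a real model would additionally require cell2location to be importable.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

def load_pickled_model(file):
    """Load the dictionary written by pickle_model and return (model, factor names)."""
    with open(file, 'rb') as f:
        saved = pickle.load(f)
    return saved['mod'], saved['fact_names']

# stand-in for a trained model, only for this illustration
mod = SimpleNamespace(n_fact=3, fact_names=['fact_' + str(i) for i in range(3)])

# write it in the same format as pickle_model, then read it back
path = os.path.join(tempfile.mkdtemp(), 'model_stand_in_3_.p')
with open(path, 'wb') as f:
    pickle.dump({'mod': mod, 'fact_names': mod.fact_names}, f)

loaded_mod, loaded_names = load_pickled_model(path)
print(loaded_mod.n_fact, loaded_names)
```

Keeping the factor names next to the model is what lets you refer to the same factor later, since a retrained model would return the factors in a different order.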

To aid this analysis, we wrapped the analysis shown above into a pipeline that automates training the NMF model with varying number of factors (including export of the same plots and data as shown above).

from cell2location import run_colocation
res_dict, adata_vis = run_colocation(
                   adata_vis, model_name='CoLocatedGroupsSklearnNMF',
                   verbose=False, return_all=True,

                   train_args={
                    'n_fact': np.arange(10, 40), # IMPORTANT: range of number of factors (10-39 here; np.arange excludes the upper bound)
                    'n_iter': 20000, # maximum number of training iterations

                    'sample_name_col': 'sample', # column in adata_vis.obs that identifies the sample

                    'mode': 'normal',
                    'n_type': 'restart', 'n_restarts': 5 # number of training restarts
                   },

                   model_kwargs={'init': 'random', 'random_state': 0, 'nmf_kwd_args': {'tol': 0.00001}},

                   posterior_args={},
                   export_args={'path': results_folder + 'std_model/'+r['run_name']+'/CoLocatedComb/',
                                'run_name_suffix': ''})

Life is good, and even better with you.
