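Hello, it's Friday again. I've been busy and haven't posted, but the anxiety of being 29 and on the doorstep of thirty is still with me, and I don't know how to ease it. Anyway, in this post I want to share a new method, a Bayesian model for compositional single-cell data analysis. The paper is "scCODA is a Bayesian model for compositional single-cell data analysis", published in Nature Communications (NC) in December 2021, and the method is quite good.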
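Background
Changes in cell-type composition are major drivers of biological processes. Because the data are compositional and sample sizes are small, these changes are hard to detect in single-cell experiments.
Recent advances in single-cell RNA sequencing (scRNA-seq) allow large-scale quantitative transcriptional profiling of individual cells across a wide range of tissues, making it possible to monitor transcriptional changes between conditions or developmental stages and to identify distinct cell types in a data-driven way.
Although such changes are important drivers of biological processes such as disease, development, aging, and immunity, detecting shifts in cell-type composition with scRNA-seq is not straightforward. Statistical tests need to account for multiple sources of technical and methodological limitations, including the low number of experimental replicates. In most single-cell technologies the total number of cells per sample is restricted, which means that cell-type counts are proportional in nature. This in turn leads to a negative bias in cell-type correlation estimates. For example, if only one specific cell type is depleted after a perturbation, the relative frequencies of the other cell types will rise. Taken at face value, this would inflate the number of cell types that appear to differ. Standard univariate statistical models that test each cell type independently may therefore mistake certain population shifts for real effects, even though they are induced solely by the inherent negative correlation of the cell-type proportions. Yet the statistical approaches commonly applied to compositional cell-type analysis ignore this effect.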
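The depletion example above can be made concrete with a few lines of arithmetic. This is a minimal sketch with made-up counts (the cell types "A", "B", "C" and their numbers are hypothetical, not taken from the paper): only type C loses cells, yet the proportions of A and B rise although their absolute counts are unchanged.

```python
def proportions(counts):
    """Convert absolute counts per cell type into relative frequencies."""
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}

# Hypothetical counts: only cell type "C" is depleted after the perturbation
before = {"A": 100, "B": 100, "C": 100}
after = {"A": 100, "B": 100, "C": 10}

p_before = proportions(before)
p_after = proportions(after)

for cell_type in before:
    print(cell_type, round(p_before[cell_type], 3), "->", round(p_after[cell_type], 3))
```

A test applied to A or B alone would see an apparent increase, which is exactly the false positive that a joint compositional model is meant to avoid.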
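To account for the inherent bias in cell-type compositions, the authors draw inspiration from methods for compositional analysis of microbiome data and propose a Bayesian approach to differential abundance analysis of cell-type compositions that also addresses the low-replicate problem. The single-cell compositional data analysis (scCODA) framework models cell-type counts with a hierarchical Dirichlet-Multinomial distribution, which accounts for the uncertainty in cell-type proportions and for the negative correlation bias by modeling all measured cell-type proportions jointly rather than individually. The model uses a Logit-normal spike-and-slab prior with a log-link function to estimate the effects of binary (or continuous) covariates on cell-type proportions in a parsimonious way. Since compositional analysis always requires a reference against which compositional changes can be identified, scCODA can either select a suitable cell type as the reference automatically or use a pre-specified reference cell type. This means that credible changes detected by scCODA must be interpreted relative to the selected reference. On top of this, the framework provides access to other well-established compositional test statistics and is fully integrated into the Scanpy pipeline.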
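To give a feel for the count model described above, here is a rough sketch of a Dirichlet-Multinomial draw with a log link, using only the Python standard library. It is an illustration under stated assumptions, not scCODA's implementation: the function `simulate_counts`, the intercepts `alpha`, and the effects `beta` are all hypothetical, and the priors of the real model are left out.

```python
import math
import random

def simulate_counts(alpha, beta, x, n_cells, rng):
    """Draw cell-type counts for one sample from a Dirichlet-Multinomial."""
    # Log link: concentration of each cell type is exp(intercept + covariate * effect)
    conc = [math.exp(a + x * b) for a, b in zip(alpha, beta)]
    # A Dirichlet draw is a set of independent Gamma draws normalized to sum to one
    gammas = [rng.gammavariate(c, 1.0) for c in conc]
    total = sum(gammas)
    props = [g / total for g in gammas]
    # Multinomial step: assign each of the n_cells cells to one cell type
    draws = rng.choices(range(len(props)), weights=props, k=n_cells)
    return [draws.count(k) for k in range(len(props))]

rng = random.Random(0)
alpha = [1.0, 2.0, 2.5]   # hypothetical intercepts for three cell types
beta = [0.0, 1.0, 0.0]    # hypothetical effect on the second cell type only

control = simulate_counts(alpha, beta, x=0, n_cells=1000, rng=rng)
treated = simulate_counts(alpha, beta, x=1, n_cells=1000, rng=rng)
print("control:", control)
print("treated:", treated)
```

Because every sample is forced to contain exactly `n_cells` cells, an effect on one cell type necessarily shifts the counts of the others, which is why the cell types are modeled jointly.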
Code example
Challenges in analyzing cell-type proportions in single-cell data
- scRNA-seq population data is compositional. This must be considered to avoid an inflation of false-positive results.
- Most datasets consist only of very few samples, making frequentist tests inaccurate.
- A condition usually only affects a fraction of cell types. Therefore, sparse effects are preferable.
The scCODA model overcomes all these limitations in a fully Bayesian model that outperforms other compositional and non-compositional methods. (The software is a Python package.)
scCODA - Compositional analysis of single-cell data
# Setup
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import pickle as pkl
import matplotlib.pyplot as plt
from sccoda.util import comp_ana as mod
from sccoda.util import cell_composition_data as dat
from sccoda.util import data_visualization as viz
import sccoda.datasets as scd
Load data
# Load data
cell_counts = scd.haber()
print(cell_counts)
Mouse Endocrine Enterocyte Enterocyte.Progenitor Goblet Stem
0 Control_1 36 59 136 36 239
1 Control_2 5 46 23 20 50
2 Control_3 45 98 188 124 250
3 Control_4 26 221 198 36 131
4 H.poly.Day10_1 42 71 203 147 271
5 H.poly.Day10_2 40 57 383 170 321
6 H.poly.Day3_1 52 75 347 66 323
7 H.poly.Day3_2 65 126 115 33 65
8 Salm_1 37 332 113 59 90
9 Salm_2 32 373 116 67 117
Preprocessing
# Convert data to anndata object
data_all = dat.from_pandas(cell_counts, covariate_columns=["Mouse"])
# Extract condition from mouse name and add it as an extra column to the covariates
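# regex=True is required in pandas >= 2.0, where str.replace treats the pattern as a literal by default
data_all.obs["Condition"] = data_all.obs["Mouse"].str.replace(r"_[0-9]", "", regex=True)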
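As a quick check of what the pattern `_[0-9]` does to the mouse names, the same substitution can be tried with the standard `re` module. This is only a sketch on three names copied from the table above; the variables `mice` and `conditions` are illustrative and not part of the scCODA tutorial.

```python
import re

# Three mouse names taken from the printed table
mice = ["Control_1", "H.poly.Day10_2", "Salm_1"]

# Remove the trailing "_<digit>" replicate suffix to obtain the condition
conditions = [re.sub(r"_[0-9]", "", name) for name in mice]
print(conditions)
```

Note that the dot-separated part of names such as H.poly.Day10 is left alone, because only an underscore followed by a digit matches.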
For our first example, we want to look at how the Salmonella infection influences the cell composition. Therefore, we subset our data.
# Select control and salmonella data
data_salm = data_all[data_all.obs["Condition"].isin(["Control", "Salm"])]
viz.boxplots(data_salm, feature_name="Condition")
plt.show()
Model setup and inference
We can now create the model and run inference on it. Creating a sccoda.util.comp_ana.CompositionalAnalysis class object sets up the compositional model and prepares everything for parameter inference. It needs the following information:
- The data object from above.
- The formula parameter. It specifies how the covariates are used in the model. It can process R-style formulas via the patsy package, e.g. formula="Cov1 + Cov2 + Cov3". Here, we simply use the "Condition" covariate of our dataset.
- The reference_cell_type parameter is used to specify a cell type that is believed to be unchanged by the covariates in formula. This is necessary because compositional analysis must always be performed relative to a reference (see Büttner, Ostner et al., 2021 for a more thorough explanation). If no knowledge about such a cell type exists prior to the analysis, taking a cell type that has a nearly constant relative abundance over all samples is often a good choice. It is also possible to let scCODA find a suitable reference cell type by using reference_cell_type="automatic". Here, we take Goblet cells as the reference.
model_salm = mod.CompositionalAnalysis(data_salm, formula="Condition", reference_cell_type="Goblet")
sim_results = model_salm.sample_hmc()
Result interpretation
sim_results.summary()
Compositional Analysis summary:
Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition
Intercepts:
Final Parameter Expected Sample
Cell Type
Endocrine 1.102 34.068199
Enterocyte 2.328 116.089840
Enterocyte.Progenitor 2.523 141.085258
Goblet 1.753 65.324318
Stem 2.705 169.247878
TA 2.113 93.631267
TA.Early 2.861 197.821355
Tuft 0.449 17.731884
Effects:
Final Parameter Expected Sample \
Covariate Cell Type
Condition[T.Salm] Endocrine 0.0000 24.315528
Enterocyte 1.3571 321.891569
Enterocyte.Progenitor 0.0000 100.696915
Goblet 0.0000 46.623988
Stem 0.0000 120.797449
TA 0.0000 66.827533
TA.Early 0.0000 141.191224
Tuft 0.0000 12.655794
log2-fold change
Covariate Cell Type
Condition[T.Salm] Endocrine -0.486548
Enterocyte 1.471333
Enterocyte.Progenitor -0.486548
Goblet -0.486548
Stem -0.486548
TA -0.486548
TA.Early -0.486548
Tuft -0.486548
Intercepts
The first column of the intercept summary shows the parameters determined by the MCMC inference.
The “Expected sample” column gives some context to the numerical values. If we had a new sample (with no active covariates) with a total number of cells equal to the mean sampling depth of the dataset, then this distribution over the cell types would be most likely.
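The "Expected Sample" column can be reproduced approximately from the intercepts. The sketch below assumes that the expected sample is a softmax of the intercepts scaled by the mean sampling depth, and takes 835 as that depth because the Expected Sample column above sums to 835; both are my reading of the printed summary, not scCODA's code, and the intercepts are the rounded values from the table.

```python
import math

# Rounded intercepts copied from the summary above
intercepts = {
    "Endocrine": 1.102, "Enterocyte": 2.328, "Enterocyte.Progenitor": 2.523,
    "Goblet": 1.753, "Stem": 2.705, "TA": 2.113, "TA.Early": 2.861, "Tuft": 0.449,
}
mean_depth = 835  # assumption: the Expected Sample column sums to this value

def expected_sample(params, depth):
    """Softmax of the parameters, scaled to a sample with `depth` cells."""
    weights = {k: math.exp(v) for k, v in params.items()}
    total = sum(weights.values())
    return {k: depth * w / total for k, w in weights.items()}

expected = expected_sample(intercepts, mean_depth)
for cell_type, n in expected.items():
    print(f"{cell_type:<22} {n:8.2f}")
```

Small deviations from the printed column are expected, since the intercepts are only given to three decimals.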
Effects
For the effect summary, the first column again shows the inferred parameters for all combinations of covariates and cell types. Most important is the distinction between zero and non-zero entries. A value of zero means that no statistically credible effect was detected. For a value other than zero, a credible change was detected. A positive sign indicates an increase, a negative sign a decrease in abundance.
Since the numerical values of the “Final parameter” columns are not straightforward to interpret, the “Expected sample” and “log2-fold change” columns give us an idea of the magnitude of the change. The expected sample is calculated for each covariate separately (covariate value = 1, all other covariates = 0), with the same method as for the intercepts. The log2-fold change is then calculated between this expected sample and the expected sample with no active covariates from the intercept section. Since the data is compositional, cell types for which no credible change was detected will change in abundance as well as soon as a credible effect is detected on another cell type, due to the sum-to-one constraint. If there are no credible effects for a covariate, its expected sample will be identical to the intercept sample, and the log2-fold change is therefore 0.
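The log2-fold changes in the summary can be retraced with the same softmax reasoning. This is a sketch under the assumption that the expected proportions are a softmax of intercept plus effect; the parameter values are the rounded numbers printed above, and the helper `softmax` is illustrative rather than a scCODA function.

```python
import math

# Rounded intercepts and the single non-zero effect from the summary above
intercepts = {
    "Endocrine": 1.102, "Enterocyte": 2.328, "Enterocyte.Progenitor": 2.523,
    "Goblet": 1.753, "Stem": 2.705, "TA": 2.113, "TA.Early": 2.861, "Tuft": 0.449,
}
effects = {"Enterocyte": 1.3571}  # all other effects are 0

def softmax(params):
    """Expected proportions implied by a set of log-scale parameters."""
    weights = {k: math.exp(v) for k, v in params.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

base = softmax(intercepts)
shifted = softmax({k: v + effects.get(k, 0.0) for k, v in intercepts.items()})

# log2-fold change between the Salmonella and the control expected sample
log2fc = {k: math.log2(shifted[k] / base[k]) for k in intercepts}
for cell_type, value in log2fc.items():
    print(f"{cell_type:<22} {value:+.3f}")
```

All cell types without a credible effect share the same negative log2-fold change, which is the sum-to-one constraint at work: the Enterocyte increase has to be paid for by everything else.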
Interpretation
In the Salmonella case, we see only a credible increase of Enterocytes, while no credible effect is detected for any other cell type. The log2-fold change of Enterocytes between control and infected samples with the same total cell count lies at about 1.47.
Adjusting the False discovery rate
scCODA selects credible effects based on their inclusion probability. The cutoff between credible and non-credible effects depends on the desired false discovery rate (FDR). A smaller FDR value will produce more conservative results, but might miss some effects, while a larger FDR value selects more effects at the cost of a larger number of false discoveries.
The desired FDR level can be easily set after inference via sim_results.set_fdr(). By default, the value is 0.05, but we recommend increasing it if no effects are found at a more conservative level.
In our example, setting a desired FDR of 0.4 yields non-zero effects on Endocrine, Enterocyte, and Stem cells.
sim_results.set_fdr(est_fdr=0.4)
# The extended summary also reports HDIs, SDs and inclusion probabilities
sim_results.summary_extended()
Compositional Analysis summary (extended):
Data: 6 samples, 8 cell types
Reference index: 3
Formula: Condition
Spike-and-slab threshold: 0.434
MCMC Sampling: Sampled 20000 chain states (5000 burnin samples) in 79.348 sec. Acceptance rate: 51.9%
Intercepts:
Final Parameter HDI 3% HDI 97% SD \
Cell Type
Endocrine 1.102 0.363 1.740 0.369
Enterocyte 2.328 1.694 2.871 0.314
Enterocyte.Progenitor 2.523 1.904 3.088 0.320
Goblet 1.753 1.130 2.346 0.330
Stem 2.705 2.109 3.285 0.318
TA 2.113 1.459 2.689 0.332
TA.Early 2.861 2.225 3.378 0.307
Tuft 0.449 -0.248 1.207 0.394
Expected Sample
Cell Type
Endocrine 34.068199
Enterocyte 116.089840
Enterocyte.Progenitor 141.085258
Goblet 65.324318
Stem 169.247878
TA 93.631267
TA.Early 197.821355
Tuft 17.731884
Effects:
Final Parameter HDI 3% HDI 97% \
Covariate Cell Type
Condition[T.Salm] Endocrine 0.327533 -0.506 1.087
Enterocyte 1.357100 0.886 1.872
Enterocyte.Progenitor 0.000000 -0.395 0.612
Goblet 0.000000 0.000 0.000
Stem -0.240268 -0.827 0.168
TA 0.000000 -0.873 0.252
TA.Early 0.000000 -0.464 0.486
Tuft 0.000000 -1.003 0.961
SD Inclusion probability \
Covariate Cell Type
Condition[T.Salm] Endocrine 0.338 0.457133
Enterocyte 0.276 0.998400
Enterocyte.Progenitor 0.163 0.338200
Goblet 0.000 0.000000
Stem 0.219 0.434800
TA 0.220 0.364000
TA.Early 0.128 0.284733
Tuft 0.319 0.392533
Expected Sample log2-fold change
Covariate Cell Type
Condition[T.Salm] Endocrine 34.413767 0.014560
Enterocyte 328.331183 1.499910
Enterocyte.Progenitor 102.711411 -0.457971
Goblet 47.556726 -0.457971
Stem 96.897648 -0.804604
TA 68.164454 -0.457971
TA.Early 144.015830 -0.457971
Tuft 12.908980 -0.457971
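The threshold of 0.434 and the set of non-zero effects in the extended summary can be retraced from the inclusion probabilities. The sketch below assumes a direct posterior probability style rule: effects are added in order of decreasing inclusion probability as long as the mean of (1 - inclusion probability) over the selected effects stays below the desired FDR. The function `select_credible` is my own illustration of that idea, not scCODA's code, and the reference cell type (Goblet) is left out.

```python
import math

# Inclusion probabilities copied from the extended summary (reference excluded)
inclusion = {
    "Endocrine": 0.457133, "Enterocyte": 0.998400, "Enterocyte.Progenitor": 0.338200,
    "Stem": 0.434800, "TA": 0.364000, "TA.Early": 0.284733, "Tuft": 0.392533,
}

def select_credible(inclusion, fdr):
    """Pick effects by decreasing inclusion probability under an expected-FDR budget."""
    ranked = sorted(inclusion.items(), key=lambda kv: kv[1], reverse=True)
    selected, error_sum, threshold = [], 0.0, 1.0
    for name, prob in ranked:
        new_error = error_sum + (1.0 - prob)
        # Stop once the average error of the selected set reaches the budget
        if new_error / (len(selected) + 1) >= fdr:
            break
        error_sum = new_error
        selected.append(name)
        threshold = math.floor(prob * 1000) / 1000  # cut to three decimals
    return selected, threshold

strict, strict_threshold = select_credible(inclusion, fdr=0.05)
loose, loose_threshold = select_credible(inclusion, fdr=0.4)
print(strict, strict_threshold)
print(loose, loose_threshold)
```

With these numbers the rule keeps only Enterocyte at an FDR of 0.05 and adds Endocrine and Stem at 0.4, which matches the non-zero entries in the two summaries.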
Data visualization
# The plots below use the full dataset created above (data_all)
data_mouse = data_all
# Stacked barplot for each sample
viz.stacked_barplot(data_mouse, feature_name="samples")
plt.show()
# Stacked barplot for the levels of "Condition"
viz.stacked_barplot(data_mouse, feature_name="Condition")
plt.show()
# Grouped boxplots. No facets, relative abundance, no dots.
viz.boxplots(
    data_mouse,
    feature_name="Condition",
    plot_facets=False,
    y_scale="relative",
    add_dots=False,
)
plt.show()
# Grouped boxplots. Facets, log scale, added dots and custom color palette.
viz.boxplots(
    data_mouse,
    feature_name="Condition",
    plot_facets=True,
    y_scale="log",
    add_dots=True,
    cmap="Reds",
)
plt.show()
Finding a reference cell type
The scCODA model requires a cell type to be set as the reference category. However, choosing this cell type is often difficult. A good first choice is a reference cell type that closely preserves the changes in relative abundance during the compositional analysis.
For this, it is important that the reference cell type is not rare, to avoid large relative changes being caused by small absolute changes. Also, the relative abundance of the reference should vary as little as possible across all samples.
The visualization viz.rel_abundance_dispersion_plot shows the presence (share of non-zero samples) over all samples for each cell type versus its dispersion in relative abundance. Cell types that have a higher presence than a certain threshold (default 0.9) are suitable candidates for the reference and thus colored.
viz.rel_abundance_dispersion_plot(
    data=data_mouse,
    abundant_threshold=0.9
)
plt.show()
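The two criteria behind this plot, presence and dispersion, are easy to compute by hand. The following sketch uses a made-up count table (cell types "A" to "D", four samples) and measures dispersion as the variance of the relative abundance, which is an assumption on my part; it illustrates the heuristic described above and is not scCODA's reference selection code.

```python
from statistics import pvariance

# Hypothetical counts: one list per cell type, one entry per sample
counts = {
    "A": [40, 42, 38, 40],
    "B": [100, 60, 120, 80],
    "C": [0, 0, 2, 0],      # rare cell type, absent from most samples
    "D": [60, 98, 40, 80],
}
n_samples = 4
totals = [sum(counts[ct][s] for ct in counts) for s in range(n_samples)]

stats = {}
for cell_type, values in counts.items():
    rel = [v / t for v, t in zip(values, totals)]
    presence = sum(v > 0 for v in values) / n_samples  # share of non-zero samples
    stats[cell_type] = {"presence": presence, "dispersion": pvariance(rel)}

# Keep cell types present in at least 90% of samples, then take the most stable one
candidates = [ct for ct, s in stats.items() if s["presence"] >= 0.9]
reference = min(candidates, key=lambda ct: stats[ct]["dispersion"])
print("candidates:", candidates)
print("suggested reference:", reference)
```

Type C has the lowest dispersion of all, but only because it is almost always zero; the presence filter is what keeps such rare cell types from being picked as the reference.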
Diagnostics and plotting
Similarly to the summary dataframes being compatible with arviz, the result class itself is an extension of arviz’s InferenceData class. This means that we can use all its MCMC diagnostic and plotting functionality. As an example, looking at the MCMC trace plots and kernel density estimates, we see that they are indicative of a well sampled MCMC chain:
Note: Due to the spike-and-slab priors, the beta parameters have many values at 0, which looks like a convergence issue, but is actually not.
Caution: Trying to plot a kernel density estimate for an effect on the reference cell type results in an error, since it is constant at 0 for the entire chain. To avoid this, add coords={"cell_type": sim_results.posterior.coords["cell_type_nb"]} as an argument to az.plot_trace, which causes the plots for the reference cell type to be skipped.
import arviz as az
# sim_results is the result object returned by model_salm.sample_hmc() above
az.plot_trace(
    sim_results,
    divergences=False,
    var_names=["alpha", "beta"],
    coords={"cell_type": sim_results.posterior.coords["cell_type_nb"]},
)
plt.show()
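To see why many zeros in the beta traces are expected rather than alarming, here is a toy illustration of spike-and-slab draws. It uses independent random draws instead of a real MCMC chain, and the inclusion probability of 0.3 and the slab distribution are made up for the example; the share of non-zero draws then plays the role of the inclusion probability reported in the extended summary.

```python
import random

rng = random.Random(1)
inclusion_prob = 0.3  # hypothetical probability that the effect is "switched on"

draws = []
for _ in range(5000):
    if rng.random() < inclusion_prob:
        # Slab: the effect is active and takes a value from a continuous distribution
        draws.append(rng.gauss(0.5, 0.2))
    else:
        # Spike: the effect is exactly zero in this draw
        draws.append(0.0)

share_zero = sum(1 for d in draws if d == 0.0) / len(draws)
estimated_inclusion = 1.0 - share_zero
print(f"share of zeros: {share_zero:.3f}")
print(f"estimated inclusion probability: {estimated_inclusion:.3f}")
```

A trace of such draws sits at zero most of the time with occasional excursions, which is the pattern the note above refers to.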
The example code can be found at scCODA
Finally, let's take a look at the algorithm
Life is good, and even better with you