10X单细胞空间联合分析之十（RCTD）

hello,上一篇文章中，10X空间转录组数据分析之空间注释（解卷积，STdeconvolve）中提到了单细胞空间联合的分析方法，SPOTlight，RCTD，其中SPOTlight已经分享过了,今天加把劲，把单细胞空间联合的分析方法RCTD也分享给大家，供大家认真学习和参考。

RCTD方法的文章Robust decomposition of cell type mixtures in spatial transcriptomics，2021年2月发表于nature biotechnology。IF = 54分，医学类杂志分数都普遍较高，方法的网址在RCTD,下面来几张分析的效果图：

图片.png

当然，方法的灵活运用还是要看我们自己，好了，开始我们这次的分享。

Abstract

空间转录组学技术的局限性在于，单个测量值可能包含来自多个细胞的贡献，阻碍了细胞类型特异性定位和表达空间模式的发现。在这里，作者开发了稳定的细胞类型分解软件(RCTD)，这是一种计算方法，它利用从单细胞 RNA-seq 中学到的细胞类型谱来分解细胞类型混合物（还是上一篇说的不错，很依赖注释好的单细胞数据），同时校正不同测序技术的差异。文中展示了 RCTD 在模拟数据集上检测混合物和识别细胞类型的能力。此外，RCTD 在小鼠大脑的 Slide-seq 和 Visium 数据集中准确地再现了已知的细胞类型和亚型定位模式。最后，分析展示了 RCTD 对细胞类型定位的恢复如何能够在细胞类型中发现基因，其表达取决于空间环境。使用 RCTD 对细胞类型进行空间映射可以定义细胞身份的空间成分，揭示生物组织中细胞组织的新原理。

Introduction

组织由不同的细胞类型和状态组成，其空间组织控制着相互作用和功能。空间转录组学技术的最新进展使 RNA 测序与生物组织中的空间信息的高通量收集成为可能。使用此类技术对细胞类型进行空间映射是了解组织结构的基础。特别是，特定细胞亚型的空间定位知识仍然不完整且难以获得。

空间转录组学技术有可能阐明细胞环境和基因表达之间的相互作用，增加对健康功能和组织疾病状态的了解。空间转录组学数据由每个空间测量位置的基因表达计数组成，这里称为像素（其实就是我们空间数据的spot），平铺二维 (2D) 表面。一项常见的任务是识别表达随空间变化的基因。当前的计算方法在不按细胞类型分层的情况下搜索基因表达的空间模式。然而，这些方法检测到的大部分变异可能是由空间景观中不同的细胞类型组成驱动的，因为单细胞 RNA 测序 (scRNA-seq) 研究表明，细胞类型可以解释群体中的大部分变异。因此，在搜索空间基因表达模式时，有必要考虑细胞类型信息。

即使对于高分辨率方法（例如 Slide-seq），细胞类型的分配在分析上也具有挑战性，因为尽管像素分辨率可以接近哺乳动物细胞的大小（例如，Slide-seq，10 μm），固定像素位置可能与多个细胞重叠。因此，单个像素的基因表达测量可能是多种细胞类型混合的结果。目前，最广泛使用的识别细胞类型的方法依赖于无监督聚类。然而，这种方法不允许细胞类型混合的可能性。因此，一个基本的挑战是将这些混合像素正确识别为多种细胞类型的组合，从而可以更完整地表征空间转录组学中细胞类型的空间定位。

最近的几种方法使用 scRNA-seq 参考来预测空间转录组学数据上的细胞类型；然而，尽管最近有证据表明，scRNA-seq 数据分析中解释基因表达计数数据这些方面的方法优于假设正态性的方法，但有些并未对平台效应、泊松采样和过度分散计数进行统计建模。其他尚未被证明可以扩展到大型数据集的数据集，例如通过 Slide-seq 获得的数据集，并且它们在区域而不是单个像素的水平上执行细胞类型富集测试。

在这里，开发介绍了 RCTD，一种将 RNA 测序混合物分解为单个细胞类型的监督学习方法（就是利用单细胞数据来注释空间转录组），从而能够将细胞类型分配给空间转录组像素。具体来说，利用带注释的 scRNA-seq 数据来定义空间转录组学数据中预期存在的细胞类型的细胞类型特异性概况。几种有监督的细胞类型分配方法在 scRNA-seq 中取得了很高的准确性，但不是为多种细胞类型的混合物设计的。 RCTD 拟合一个统计模型，该模型估计每个像素的细胞类型混合。

监督细胞类型学习的一个相关挑战是所说的平台效应，即依赖于技术的文库制备对测序平台之间单个基因捕获率的影响。分析表明，如果不考虑这些平台效应，监督方法不太可能成功，因为系统技术可变性主导了相关的生物信号。这些影响先前已在同一生物样本上的单细胞和单核 RNA-seq (snRNA-seq) 之间的比较中发现，其中已经表明，例如，核定位基因富含 snRNA-seq . 在这里，证明了 scRNA-seq 参考和空间转录组学目标之间的平台效应在将细胞类型知识转移到空间转录组学时提出了挑战。为了在 RCTD 中实现跨平台学习，作者开发并验证了平台效应规范化程序。

文中的例子证明 RCTD 可以准确地发现模拟和真实空间转录组数据中细胞类型的定位。此外，RCTD 可以检测细微的转录组差异以在空间上映射细胞亚型。最后，使用 RCTD 来计算预期的细胞类型特异性基因表达，从而能够根据细胞的空间环境检测基因表达的变化。分析部分展示了 RCTD 如何学习空间转录组学数据中细胞类型的混合物，促进空间位置和局部细胞环境对细胞类型内基因表达的影响的量化。

Results

Challenges in spatial transcriptomics: cell type mixtures and platform effects.

空间转录组学像素从多个而不是单个细胞中获取 RNA，这为细胞类型学习带来了挑战。在 Slide-seq 小脑数据中，发现最广泛使用的 scRNA-seq 细胞类型识别方法，即无监督聚类，错误地对空间共定位但转录不相似的细胞类型进行了分类。例如，Bergmann and Purkinje cells在空间上共定位于同一层，从而产生具有两种细胞类型标记基因的cluster(下图)

图片.png

注：Expression of Bergmann and Purkinje cell marker genes for pixels colored by unsupervised clustering cell type assignment within a Slide-seq cerebellum dataset. For example, the Bergmann cell markers axis is the sum of the expression (counts per 500) of Bergmann cell differentially expressed genes。

对这一观察结果最可能的解释是这些像素包含两个或多个不同类型的细胞，但无监督聚类将这些双重像素仅分配给一种细胞类型。此外，这种方法不仅预测颗粒层中的颗粒细胞，许多细胞在分子层和少突胶质细胞层内被错误预测，并且颗粒细胞标记物表达较低。

图片.png

注：b, Expression (counts per 500) of granule cell marker genes in Slide-seq; scale bar, 250 μm. c, Spatial plot of granule cells identified by unsupervised clustering. Pixels are colored by whether they spatially belong to the granule cell layer; scale bar, 250 μm

另一个挑战，平台效应，出现在应用监督学习中，其中利用 scRNA-seq 细胞类型谱对空间转录组细胞类型进行分类。例如，在具有已知细胞类型的评估 snRNA-seq 小脑数据集上训练的标准监督学习方法在训练平台中获得的准确度远高于测试平台 scRNA-seq 小脑数据集。

图片.png

注：d, Confusion matrix of true versus predicted cell types within a training dataset (snRNA-seq) by non-negative least squares regression. Color represents the proportion of the cell type on the x axis classified as the cell type on the y axis. The diagonal line representing ground truth is boxed in red. e, Confusion matrix of cell type predictions across platforms using non-negative least-squares regression trained on snRNA-seq data and tested on scRNA-seq data. The same color scale was used as in d

这种差异可以通过平台效应的存在来解释，平台效应会导致基因表达在 snRNA-seq 和 scRNA-seq 之间成倍变化（看来大家在选择单细胞转录组还是单细胞核转录组的时候一定要慎重了）。

图片.png

注：Density plot across genes of measured platform effects between cerebellum scRNA-seq and snRNA-seq data. The platform effect is defined as the log2 ratio of average gene expression between platforms. UBCs, unipolar brush cells.

NMFreg 是一种之前为 Slide-seq 开发的监督细胞类型混合分配算法(关于非负矩阵分解的内容，我分享了很多了，大家可以看看我之前分享的文章)，也没有考虑平台效应。在 Slide-seq 小脑数据集上进行测试时，NMFreg 自信地将少数（n = 11,626 的 24.8%）SPOT分配给细胞类型和错误定位的广泛细胞类型类别。同样，DWLS 是一种为批量 RNA-seq 反卷积设计的方法，它没有考虑平台效应，并且在参考内细胞类型分类方面比跨平台细胞类型分类表现更好。

RCTD enables cross-platform detection of cell type mixtures.

为了解决这些挑战，RCTD 考虑了平台效应，同时使用 scRNA-seq 参考将每个空间转录组学像素分解为单个细胞类型的混合物。 RCTD 首先计算带注释的 scRNA-seq 参考中每种细胞类型的平均基因表达谱。接下来，RCTD 通过将每个空间转录组学像素拟合为单个细胞类型的线性组合来创建细胞类型的空间图。 RCTD 将每个spot的 RNA 测序计数作为输入，并假设多个细胞的未知混合物

图片.png

每种细胞类型对每个基因都贡献了未观察到的计数比例。 RCTD 通过拟合统计模型来估计每个spot的每种细胞类型的比例，其中，对于每个spot i 和基因 j，假设观察到的基因计数 Y 服从泊松分布。rate parameter由spot的总转录本计数、N 和 λ（K 细胞类型表达谱的混合物）决定： (数学方面的知识，大家尽量多多学习)

图片.png

为了考虑平台效应和其他自然变异源，例如空间变异性，我们假设 λ 是一个随机变量，定义为：

图片.png

where μ represents the mean gene expression profile for cell type k, α is a fixed pixel-specific effect, γ is a gene-specific platform random effect and ε is a random effect to account for gene-specific overdispersion.

使用最大似然估计 (MLE) 来推断细胞类型比例 β，表明每个像素中存在哪些细胞类型(后面方法中讨论)。 RCTD 可以在不限制每个像素的细胞类型数量的情况下使用，或者使用双合模式，它搜索每个像素的最佳拟合一种或两种细胞类型。特别是，如果像素仅包含一种细胞类型，将像素称为单峰，如果它们包含两种细胞类型，我们将像素称为双峰。如我们在 Slide-seq 中发现的那样，如果预计三种或更多细胞类型的混合物很少见，双峰模式可能会减轻过度拟合。软件还扩展了双合模式，可选择每个像素适合两种以上的细胞类型

由于无法从原始数据中观察到基因特异性平台效应，作者开发了一个程序来估计具有 RCTD 的测序平台之间的平台效应。在 snRNA-seq 小脑参考和 scRNA-seq 小脑数据集上训练 RCTD，验证了RCTD的方法能够可靠地恢复平台效应（R2 = 0.90）

图片.png

注： Scatter plot of measured versus predicted platform effect (by RCTD) for each gene between the single-cell and single-nucleus cerebellum datasets. The line represents the identity line. The measured platform effect is calculated as the log2 ratio of average gene expression between platforms.

After normalizing cell type profiles for platform effects, RCTD achieved high cross-platform single-cell classification accuracy (89.5% of n = 3,960 cells) Transcriptomically similar cell types, for example, oligodendrocytes/polydendrocytes, accounted for most of the remaining errors (91.8% of n = 415 errors).

图片.png

注：Confusion matrix for RCTD’s performance on cross-platform (trained on snRNA-seq data and tested on scRNA-seq data) cell type assignments for single cells. Color represents the proportion of the cell type on the x axis classified as the cell type on the y axis. The diagonal representing ground truth is boxed in red.

由于空间转录组学数据集中的真实细胞类型身份未知，使用从先前研究中获得的真实细胞类型对 RCTD 在 snRNA-seq 和 scRNA-seq 数据集上的性能进行了基准测试。为了评估 RCTD 在存在平台效应的情况下检测和分解空间转录组学数据中的混合物的能力，我们在 snRNA-seq 小脑参考上训练了 RCTD，并在模拟为单细胞与已知细胞类型的计算混合物的双联数据集上进行了测试。 scRNA-seq 数据集。通过改变真实的基础细胞类型比例，观察到 RCTD 以高精度正确分类了单峰 (89.2% ± 0.5% s.e.) 和双峰 (81.1% ± 0.3% s.e.)

图片.png

很大一部分双峰错误分类来自出现在同一双峰像素上的转录相似细胞类型，RCTD 经常将其错误分类 (87.0% ± 1.2% s.e.) 为单峰。此外，RCTD 以 98.2% 的准确率识别每个双峰上存在的每个细胞类别（方法；跨 66 个细胞类型对为 ±2.8% s.d.）

图片.png

注：On simulated doublets of cell class 1 and cell type 2, the percentage of confident calls by RCTD that correctly identifies the cell class, where cell classes group four pairs of transcriptionally similar cell types based on a previous dendrogram20 (polydendrocytes/oligodendrocytes, molecular layer interneurons 1 and 2 (MLI1/MLI2), Bergmann cells/astrocytes and endothelial cells/fibroblasts). Column represents cell class 1, and color represents cell type 2

最后，RCTD 以 12.8% r.m.s.e. 准确估计了样本中每种细胞类型的比例。（±6.9% s.d. 跨越 66 个细胞类型对）（下图）。这些技术验证表明，RCTD 可以准确地学习单个细胞混合数据集中的细胞类型信息。

图片.png

注：c, On simulated Bergmann–Purkinje doublets, the predicted Bergmann cell proportions by RCTD are shown. The x axis represents the true proportion of UMIs sampled from the Bergmann cell. The red line is the identity line, and the blue line is the average and s.d. (n = 30 simulations per condition) of RCTD’s prediction. d, For each pair of cell types, the r.m.s.e. of predicted versus true cell type proportion (as in c) by RCTD on simulated doublets (n = 390 simulations per cell type pair) is displayed. Column represents cell type 1, and color represents cell type 2.

接下来，将模拟数据上的 RCTD 验证扩展到其他情况，包括每像素不同的唯一分子标识符 (UMI) 计数、每像素超过两种细胞类型以及参考中缺失的细胞类型。我们发现每个像素的额外 UMI 导致置信率增加，并且 RCTD 对包含≥100 个 UMI 的像素实现了高分类精度。此外，我们发现 RCTD 能够准确预测包含三种或四种细胞类型的像素上的细胞类别比例，这是低分辨率空间转录组学中的典型方案。当参考中缺少模拟空间数据中的细胞类型时，RCTD 将像素归类为参考中转录最相似的细胞类型（如果可用）。当参考中没有可用的最接近的细胞类型时，RCTD 以降低的置信率预测细胞类型，但经常错误分类此类像素（我们更加希望的是，没有这种细胞类型的时候预测报错）。

RCTD localizes cell types in spatial transcriptomics data.

接下来应用 RCTD 来分配和分解空间转录组学数据中的细胞类型。我们首先应用 RCTD 来定位小鼠小脑中的细胞类型，使用 snRNA-seq 参考进行训练和在成年小鼠小脑上收集的 Slide-seqV2 数据集作为目标。 RCTD 自信地对大多数像素（n = 11,626 的 86.9%）进行了分类，并且由此产生的细胞类型调用与小脑的空间结构一致

图片.png

注：RCTD’s spatial map of cell type assignments in the cerebellum. Of 19 cell types, the 7 most common appear in the legend

由于空间转录组学数据不存在真实细胞类型标签，为了评估 RCTD 的准确性，我们使用了多种验证策略，包括与标记基因的比较和空间组织的先验知识。虽然应该预期标记基因的存在大致对应于细胞类型的存在，但我们不期望完美的关系，因此结合先前的生物学知识寻找标记基因的存在。我们首先考虑了Purkinje/Bergman cells，这两种细胞类型在小脑中空间共定位。我们发现分配给Purkinje/Bergman细胞类型的 RCTD 的单线态像素不具有其他细胞类型的标记。

图片.png

Moreover, pixels predicted as doublets contained marker genes of both Bergmann and Purkinje cells, with estimated cell type proportion correlating with marker gene ratio。接下来观察到 RCTD 正确地将分子层中间神经元定位到分子层，颗粒细胞定位到颗粒层，少突胶质细胞定位到白质层，RCTD 的分配与每种细胞类型的标记基因之间的空间对应关系进一步支持了预测

图片.png

接下来，为了验证 RCTD 正确定位双峰的能力，我们利用了小脑的分层组织

图片.png

注：Top, schematic of spatial cell type organization within the cerebellum. Bottom, connectivity graph of cell types that are likely to spatially colocalize.

RCTD found doublets within a layer and between adjacent layers but rarely between spatially separated layers

图片.png

为了测试 RCTD 在多个数据集上训练时是否获得一致的结果，我们另外在 scRNA-seq 小脑数据集上训练了 RCTD，并在 Slide-seq 小脑数据上进行了测试。在可靠分类的像素上，RCTD 在两个不同的参考资料上进行了训练，同意 95.7% 的细胞类型预测。我们还验证了 RCTD 在小鼠体感皮层的 Slide-seqV2 数据集中重现皮层细胞类型分层结构的能力，在 Smart-seq2 参考上进行训练我们发现皮层神经元亚型出现在适当的层中，具有 L2/3 端脑内（ IT)，接着是 L4，接着是 L5 IT 和 L5 锥体束 (PT)，接着是 L6 皮质丘脑 (CT) 和 L6 IT，接着是 L6b，这与其他研究的结果一致

RCTD discovers spatial localization of cellular subtypes.

接下来，我们测试了 RCTD 描绘细胞亚型空间定位的能力，正如最近通过大规模转录组学分析所定义的，对于这些细胞亚型在其常驻组织中的空间位置了解有限。为此，我们验证了 RCTD 对先前定义的海马中间神经元亚型进行分类的能力。我们首先使用 RCTD 在小鼠海马体的 Slide-seq 数据中对细胞类型进行空间注释

图片.png

注：RCTD’s spatial map of predicted cell types in the hippocampus. Of 17 cell types, the 8 most common appear in the legend。

在 scRNA-seq 海马数据集上进行训练。我们发现 RCTD 正确定位海马细胞类型。我们还验证了 RCTD 在 Visium 空间转录组学数据集中定位海马细胞类型的能力，并发现 Slide-seq 和 Visium 上预测的细胞类型定位模式之间的定性一致性。然后我们观察了分配给广泛类别的中间神经元的像素的空间聚类

图片.png

我们推断它来自大的单个中间神经元细胞，这一推断得到了组织学检查的支持。因此，我们测试了 RCTD 在将一个簇内的像素分配给同一中间神经元子类时的性能，并发现同一空间cluster内的置信像素之间的粗略子类分类具有很高的一致性（97.1% ± 0.09% s.e.）

图片.png

Additionally, we found that the spatial localization of the basket/OLM subclass coincided with the expression of Sst, a differentially expressed gene for this subclass

图片.png

Finally, we used RCTD to assign each spatial cluster to 1 of 27 transcriptomically defined interneuron subtypes, confidently classifying the majority of interneuron pixels。

图片.png

已知亚型的定位，例如出现在 CA1 的腔隙分子层 (SLM) 层中的 CA1 腔隙 (参考) 和主要出现在层层 (SO) 中的 OLM，与已知的解剖结构一致。我们得出结论，RCTD 能够识别空间转录组学数据中细胞亚型的空间位置。

RCTD enables detection of spatially variable genes within cell type.（检测空间高变基因）

以前的计算方法在不包含细胞类型信息的情况下搜索空间可变基因。然而，由于细胞类型在空间中分布不均，并且不同的细胞类型具有不同的表达谱，这种方法可能会导致将细胞类型标记基因与空间可变基因混淆。例如，我们发现 Slide-seq 海马数据中空间自相关性最高的 20 个基因主要仅在少数细胞类型中表达，表明它们的空间变异部分由细胞类型组成驱动。在调节细胞类型后，这些基因中的大多数表现出小的剩余空间变异。例如，Ptk2b 在兴奋性神经元中差异表达，但没有表现出任何无法单独由细胞类型解释的空间变化

图片.png

相反，RCTD 能够估计每种细胞类型内的空间基因表达模式。确定细胞类型后，我们使用 RCTD 计算每个像素内每种细胞类型的预期细胞类型特异性基因表达。使用这种特定于细胞类型的预期基因表达，我们检测到 CA3 锥体神经元内具有大空间变异的基因。对于这些基因，我们通过局部加权回归恢复了空间上基因表达的平滑模式

图片.png

除了空间可变基因外，RCTD 还可用于检测细胞环境对基因表达的影响。在海马体中，RCTD 在不同的空间区域检测到具有多种细胞类型的星形胶质细胞双联体；我们假设星形胶质细胞转录组可能因细胞环境而异。我们检测到其在星形胶质细胞中的表达依赖于与另一种细胞类型的共定位的基因。例如，我们发现 Entpd2 富含与齿状神经元共定位的星形胶质细胞（P = 0.025，双尾 z 检验）。这与先前的研究一致，该研究在表达 Entpd2 的齿状核中检测到星形胶质细胞样祖细胞群。此外，能够摄取 GABA 神经递质并可能调节抑制性突触的 Slc6a11 在兴奋性神经元周围的星形胶质细胞中差异表达（P < 10-6，双尾 z 检验）。因此，RCTD 能够测量细胞环境和空间对基因表达的影响

图片.png

Discussion

细胞类型的准确空间映射和基因表达的细胞类型特异性空间模式的检测对于理解组织组织和功能至关重要。在这里，介绍了 RCTD，这是一种使用针对平台效应标准化的 scRNA-seq 参考将空间转录组像素准确分解为细胞类型混合物的计算方法。 RCTD 将包含多个细胞的未知混合物的每个像素的 RNA 测序计数作为输入，并预测每个像素上每种细胞类型的比例。 RCTD 准确映射细胞类型，如模拟双联体数据集以及小脑和海马空间转录组学数据集所示。我们还证明了 RCTD 在 Visium 海马空间转录组学数据集中正确定位亚型的能力，表明 RCTD 可以广泛应用于不同的平台。我们进一步表明 RCTD 可以在空间上定位海马中间神经元的转录组学定义的细胞亚型。最后，我们证明了 RCTD 能够发现海马细胞类型内空间变化的基因表达。

随着测序成本的降低，scRNA-seq 数据集变得越来越普遍且更容易生成。单个 scRNA-seq 方法在其平台效应方面可能或多或少类似于空间转录组学数据集，这可以通过 RCTD 进行测量。例如，相对于 Slide-seq，我们发现单细胞海马参考的平台效应比单核小脑参考的平台效应要低。虽然空间平台效应很难先验测量，但我们已经证明我们的平台效应归一化程序对参考（scRNA-seq、snRNA-seq、Smart-seq）的选择具有鲁棒性。因此，我们预计 RCTD 应该与未来的 scRNA-seq 模式兼容。此外，我们的方法可以灵活地选择目标平台。例如，我们估计平台效应的过程仅取决于将所有像素合并为一个伪批量测量。因此，我们的方法可用于估计从 scRNA-seq 参考到任何其他测序技术（包括批量 RNA 测序）的平台效应，为 RNA 测序提供普遍适用的归一化程序。尽管受空间转录组学的启发，我们希望 RCTD 可以在其他非空间数据集上学习单个细胞或多种细胞类型的混合细胞类型。

RCTD 的一个限制是它依赖于平台效应在细胞类型之间共享的假设。这是基于参考的细胞类型学习的普遍问题，在未来的工作中探索学习细胞类型特定的平台效应将很重要。我们还发现 RCTD 的一个具有挑战性的问题是参考中缺少但存在于空间数据中的细胞类型。这个问题可以通过裁剪空间数据以排除先验已知的主要包含参考中不存在的细胞类型的区域来缓解。未来的工作包括改进我们的方法来识别具有参考之外的细胞类型的像素。

当精细的空间分辨率导致三种或更多细胞类型定位到一个像素不常见时（例如，Slide-seq），我们建议使用 RCTD 的双峰模式，它限制每个像素最多两种细胞类型。否则，RCTD 可用于分解每个像素的任意数量的细胞类型（例如，Visium）。原理上类似于 Akaike 信息准则 (AIC) 模型选择方法，双峰模式通过惩罚使用的细胞类型数量来减少过度拟合，提高 RCTD 的统计能力。这个概念可以很容易地扩展到triplets及以后的工作中。

空间转录组学的一个主要目标是了解细胞类型和细胞环境对细胞状态的影响。 RCTD 通过计算每个空间转录组学像素的预期细胞类型特异性基因表达来促进这些效应的发现。例如，分析了星形胶质细胞内的基因表达，以检测受局部细胞环境影响的星形胶质细胞基因。基因依赖细胞环境的驱动因素有很多，包括细胞间相互作用、区域信号因子或发育过程中的细胞历史。 RCTD 独特地定位细胞类型的能力使关于空间和环境对基因表达的影响的生物学相关假设的高通量生成成为可能。随着更多空间转录组学数据集的产生，预计 RCTD 将有助于发现生物组织中细胞组织的新原理。

Method（数学的知识大家多多学习，很多公式写起来很麻烦，就直接截图了）。

图片.png

看看示例代码,示例代码很多，我们就分享其中的一部分。

RCTD on the Hippocampus and spatially localizing three interneuron subclasses

library(RCTD)
library(Matrix)
library(ggplot2)
library(ggpubr)
library(gridExtra)
library(reshape2)
library(Seurat)
source('../../R/RCTD_helper.R')
source('../../R/IRWLS.R')
source('../../R/prob_model.R')

Load in results of RCTD and select interneurons

#given a puck object, returns a puck with counts filtered based on UMI threshold and gene list
restrict_counts <- function(puck, gene_list, UMI_thresh = 1, UMI_max = 20000) {
  keep_loc = (puck@nUMI >= UMI_thresh) & (puck@nUMI <= UMI_max)
  puck@counts = puck@counts[gene_list,keep_loc]
  if(length(puck@cell_labels) > 0) #check cell_labels non null
    puck@cell_labels = puck@cell_labels[keep_loc]
  puck@nUMI = puck@nUMI[keep_loc]
  return(puck)
}

#Command used to save the data from the gather_results.R script:
#save(puck_d, iv, results, file = 'Data/SpatialRNA/Puck_200115_08/results/final_analysis/gathered_results.RData')
#loading in that data:
refdir = '../../Data/Reference/DropVizHC'
load('../../Data/SpatialRNA/Puck_200115_08/results/final_analysis/gathered_results.RData')
results_df <- results$results_df
barcodes <- rownames(results_df)
singlet_ind = results_df$first_type == "Interneuron" & results_df$spot_class == "singlet"
singlet_barcodes <- barcodes[singlet_ind]
doublet_barcodes <- c(barcodes[results_df$first_type == "Interneuron" & results_df$spot_class == "doublet_certain"], barcodes[results_df$first_type == "Interneuron" & results_df$spot_class == "doublet_uncertain"],
                      barcodes[results_df$second_type == "Interneuron" & results_df$spot_class == "doublet_certain"])
doub_first <- c(barcodes[results_df$first_type == "Interneuron" & results_df$spot_class == "doublet_certain"], barcodes[results_df$first_type == "Interneuron" & results_df$spot_class == "doublet_uncertain"])
doub_second <- barcodes[results_df$second_type == "Interneuron" & results_df$spot_class == "doublet_certain"]
second_type_list <- unlist(list(results_df[doub_first,]$second_type,results_df[doub_second,]$first_type))
names(second_type_list) <- doublet_barcodes
inter_barcodes <- c(singlet_barcodes, doublet_barcodes)

puck <- readRDS('../../Data/SpatialRNA/Puck_200115_08/puckCropped.RDS')
cell_type_info <- readRDS(file.path(refdir,'info_renorm_coarse.RDS'))
gene_list <- intersect(rownames(cell_type_info[[1]]),rownames(puck@counts))
puck <- restrict_puck(puck, names(which(puck@nUMI >= 100)))
puck <- restrict_counts(puck, gene_list, UMI_max = 200000)

Q_mat <- readRDS(file.path('../../Data/SpatialRNA/Puck_200115_08/results','Q_mat.RDS'))
N_X = dim(Q_mat)[2]; delta = 1e-5; X_vals = (1:N_X)^1.5*delta
K_val = dim(Q_mat)[1] - 3; use_Q = T

Run RCTD on interneurons to classify into 3 interneuron subtypes

refdir <- '../../Data/Reference/DropVizHC'
inter_names<- c("Basket_OLM" ,'CGE',  "Neurogliaform_Lacunosum")
log_l_thresh <- 10
singlet_ind = results_df$first_type == "Interneuron" & results_df$spot_class == "singlet"
singlet_barcodes <- barcodes[singlet_ind]
doublet_barcodes <- c(barcodes[results_df$first_type == "Interneuron" & results_df$spot_class == "doublet_certain"], barcodes[results_df$first_type == "Interneuron" & results_df$spot_class == "doublet_uncertain"],
                      barcodes[results_df$second_type == "Interneuron" & results_df$spot_class == "doublet_certain"])
doub_first <- c(barcodes[results_df$first_type == "Interneuron" & results_df$spot_class == "doublet_certain"], barcodes[results_df$first_type == "Interneuron" & results_df$spot_class == "doublet_uncertain"])
doub_second <- barcodes[results_df$second_type == "Interneuron" & results_df$spot_class == "doublet_certain"]
second_type_list <- unlist(list(results_df[doub_first,]$second_type,results_df[doub_second,]$first_type))
names(second_type_list) <- doublet_barcodes
inter_barcodes <- c(singlet_barcodes, doublet_barcodes)
N = length(inter_barcodes)
inter_df <- data.frame(best_type = factor(character(N),levels = inter_names), confident = logical(N), score_diff = numeric(N))
rownames(inter_df) <- inter_barcodes
i <- 1
for(barcode in singlet_barcodes) {
  print(i)
  i <- i + 1
  score_best <- 100000
  score_second <- 100000
  best_type <- NULL
  for (type in inter_names) {
    score <- get_singlet_score(cell_type_info, gene_list, puck@counts[gene_list,barcode], puck@nUMI[barcode], type, F)
    if(score < score_best) {
      score_second <- score_best
      score_best <- score
      best_type <- type
    } else if(score < score_second) {
      score_second <- score
    }
    inter_df[barcode,type] <- score
  }
  inter_df[barcode,"confident"] <- (score_second - score_best) > log_l_thresh
  inter_df[barcode,"score_diff"] <- (score_second - score_best)
  inter_df[barcode,"best_type"] <- best_type
}

for(barcode in doublet_barcodes) {
  print(i)
  i <- i + 1
  score_best <- 100000
  score_second <- 100000
  best_type <- NULL
  for (type in inter_names) {
    score <- decompose_sparse(cell_type_info[[1]], gene_list, puck@nUMI[barcode], puck@counts[gene_list,barcode], type1=type, type2=as.character(second_type_list[barcode]), score_mode = T, constrain = F)
    if(score < score_best) {
      score_second <- score_best
      score_best <- score
      best_type <- type
    } else if(score < score_second) {
      score_second <- score
    }
    inter_df[barcode,type] <- score
  }
  inter_df[barcode,"confident"] <- (score_second - score_best) > log_l_thresh
  inter_df[barcode,"score_diff"] <- (score_second - score_best)
  inter_df[barcode,"best_type"] <- best_type
}

Plot the confident classification results in space

my_pal <- c("#D55E00","#9E0073", "#0072B2")
conf_inter <- inter_df$confident
barcodes_cur <- inter_barcodes[conf_inter]
new_class <- inter_df$best_type
names(new_class) <- inter_barcodes
counter_barcodes <- barcodes[results_df$spot_class == "singlet" & results_df$first_type %in% c("CA1","CA3","Denate")]
p3 <- plot_class(puck, barcodes_cur, new_class, counter_barcodes = counter_barcodes) + ggplot2::scale_color_manual("",values = my_pal)+ ggplot2::scale_shape_identity() + ggplot2::theme_classic() + ggplot2::scale_size_identity() + coord_fixed() + guides(colour = guide_legend(override.aes = list(size=2)))+ 
  scale_x_continuous(breaks = c(1500,3500,5500), limits = c(1450,5700)) + scale_y_continuous(breaks = c(2000,3250,4500), limits = c(1800,4700)) + theme(legend.position="top") +theme(legend.text=element_text(size=8),legend.spacing.x = unit(0, 'cm'))+geom_segment(aes(x = 1700, y = 2100, xend = 2084.6, yend = 2100), color = "black")+ theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(), axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())
ggarrange(p3)

image.png

Spatially cluster interneurons and compute rate of agreement of subclass classifications

d <- dist(puck@coords[inter_barcodes,], method = "euclidean")
hc1 <- hclust(d, method = "average")
num_clusters = 200

my_class <- as.factor(cutree(hc1,k=200))
#manually split doublet spatial clusters
library(plyr)
relabel <- read.csv(file.path('../../Data/SpatialRNA/Puck_200115_08','cluster_relabels.csv'))
for(i in unique(relabel$Cluster))
  relabel[relabel$Cluster==i,"barcodes"] <- mapvalues(relabel[relabel$Cluster==i,]$Index,from = which(rownames(puck@coords) %in% inter_barcodes[my_class==i]), to=inter_barcodes[my_class==i])
rownames(relabel) <- relabel$barcodes
new_labels <- as.character(my_class)
names(new_labels) <- names(my_class)
new_labels[relabel$barcode] <- apply(relabel,1,function(x) paste(x[3],x[2],sep='_')) 
new_labels <- as.factor(new_labels)
new_labels <- mapvalues(new_labels, from = levels(new_labels), to = sample(1:length(levels(new_labels))))

total = 0
agree = 0
total_sq = 0
for(i in levels(new_labels)) {
  these_barcodes = which(new_labels == i & conf_inter)
  if(length(these_barcodes) >= 2) {
    total_sq = total_sq + (length(these_barcodes)*(length(these_barcodes)-1)/2)^2
    total = total + length(these_barcodes)*(length(these_barcodes)-1)/2
    agree = agree + sum(unlist(lapply(table(new_class[these_barcodes]),function(x) x*(x-1)/2)))
  }
}
correct = agree / total
paste("Within cluster agreement: ", correct) # 0.971
std_cor = ((agree / total) - (agree / total)^2)*(total_sq/(total^2))
paste("Std dev: ", std_cor) # 0.009

Compare to Sst gene for reference

my_mod <- function(p) {
  p + scale_x_continuous(breaks = c(1500,3500,5500), limits = c(1450,5700)) + scale_y_continuous(breaks = c(2000,3250,4500), limits = c(1800,4700)) + geom_segment(aes(x = 1700, y = 2100, xend = 2084.6, yend = 2100), color = "black")+ theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(), axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())+ theme(legend.position="top")
}
MULT = 500
cur_range = c(0,MULT*0.003)
p4 <-plot_puck_continuous(puck,inter_barcodes,MULT*puck@counts["Sst",inter_barcodes]/puck@nUMI[inter_barcodes],counter_barcodes = counter_barcodes,ylimit=cur_range)
p4 <- my_mod(p4)+ggplot2::scale_colour_gradientn(paste("Sst Expression"), colors = pals::kovesi.rainbow(20), limits = cur_range, breaks = cur_range)
ggarrange(p4)

image.png

Plot all hippocampal cell types

my_mod <- function(p) {
  p + scale_x_continuous(breaks = c(1500,3500,5500), limits = c(1450,5700)) + scale_y_continuous(breaks = c(2000,3250,4500), limits = c(1800,4700)) + geom_segment(aes(x = 1700, y = 2100, xend = 2084.6, yend = 2100), color = "black")+  theme(axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(), axis.title.y=element_blank(),axis.text.y=element_blank(),axis.ticks.y=element_blank())+ theme(legend.position="top")
}
barcodes = rownames(results_df[results_df$spot_class != "reject" & puck@nUMI >= 200,])
my_table = puck@coords[barcodes,]
my_table$class = results_df[barcodes,]$first_type
n_levels = iv$cell_type_info[[3]]
my_pal = pals::kelly(n_levels+1)[2:(n_levels+1)]
names(my_pal) = iv$cell_type_info[[2]]
my_pal_curr <- my_pal
my_pal_curr["Ependymal"] <- "#D55E00"
my_pal_curr["Interneuron"] <- "#E69F00"
my_pal_curr["CA1"] <- "#56B4E9"
my_pal_curr["Denate"] <- "#009E73"
my_pal_curr["Oligodendrocyte"] <- "#EFCB00"
my_pal_curr["CA3"] <- "#0072B2"
my_pal_curr["Microglia_Macrophages"] <- "#000000"
my_pal_curr["Astrocyte"] <- "#CC79A7"
my_pal_curr["Choroid"] <- my_pal["Oligodendrocyte"]
my_pal_curr["Entorihinal"] <- my_pal["CA3"]
pres = unique(as.integer(my_table$class))
pres = pres[order(pres)]
p1 <- ggplot2::ggplot(my_table, ggplot2::aes(x=x, y=y)) + ggplot2::geom_point(ggplot2::aes(size = .05, shape=19,color=class)) + ggplot2::scale_color_manual("",values = my_pal_curr[pres],breaks = c('Astrocyte','Denate','Interneuron','Oligodendrocyte','Microglia_Macrophages','Ependymal','CA1','CA3'), labels = c('Astrocyte','Dentate','Interneuron','Oligo','Microglia','Ependymal','CA1','CA3'))+ ggplot2::scale_shape_identity() + ggplot2::theme_classic() + ggplot2::scale_size_identity() + coord_fixed() + guides(colour = guide_legend(override.aes = list(size=2)))+ theme(legend.position="top") +theme(legend.text=element_text(size=8),legend.spacing.x = unit(0, 'cm'))
p1 <- my_mod(p1)

ggarrange(p1)

image.png

Find marker genes

get_marker_data <- function(cell_type_names, cell_type_means, gene_list) {
  marker_means = cell_type_means[gene_list,]
  marker_norm = marker_means / rowSums(marker_means)
  marker_data = as.data.frame(cell_type_names[max.col(marker_means)])
  marker_data$max_epr <- apply(cell_type_means[gene_list,],1,max)
  colnames(marker_data) = c("cell_type",'max_epr')
  rownames(marker_data) = gene_list
  marker_data$log_fc <- 0
  epsilon <- 1e-9
  for(cell_type in unique(marker_data$cell_type)) {
    cur_genes <- gene_list[marker_data$cell_type == cell_type]
    other_mean = rowMeans(cell_type_means[cur_genes,cell_type_names != cell_type])
    marker_data$log_fc[marker_data$cell_type == cell_type] <- log(epsilon + cell_type_means[cur_genes,cell_type]) - log(epsilon + other_mean)
  }
  return(marker_data)
}

iv <- init_RCTD(load_info_renorm = T)
cur_cell_types <- c("Astrocyte","CA1","CA3","Denate","Interneuron","Neurogenesis","Oligodendrocyte")
puck <- readRDS('../../data/SpatialRNA/Puck_200115_08/puckCropped.RDS')
reference <- readRDS('../../data/Reference/DropVizHC/scRefSubsampled1000.RDS')
myRCTD <- create.RCTD(puck, reference)
cell_type_info_restr = myRCTD@cell_type_info$info
cell_type_info_restr[[1]] = cell_type_info_restr[[1]][,cur_cell_types]
cell_type_info_restr[[2]] = cur_cell_types; cell_type_info_restr[[3]] = length(cur_cell_types)

de_genes <- get_de_genes(cell_type_info_restr, puck, fc_thresh = 3, expr_thresh = .0001, MIN_OBS = 3)
#de_genes_spec <- get_de_genes(cell_type_info_restr, puck, fc_thresh = 4, expr_thresh = .0001, MIN_OBS = 3)
marker_data_de = get_marker_data(cell_type_info_restr[[2]], cell_type_info_restr[[1]], de_genes)
saveRDS(marker_data_de, '../../data/SpatialRNA/Puck_200115_08/results/marker_data_de_standard.RDS')

Plot interneurons (by RCTD) and interneuron markers

my_pal = pals::brewer.blues(20)[2:20]
cell_type = "Interneuron"
marker_data_de <- readRDS('../../Data/SpatialRNA/Puck_200115_08/results/marker_data_de_standard.RDS')
cur_range <- c(0,0.010*MULT)
gran_genes <- intersect(rownames(marker_data_de[marker_data_de$cell_type == cell_type,]), rownames(puck@counts))
p7 <- RCTD::plot_puck_continuous(puck,colnames(puck@counts)[puck@nUMI >= 300], MULT*colSums(puck@counts[gran_genes,])/puck@nUMI, ylimit = cur_range, size = .4, alpha = 1, small_point = T)
p7 <- my_mod(p7)+ggplot2::scale_colour_gradientn(paste(cell_type,"Markers"), colors = my_pal, limits = cur_range, breaks = cur_range)

cur_range <- c(0,1)
all_weights <- results$weights_doublet[puck@nUMI >= 300 & results_df$spot_class == "doublet_certain" & results_df$second_type == cell_type,2, drop=FALSE]
all_weights <- rbind(all_weights, results$weights_doublet[puck@nUMI >= 100 & !(results_df$spot_class == "reject") & results_df$first_type == cell_type,1,drop=FALSE])
all_weights_vec <- as.vector(all_weights); names(all_weights_vec) <- rownames(all_weights)
p8 <- RCTD::plot_puck_continuous(puck, rownames(all_weights), all_weights_vec, ylimit = cur_range, size = .6, alpha = 1, small_point = T)
p8 <- my_mod(p8)+ggplot2::scale_colour_gradientn(paste(cell_type,"Weight"), colors = my_pal, limits = cur_range, breaks = cur_range)
ggarrange(p7,p8,nrow=1,ncol=2)

image.png

生活很好，有你更好

10X单细胞空间联合分析之十（RCTD）

RCTD方法的文章Robust decomposition of cell type mixtures in spatial transcriptomics，2021年2月发表于nature biotechnology。IF = 54分，医学类杂志分数都普遍较高，方法的网址在RCTD,下面来几张分析的效果图：

当然，方法的灵活运用还是要看我们自己，好了，开始我们这次的分享。

Abstract

Introduction

Results

Challenges in spatial transcriptomics: cell type mixtures and platform effects.

这种差异可以通过平台效应的存在来解释，平台效应会导致基因表达在 snRNA-seq 和 scRNA-seq 之间成倍变化（看来大家在选择单细胞转录组还是单细胞核转录组的时候一定要慎重了）。

RCTD enables cross-platform detection of cell type mixtures.

为了考虑平台效应和其他自然变异源，例如空间变异性，我们假设 λ 是一个随机变量，定义为 ：

where μ represents the mean gene expression profile for cell type k, α is a fixed pixel-specific effect, γ is a gene-specific platform random effect and ε is a random effect to account for gene-specific overdispersion.

由于无法从原始数据中观察到基因特异性平台效应，作者开发了一个程序来估计具有 RCTD 的测序平台之间的平台效应。 在 snRNA-seq 小脑参考和 scRNA-seq 小脑数据集上训练 RCTD，验证了RCTD的方法能够可靠地恢复平台效应（R2 = 0.90）

最后，RCTD 以 12.8% r.m.s.e. 准确估计了样本中每种细胞类型的比例。 （±6.9% s.d. 跨越 66 个细胞类型对）（下图）。 这些技术验证表明，RCTD 可以准确地学习单个细胞混合数据集中的细胞类型信息。

RCTD localizes cell types in spatial transcriptomics data.

接下来，为了验证 RCTD 正确定位双峰的能力，我们利用了小脑的分层组织

RCTD found doublets within a layer and between adjacent layers but rarely between spatially separated layers

RCTD discovers spatial localization of cellular subtypes.

Additionally, we found that the spatial localization of the basket/OLM subclass coincided with the expression of Sst, a differentially expressed gene for this subclass

Finally, we used RCTD to assign each spatial cluster to 1 of 27 transcriptomically defined interneuron subtypes, confidently classifying the majority of interneuron pixels。

已知亚型的定位，例如出现在 CA1 的腔隙分子层 (SLM) 层中的 CA1 腔隙 (参考) 和主要出现在层层 (SO) 中的 OLM，与已知的解剖结构一致。 我们得出结论，RCTD 能够识别空间转录组学数据中细胞亚型的空间位置。

RCTD enables detection of spatially variable genes within cell type.（检测空间高变基因）

Discussion

Method（数学的知识大家多多学习，很多公式写起来很麻烦，就直接截图了）。

看看示例代码,示例代码很多，我们就分享其中的一部分。

RCTD on the Hippocampus and spatially localizing three interneuron subclasses

Load in results of RCTD and select interneurons

Run RCTD on interneurons to classify into 3 interneuron subtypes

Plot the confident classification results in space

Spatially cluster interneurons and compute rate of agreement of subclass classifications

Compare to Sst gene for reference

Plot all hippocampal cell types

Find marker genes

Plot interneurons (by RCTD) and interneuron markers

你可能感兴趣的:(10X单细胞空间联合分析之十（RCTD）)

为了考虑平台效应和其他自然变异源，例如空间变异性，我们假设 λ 是一个随机变量，定义为：

由于无法从原始数据中观察到基因特异性平台效应，作者开发了一个程序来估计具有 RCTD 的测序平台之间的平台效应。在 snRNA-seq 小脑参考和 scRNA-seq 小脑数据集上训练 RCTD，验证了RCTD的方法能够可靠地恢复平台效应（R2 = 0.90）

最后，RCTD 以 12.8% r.m.s.e. 准确估计了样本中每种细胞类型的比例。（±6.9% s.d. 跨越 66 个细胞类型对）（下图）。这些技术验证表明，RCTD 可以准确地学习单个细胞混合数据集中的细胞类型信息。

已知亚型的定位，例如出现在 CA1 的腔隙分子层 (SLM) 层中的 CA1 腔隙 (参考) 和主要出现在层层 (SO) 中的 OLM，与已知的解剖结构一致。我们得出结论，RCTD 能够识别空间转录组学数据中细胞亚型的空间位置。