10X single-cell (10X spatial transcriptomics) batch-effect removal (integration) analysis with Scanorama

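Hello everyone. Today we look at Scanorama, an integration method commonly used together with scanpy. Anyone who has analyzed data with scanpy has probably come across it. It is worth covering because, in certain situations, it gives the best results. The paper is "Efficient integration of heterogeneous single-cell transcriptomes using Scanorama", published in Nature Biotechnology in 2019 (impact factor 36).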

Let's briefly review the paper; the code should already be familiar to most readers.

Abstract

1. but current methods for scRNA-seq data integration are limited by a requirement for datasets to derive from functionally similar cells (the most typical example is the CCA-based integration in Seurat).

2. We present Scanorama, an algorithm that identifies and merges the shared cell types among all pairs of datasets and accurately integrates heterogeneous collections of scRNA-seq data. (Scanorama can preserve the heterogeneity between datasets, which is the most important point.)

Introduction

1. these approaches automatically assume that all datasets share at least one cell type in common [9] or that the gene expression profiles share largely the same correlation structure across all datasets. (This is the basis of anchor finding.) These methods often overcorrect, especially when the datasets differ biologically to begin with.

2. Scanorama: analogous to computer vision algorithms for panorama stitching that identify images with overlapping content and merge these into a larger panorama (in other words, integration relies on some cell types being shared between datasets).


3. Scanorama automatically identifies scRNA-seq datasets containing cells with similar transcriptional profiles and can leverage those matches for batch correction and integration, without also merging datasets that do not overlap. In principle, this is a sound design.


Figure caption: A similar strategy can also be used to merge heterogeneous scRNA-seq datasets. Scanorama searches nearest neighbors to identify shared cell types among all pairs of datasets. Dimensionality reduction techniques and an approximate nearest-neighbors algorithm based on hyperplane locality sensitive hashing and random projection trees greatly accelerates the search step. Mutually linked cells form matches that can be leveraged to correct for batch effects and merge experiments together (Methods), whereby the datasets forming connected components on the basis of these matches become a scRNA-seq ‘panorama’.

4. Advantages of the method: Scanorama is robust to different dataset sizes and sources, preserves dataset-specific populations and does not require that all datasets share at least one cell population (the key point is "does not require that all datasets share at least one cell population", so the heterogeneity of the data is preserved).

5. Our approach generalizes mutual nearest-neighbors matching, a technique that finds similar elements between two datasets, to instead find similar elements among many datasets. (That is, neighbors are searched between every pair of datasets, rather than requiring neighbors common to all of them.)

6. For integrating multiple datasets, existing methods select one dataset as a reference and successively integrate all other datasets into the reference, one at a time, which may lead to suboptimal results depending on the order in which the datasets are considered. This is indeed a serious problem, although newer versions of Seurat have improved on it.


7. Although Scanorama takes a similar approach when aligning a collection of two datasets, on larger collections of data it is insensitive to order and less vulnerable to overcorrection because it finds matches between all pairs of datasets. (It is insensitive to the choice of reference and less prone to overcorrection.)

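8. Finding matching cells between datasets involves two key steps. First, instead of performing the nearest-neighbor search in the high-dimensional gene space, the method compresses each cell's gene expression profile into a low-dimensional embedding using an efficient randomized singular value decomposition (SVD, from linear algebra) of the cell-by-gene expression matrix. This also makes the method more robust to noise; in short, neighbors are searched in a low-dimensional space. Second, we use an approximate nearest neighbor search based on hyperplane locality sensitive hashing (for hyperplanes, see the article "10X single-cell (10X spatial transcriptomics) dimensionality reduction with tSNE (algorithm basics)") and random projection trees to greatly reduce the nearest neighbor query time both asymptotically and in practice.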
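To make the two steps concrete, here is a minimal NumPy sketch, not Scanorama's actual implementation. The function names randomized_svd_embed and mutual_nearest_neighbors, the toy data and all parameter values are made up for illustration. The neighbor search is brute force; Scanorama replaces it with the approximate search described above.

```python
import numpy as np


def randomized_svd_embed(X, dim, n_oversample=10, seed=0):
    """Embed the rows of X (cells) in an approximate top-`dim` singular subspace."""
    rng = np.random.default_rng(seed)
    # A random projection captures the dominant column space of X.
    Q, _ = np.linalg.qr(X @ rng.standard_normal((X.shape[1], dim + n_oversample)))
    # Exact SVD of the much smaller projected matrix.
    _, _, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return X @ Vt[:dim].T  # cells x dim embedding


def mutual_nearest_neighbors(A, B, k=5):
    """Return pairs (i, j) where A[i] and B[j] are in each other's k nearest neighbors."""
    # Brute-force squared distances (assumption: small data; not an approximate search).
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    a_to_b = np.argsort(d, axis=1)[:, :k]    # for each A cell, its k nearest B cells
    b_to_a = np.argsort(d, axis=0)[:k, :].T  # for each B cell, its k nearest A cells
    pairs = set()
    for i, neighbors in enumerate(a_to_b):
        for j in neighbors:
            if i in b_to_a[j]:
                pairs.add((i, int(j)))
    return pairs


# Toy data: batch A contains two cell types, batch B contains only the first one.
rng = np.random.default_rng(1)
centers = rng.normal(size=(2, 200)) * 5
batch_a = np.vstack([centers[0] + rng.normal(size=(30, 200)),
                     centers[1] + rng.normal(size=(30, 200))])
batch_b = centers[0] + rng.normal(size=(40, 200))

# Embed both batches in one shared low-dimensional space, then match.
emb = randomized_svd_embed(np.vstack([batch_a, batch_b]), dim=10)
emb_a, emb_b = emb[:60], emb[60:]
pairs = mutual_nearest_neighbors(emb_a, emb_b, k=5)
```

In this toy setup, only the cells of the shared type in batch A (rows 0 to 29) end up in matched pairs. The type found only in batch A gets no match, which is the behavior that lets a dataset-specific population survive integration.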

A quick look at the example results in the paper

Result 1: Improved integration of simulated and toy scRNA-seq datasets.

To verify the merit of our approach, we first tested Scanorama on simulated data and a small collection of scRNA-seq datasets.
The tests covered two kinds of data, simulated and real.
In both cases, we were able to merge common cell types across datasets without also merging disparate cell types together.

CCA and MNN clearly overcorrect here. In contrast, existing integration methods are either sensitive to the order in which datasets are considered or are highly prone to overcorrection.


scran MNN corrected (c) and Seurat CCA integrated (d)

Both show overcorrection; it is unclear how a comparison with Harmony would turn out.

Result 2: Scanorama integrates 105,476 cells from 26 diverse datasets.
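With this many datasets, the idea from the figure caption matters: datasets that are linked by enough matches form connected components, and each component becomes one panorama. Below is a small pure-Python sketch of that grouping step using union-find. The function name find_panoramas, the match fractions and the 0.1 cutoff are assumptions for illustration, not values taken from the paper.

```python
def find_panoramas(n_datasets, match_fraction, threshold=0.1):
    """Group datasets into connected components of the 'enough matches' graph."""
    parent = list(range(n_datasets))

    def find(x):
        # Follow parent links to the component root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Link two datasets only when a large enough share of cells is matched.
    for (a, b), frac in match_fraction.items():
        if frac >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for d in range(n_datasets):
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())


# Hypothetical match fractions between 5 datasets:
# 0, 1 and 2 overlap; 3 and 4 overlap; 0 and 3 barely match.
fractions = {(0, 1): 0.6, (1, 2): 0.3, (0, 3): 0.02, (3, 4): 0.5}
panoramas = find_panoramas(5, fractions)
print(panoramas)  # [[0, 1, 2], [3, 4]]
```

Datasets 0 and 2 share a panorama even though they have no direct match, because both are linked to dataset 1. The weak 0 and 3 link is ignored, so the two groups are not merged.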


Result 2 only needs a quick look; the example data shown in a paper will always make the authors' own software look best.

A quick look at the example code

It really comes down to a single function, scanorama.integrate_scanpy, which works on scanpy's AnnData objects.

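import numpy as np
import scanorama

# adatas: list of AnnData objects, one per batch (assumed to be prepared beforehand).
# Integrates in place and stores the result in each object's obsm['X_scanorama'].
scanorama.integrate_scanpy(adatas, dimred=50)

# Get all the integrated matrices.
scanorama_int = [ad.obsm['X_scanorama'] for ad in adatas]

# Make them into one matrix.
all_s = np.concatenate(scanorama_int)
print(all_s.shape)

# Add to the merged AnnData object; adata is assumed to hold the same cells
# in the same order as the batches in adatas.
adata.obsm["Scanorama"] = all_s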
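One thing to watch in the concatenation step: the rows of the stacked matrix must follow the same cell order as the merged object, otherwise each cell gets another cell's embedding. The sketch below shows a simple order check with plain NumPy arrays standing in for the AnnData objects. The names per_batch and merged_batches, and the batch labels, are hypothetical.

```python
import numpy as np

# Hypothetical per-batch embeddings standing in for ad.obsm['X_scanorama'].
per_batch = {"batch_a": np.zeros((3, 50)), "batch_b": np.ones((2, 50))}

# Batch label of every cell in the merged object, in its stored row order.
merged_batches = ["batch_a", "batch_a", "batch_a", "batch_b", "batch_b"]

# Concatenate in the order in which batches first appear in the merged object.
order = list(dict.fromkeys(merged_batches))
all_s = np.concatenate([per_batch[b] for b in order])

# Check that each batch occupies exactly the rows the merged object expects.
start = 0
for b in order:
    n = merged_batches.count(b)
    assert merged_batches[start:start + n] == [b] * n, "batches are interleaved"
    assert per_batch[b].shape[0] == n, f"{b}: cell count mismatch"
    start += n

print(all_s.shape)  # (5, 50)
```

If either assertion fails, the merged object was built in a different order from the list passed to the integration step, and the embedding should not be assigned as is.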

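Note that scanpy has long since been updated and the function name has changed (scanpy itself provides a wrapper, scanpy.external.pp.scanorama_integrate), so check the official documentation regularly and keep improving step by step.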

Life is good, and it is waiting for you to go further.
