Single-cell data integration, part 1: how Harmony works and the official tutorial

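Contents:
1. How it works
2. Walkthrough
3. Does running Harmony affect differential expression analysis?
4. Summary of multi-sample batch-correction methods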

1. How it works

Official repository: https://github.com/immunogenomics/harmony
(Harmony requires R 3.4 or later and runs on Linux, OS X, and Windows.)
Paper: https://www.biorxiv.org/content/early/2018/11/04/461954
Advantages of the Harmony algorithm over other integration methods:

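(1) it integrates datasets while remaining sensitive to rare cell populations;
(2) it uses little memory;
(3) it suits more complex single-cell experimental designs, allowing comparison of cells from different donors, tissues, and technology platforms.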

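Basic idea: picture the cells with color indicating the dataset and shape indicating the cell type. Harmony first uses principal component analysis (PCA) to embed the transcriptome expression profiles in a low-dimensional space, then applies an iterative procedure to remove dataset-specific effects.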

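(A) Harmony probabilistically assigns cells to clusters so that the diversity of datasets within each cluster is maximized.
(B) For each cluster, Harmony computes a global centroid across all datasets as well as a centroid for each dataset.
(C) Within each cluster, Harmony computes a correction factor for each dataset from these centroids.
(D) Finally, Harmony corrects each cell with a cell-specific factor derived from (C). Because Harmony uses soft clustering, each cell's correction is a linear combination of the cluster-level correction factors, weighted by the soft cluster assignments made in (A).
Steps A to D are repeated until convergence. The dependence between cluster assignment and dataset decreases with each round.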
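Steps A to D can be sketched in a few lines of base R. This is a toy illustration under loud assumptions, not the harmony package's implementation: the simulated two-batch embedding, the hand-picked starting centroids, the function name harmony_like_step, and the plain softmax assignment (with no diversity penalty) are all made up for the example, and the real package is considerably more elaborate.

```r
# Toy illustration only: NOT the harmony package's implementation.
set.seed(1)
n <- 200
batch <- rep(c("A", "B"), each = n / 2)
type  <- rep(rep(c(1, 2), each = n / 4), times = 2)  # two cell types per batch
# Simulated 2-D "PCA" embedding: the cell types sit at x = 0 and x = 6,
# and batch B is shifted by +1.5 along y (the batch effect to remove).
Z <- cbind(
  x = (type - 1) * 6 + rnorm(n, sd = 0.5),
  y = ifelse(batch == "B", 1.5, 0) + rnorm(n, sd = 0.5)
)

harmony_like_step <- function(Z, batch, centers, sigma = 1) {
  # (A) soft assignment: softmax of negative squared distance to each centroid
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(Z, 2, centers[k, ])^2)
  })
  R <- exp(-(d2 - apply(d2, 1, min)) / sigma)
  R <- R / rowSums(R)
  Z_new <- Z
  for (k in seq_len(ncol(R))) {
    # (B) global centroid of cluster k, weighted by soft membership
    mu_k <- colSums(Z * R[, k]) / sum(R[, k])
    for (b in unique(batch)) {
      idx <- batch == b
      # (B) centroid of cluster k restricted to dataset b
      mu_kb <- colSums(Z[idx, , drop = FALSE] * R[idx, k]) / sum(R[idx, k])
      # (C) correction factor for dataset b in cluster k
      shift <- mu_kb - mu_k
      # (D) each cell moves by the factor, weighted by its membership in k
      Z_new[idx, ] <- Z_new[idx, ] - outer(R[idx, k], shift)
    }
  }
  Z_new
}

centers <- rbind(c(0, 0.75), c(6, 0.75))  # hand-picked starting centroids (assumption)
batch_gap <- function(M) abs(mean(M[batch == "B", "y"]) - mean(M[batch == "A", "y"]))

gap_before <- batch_gap(Z)
Z_corr <- Z
for (i in 1:3) Z_corr <- harmony_like_step(Z_corr, batch, centers)
gap_after <- batch_gap(Z_corr)
type_gap_after <- mean(Z_corr[type == 2, "x"]) - mean(Z_corr[type == 1, "x"])
```

In this toy, gap_before measures the batch offset along y and gap_after shows how much of it is left, while type_gap_after checks that the separation between the two cell types along x survives the correction.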

2. Walkthrough

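Installing the R package

library(devtools)
install_github("immunogenomics/harmony")
# harmony is also available on CRAN: install.packages("harmony")

Installation may involve compiling C++ code from source, so it can take a few minutes.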

Download the example sparse matrices (https://www.dropbox.com/s/t06tptwbyn7arb6/pbmc_stim.RData?dl=1).

library(Seurat)
library(cowplot)
library(harmony)
library(dplyr)  # provides the %>% pipe used below
load('data/pbmc_stim.RData')  # loads the sparse matrices stim.sparse and ctrl.sparse
# Before running Harmony, create a Seurat object and run the standard workflow through PCA.
# The %>% pipe passes the value on its left to the function on its right as that function's first argument.
pbmc <- CreateSeuratObject(counts = cbind(stim.sparse, ctrl.sparse), project = "PBMC", min.cells = 5) %>%
    Seurat::NormalizeData(verbose = FALSE) %>%
    FindVariableFeatures(selection.method = "vst", nfeatures = 2000) %>%
    ScaleData(verbose = FALSE) %>%
    RunPCA(npcs = 20, verbose = FALSE)  # uses the variable features found above by default
# Record the condition of each cell, in the same column order as the cbind() above.
pbmc@meta.data$stim <- c(rep("STIM", ncol(stim.sparse)), rep("CTRL", ncol(ctrl.sparse)))

Before correction, the two datasets differ clearly in PC space:

options(repr.plot.height = 5, repr.plot.width = 12)
p1 <- DimPlot(object = pbmc, reduction = "pca", pt.size = .1, group.by = "stim")
p2 <- VlnPlot(object = pbmc, features = "PC_1", group.by = "stim", pt.size = .1)
plot_grid(p1,p2)

Run Harmony

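The simplest way to run Harmony is to pass the Seurat object and specify which variable to integrate over. RunHarmony returns the Seurat object with the corrected Harmony coordinates added as a new reduction (use "harmony" in place of "pca" downstream). Setting plot_convergence to TRUE lets us check that the Harmony objective function improves with each round.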

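Main parameters of RunHarmony:

group.by.vars: the metadata variable(s) to integrate over.

max.iter.harmony: the maximum number of Harmony iterations, 10 by default. RunHarmony reports how many iterations it took to converge.

⚠️ lambda: the ridge regression penalty, 1 by default; it controls how strongly Harmony corrects. Smaller values give more aggressive integration, larger values gentler integration. This is the most direct knob for integration strength (theta, below, also influences how mixed the clusters end up); a typical tuning range is 0.5 to 2.

⚠️ theta: Diversity clustering penalty parameter. Specify for each variable in group.by.vars. Default theta=2. theta=0 does not encourage any diversity. Larger values of theta result in more diverse clusters. I usually keep the default, but the value often differs from paper to paper.

⚠️ dims.use: Which PCA dimensions to use for Harmony. By default, use all.

sigma: Width of soft kmeans clusters. Default sigma=0.1. Sigma scales the distance from a cell to cluster centroids. Larger values of sigma result in cells assigned to more clusters. Smaller values of sigma make soft kmeans clustering approach hard clustering.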
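To build some intuition for sigma and theta, here is a minimal base-R sketch of a soft assignment for a single cell. It is an illustration, not the harmony package's code: the function name soft_assign, the distances, the ratio values, and the simple penalty term ratio^theta (expected over observed share of the cell's batch in each cluster) are assumptions chosen to mimic the behavior described in the parameter list.

```r
# Toy soft assignment for ONE cell; an illustration, not the harmony package's code.
# d2:    squared distances from the cell to each cluster centroid (made-up numbers)
# ratio: expected / observed share of the cell's own batch in each cluster;
#        > 1 means the batch is under-represented there (hypothetical values)
soft_assign <- function(d2, sigma, ratio = rep(1, length(d2)), theta = 0) {
  # Subtracting min(d2) keeps exp() numerically stable without changing the result.
  w <- exp(-(d2 - min(d2)) / sigma) * ratio^theta
  w / sum(w)
}

d2    <- c(0.10, 0.20, 0.90)
ratio <- c(0.5, 2.0, 1.0)

sharp   <- soft_assign(d2, sigma = 0.05)  # small sigma: close to hard clustering
diffuse <- soft_assign(d2, sigma = 1)     # large sigma: membership spread over clusters
diverse <- soft_assign(d2, sigma = 1, ratio = ratio, theta = 2)  # penalty switched on

print(round(rbind(sharp, diffuse, diverse), 3))
```

With theta = 0 the penalty term equals 1 and only distance matters. With theta > 0 the cell is pulled toward clusters where its batch is under-represented, which is how a diversity penalty can make clusters more mixed.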

options(repr.plot.height = 2.5, repr.plot.width = 6)
pbmc <- pbmc %>%
    RunHarmony("stim", plot_convergence = TRUE)  # Harmony converged after 8 iterations

The Harmony results are stored in:

pbmc@reductions$harmony

Use the Embeddings function to access the new Harmony embeddings.

harmony_embeddings <- Embeddings(pbmc, 'harmony')
harmony_embeddings[1:5, 1:5]

Let's confirm that the datasets are well integrated in the first two dimensions after running Harmony.

options(repr.plot.height = 5, repr.plot.width = 12)
p1 <- DimPlot(object = pbmc, reduction = "harmony", pt.size = .1, group.by = "stim")
p2 <- VlnPlot(object = pbmc, features = "harmony_1", group.by = "stim", pt.size = .1)
plot_grid(p1,p2)

Downstream analysis

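Many downstream analyses operate on a low-dimensional embedding rather than on gene expression. To use the corrected Harmony embeddings instead of the PCs, set reduction = "harmony".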

pbmc <- pbmc %>%
    RunUMAP(reduction = "harmony", dims = 1:20) %>%
    FindNeighbors(reduction = "harmony", dims = 1:20) %>%
    FindClusters(resolution = 0.5) %>%
    identity()

The UMAP embedding reveals more intricate structure. Because it was computed from the Harmony embeddings, the two datasets are well mixed in it.

options(repr.plot.height = 4, repr.plot.width = 10)
DimPlot(pbmc, reduction = "umap", group.by = "stim", pt.size = .1, split.by = 'stim')

t-SNE analysis

pbmc <- RunTSNE(pbmc, reduction = "harmony", dims = 1:20)
TSNEPlot(object = pbmc, pt.size = 0.5, label = TRUE, split.by = 'stim')

t-SNE and UMAP plots with the two samples combined

DimPlot(pbmc, reduction = "umap",pt.size = .1,  label = TRUE)
TSNEPlot(pbmc, pt.size = .1, label = TRUE)

From here you can look for differentially expressed genes and annotate the cells.

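3. Does running Harmony affect differential expression analysis?

Harmony takes the data in scRNA@reductions$pca as input and stores its output in scRNA@reductions$harmony.

Differential expression analysis works on the expression values in scRNA@assays$RNA (the raw counts in scRNA@assays$RNA@counts or the normalized values derived from them), which Harmony never modifies, so the two do not affect each other.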

4. Summary of multi-sample batch-correction methods

(Language: R = R, P = Python.)

Tool	Language	Batch-effect-corrected output	Method
Seurat2	R	Normalized canonical components	Canonical correlation analysis and dynamic time warping
`Seurat3`	R	Normalized gene expression matrix	CCA and mutual nearest neighbors-anchors
`Harmony`	R	Normalized feature reduction vectors	Iterative clustering in dimensionally reduced space
`MNN Correct`	R	Normalized gene expression matrix	Mutual nearest neighbor in gene expression space
fastMNN	R	Normalized principal components	MNN in dimensionally reduced space
ComBat	R	Normalized gene expression matrix	Adjusts for known batches using an empirical Bayesian framework
limma	R	Normalized gene expression matrix	Linear model/empirical Bayes model
scGen	P	Normalized gene expression matrix	Variational auto-encoders neural network model and latent space
Scanorama	R/P	Normalized gene expression matrix	Mutual nearest neighbor and panoramic stitching
MMD-ResNet	P	Normalized principal components	Residual neural network for calibration
ZINB-WaVE	R	Normalized gene expression matrix	Zero-inflated negative binomial model, extension of RUV model
scMerge	R	Normalized gene expression matrix	Stably expressed genes (scSEGs) and RUVIII model
`LIGER`	R	Normalized feature reduction vectors	Integrative non-negative matrix factorization (iNMF) and joint clustering + quantile alignment
`BBKNN`	P	Connectivity graph and normalized dimension reduction vectors (UMAP)	Batch balanced k-nearest neighbors
