10X single-cell (10X spatial transcriptomics) analysis review: the many uses of harmony (with NMF and the Python harmonypy)

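Day eight of quarantine. The loneliness is slowly fading; music, friends, family, and games really do lift the mood. A sage once said that whether we live happily does not come down to owning a house, a car, or being with the girl we like, but to having good relationships. Right, in this post I will go through the various ways to use harmony in detail: harmony with Seurat, harmony with NMF, and the Python version, harmonypy. We will take them one at a time, and I hope you get something out of it.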


1. The R version, harmony and Seurat. This is probably the most widely used combination. Example code:

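library(Seurat)
library(harmony)
library(stringr)  # str_replace_all() is used below
# ... earlier preprocessing omitted; adapt it to your own data

ob.list <- list()  # holds one Seurat object per sample (skip if already created above)
numsap=1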
for (each in samples){
  pbmc <- readRDS(paste0(path,'/',each,'_QC.rds'))
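    # give each sample its own barcode suffix so cell names stay unique after merging;
    # '-1$' only matches the suffix at the end of the barcode
    colnames(pbmc@assays$RNA@counts) <- str_replace_all(colnames(pbmc@assays$RNA@counts), '-1$',paste0('-',numsap))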
    ob <- CreateSeuratObject(counts =pbmc@assays$RNA@counts,project =each,min.cells = min_cells)
    ob$stim <-each
    ob <- NormalizeData(ob)
#  ob <- FindVariableFeatures(ob,  selection.method = "vst",nfeatures = Nfeatures)
    numsap=numsap+1
    ob.list[[each]] <- ob
}
outdir = myoutdir

seurat.obj = merge(x=ob.list[[1]], y=ob.list[[2]])

if(length(ob.list) > 2){
  for (j in 3:length(ob.list)){
    seurat.obj = merge(x=seurat.obj, y=ob.list[[j]])
  }
}


#Nfeatures=length(rownames(x =seurat.obj))  # avoid running with all genes where possible
seurat.obj <- FindVariableFeatures(seurat.obj,  selection.method = "vst",nfeatures = Nfeatures)

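# orig.ident holds the project name set in CreateSeuratObject, i.e. the sample
seurat.obj<-ScaleData(seurat.obj,verbose = FALSE,split.by = 'orig.ident')

# Seurat v3+ takes the gene set through `features`
seurat.obj<-RunPCA(seurat.obj,features = VariableFeatures(seurat.obj), npcs = cca_use, verbose = FALSE)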

sample = compare

# run harmony batch correction
seurat.obj <- RunHarmony(seurat.obj,"stim", plot_convergence = TRUE)
harmony_embeddings <- Embeddings(seurat.obj, 'harmony')

harmony.data<-cbind(rownames(harmony_embeddings),harmony_embeddings)
colnames(harmony.data)[1]<-"Barcode"
write.table(harmony.data,file=paste0(outdir,'/Anchors/',prefix,'_harmony.csv'),row.names=F,col.names=T,sep=',',quote=F)

## run TSNE
seurat.obj = RunTSNE(seurat.obj, reduction = "harmony", dims = 1:cca_use)
# run UMAP
seurat.obj <- RunUMAP(seurat.obj, reduction = "harmony", dims = 1:cca_use,umap.method = "umap-learn",metric = "correlation")
seurat.obj <- FindNeighbors(seurat.obj, reduction = "harmony", dims = 1:cca_use)
# FindClusters works on the graph built by FindNeighbors, so it needs no reduction argument
seurat.obj <- FindClusters(seurat.obj, resolution = resolution)

#saveRDS(seurat.obj,file=paste0(outdir,'/',prefix,'_combined.rds'))

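Points to note

  • I read QC.rds files here, but you can just as well read a matrix, the 10X output files, and so on; don't feel bound to my approach.
  • The second argument to RunHarmony is stim here, which is simply the sample information. This varies from analysis to analysis: pass whichever metadata column holds your sample information.
  • Some parameters in the example are replaced by variables; anyone who has done single-cell analysis should find them easy to follow.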
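
The barcode renaming inside the loop above can be sketched in Python as follows. This is a minimal illustration rather than part of the original pipeline: the helper name `retag_barcodes` and the example barcodes are hypothetical, and it assumes every barcode ends in the `-1` suffix that Cell Ranger writes by default.

```python
import re

def retag_barcodes(barcodes, sample_index):
    """Replace a trailing '-1' with '-<sample_index>' so barcodes stay unique after merging.

    Hypothetical helper for illustration; barcodes without a trailing '-1' are returned unchanged.
    """
    return [re.sub(r"-1$", f"-{sample_index}", bc) for bc in barcodes]

# Two samples can share the same raw barcode; the per-sample suffix keeps them apart.
sample_a = retag_barcodes(["AAACCTGAGAAACCAT-1", "AAACCTGAGAAACCGC-1"], 1)
sample_b = retag_barcodes(["AAACCTGAGAAACCAT-1"], 2)
merged = sample_a + sample_b
```

Anchoring the pattern at the end of the string matters: an unanchored `-1` would also rewrite any earlier `-1` that happens to appear inside a cell name.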

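Now for the second part, NMF and harmony

A while ago a friend asked whether NMF and harmony can be used together. The answer is yes, and example code follows. It is worth knowing which single-cell analysis tools use NMF, although for our own analyses the basic NMF package is usually enough. First, let's see how NMF plugs into Seurat (this part is worth mastering):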

library(Seurat)
library(tidyverse)
library(NMF)
rm(list = ls())

## Create the Seurat object
pbmc <- Read10X_h5("pbmc.h5")
pbmc <- CreateSeuratObject(pbmc, project = "pbmc", min.cells = 3, min.features = 500)
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, percent.mt<20)
pbmc <- NormalizeData(pbmc) %>% FindVariableFeatures() %>% ScaleData(do.center = F)
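## Factorize the expression matrix of the highly variable genes
# PBMCs split roughly into T, B, NK, CD14+ Mono, CD16+ Mono, DC, Platelet and similar types; allowing for redundancy, set rank=10
vm <- pbmc@assays$RNA@scale.data
res <- nmf(vm, 10, method = "snmf/r", seed = 'nndsvd') 
runtime(res)
#     user   system  elapsed 
#1063.147   78.019 1139.831 

## Store the factorization in the Seurat object
pbmc <- RunPCA(pbmc, verbose = FALSE)  # a PCA reduction must exist before its object can be copied
pbmc@reductions$nmf <- pbmc@reductions$pca
pbmc@reductions$nmf@cell.embeddings <- t(coef(res))    
pbmc@reductions$nmf@feature.loadings <- basis(res)  

## Dimensionality reduction and clustering on the NMF result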
set.seed(219)
pbmc.nmf <- RunUMAP(pbmc, reduction = 'nmf', dims = 1:10) %>% 
FindNeighbors(reduction = 'nmf', dims = 1:10) %>% FindClusters()
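One question remains: why use NMF at all?
  • Compared with PCA, NMF gives results that are every bit as good but takes longer to run, so why use it? An important reason is that NMF factors are more interpretable: the genes contributing most to each factor largely represent the expression pattern of a cell type or cell state, and they are more representative than marker genes obtained from differential expression analysis.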
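
To make the interpretability point concrete, here is a small NumPy sketch that factorizes a toy matrix and lists the highest-loading genes of each factor. It rests on several assumptions: it uses Lee-Seung multiplicative updates because they fit in a few lines, which is not the "snmf/r" algorithm from the R code, and the function names, gene names, and data are all made up for illustration.

```python
import numpy as np

def nmf_multiplicative(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative (genes x cells) matrix as v ~ w @ h.

    Uses Lee-Seung multiplicative updates for the Frobenius loss, chosen for brevity;
    this is NOT the "snmf/r" algorithm used in the R example.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = v.shape
    w = rng.random((n_genes, rank)) + eps   # gene loadings (the "basis")
    h = rng.random((rank, n_cells)) + eps   # per-cell factor weights (the "coefficients")
    for _ in range(n_iter):
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    return w, h

def top_genes_per_factor(w, gene_names, n_top=3):
    """Return the n_top genes with the largest loading for each factor."""
    order = np.argsort(-w, axis=0)[:n_top]   # shape (n_top, rank), gene indices
    return [[gene_names[i] for i in order[:, k]] for k in range(w.shape[1])]

# Toy data: two hypothetical expression programs over six made-up genes.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
programs = np.array([[5, 4, 3, 0, 0, 0],
                     [0, 0, 0, 3, 4, 5]], dtype=float).T   # genes x 2
usage = np.random.default_rng(1).random((2, 40))           # 2 x cells
v = programs @ usage

w, h = nmf_multiplicative(v, rank=2)
relative_error = np.linalg.norm(v - w @ h) / np.linalg.norm(v)
top = top_genes_per_factor(w, genes)
```

Because both `w` and `h` stay non-negative, each column of `w` can be read directly as a set of genes that rise together, which is what makes the top-loading genes of a factor usable as a signature.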

Next comes the combined use of NMF and harmony. You can probably write it yourself by now, but here is the example code:

library(Seurat)
library(tidyverse)
library(NMF)
rm(list = ls())

## Create the Seurat object
pbmc <- Read10X_h5("pbmc.h5")
pbmc <- CreateSeuratObject(pbmc, project = "pbmc", min.cells = 3, min.features = 500)
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, percent.mt<20)
pbmc <- NormalizeData(pbmc) %>% FindVariableFeatures() %>% ScaleData(do.center = F)
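## Factorize the expression matrix of the highly variable genes
# PBMCs split roughly into T, B, NK, CD14+ Mono, CD16+ Mono, DC, Platelet and similar types; allowing for redundancy, set rank=10
vm <- pbmc@assays$RNA@scale.data
res <- nmf(vm, 10, method = "snmf/r", seed = 'nndsvd') 
runtime(res)
#     user   system  elapsed 
#1063.147   78.019 1139.831 

## Store the factorization in the Seurat object
pbmc <- RunPCA(pbmc, verbose = FALSE)  # a PCA reduction must exist before its object can be copied
pbmc@reductions$nmf <- pbmc@reductions$pca
pbmc@reductions$nmf@cell.embeddings <- t(coef(res))    
pbmc@reductions$nmf@feature.loadings <- basis(res)  

## Run harmony on the NMF factors. "stim" must be a metadata column holding the sample/batch label.
## The argument that selects the input reduction is `reduction` in older harmony releases and
## `reduction.use` in newer ones; check ?RunHarmony for your installed version.
pbmc <- RunHarmony(pbmc, "stim", reduction = "nmf", plot_convergence = TRUE)
harmony_embeddings <- Embeddings(pbmc, 'harmony')

harmony.data<-cbind(rownames(harmony_embeddings),harmony_embeddings)

## Dimensionality reduction and clustering on the harmony-corrected NMF factors
set.seed(219)
pbmc.nmf <- RunUMAP(pbmc, reduction = 'harmony', dims = 1:10) %>% 
FindNeighbors(reduction = 'harmony', dims = 1:10) %>% FindClusters()
#### Downstream processing is the same as before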

One thing to watch here: harmony reads the pca reduction by default. If you hit a slot error, overwrite the PCA with the NMF result.

The Python version, harmonypy

Installation is simple

pip install harmonypy

Example code
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.stats import pearsonr
import harmonypy as hm

meta_data = pd.read_csv("data/meta.tsv.gz", sep = "\t")
data_mat = pd.read_csv("data/pcs.tsv.gz", sep = "\t")
data_mat = np.array(data_mat)
vars_use = ['dataset']

# meta_data
#
#                  cell_id dataset  nGene  percent_mito cell_type
# 0    half_TGAAATTGGTCTAG    half   3664      0.017722    jurkat
# 1    half_GCGATATGCTGATG    half   3858      0.029228      t293
# 2    half_ATTTCTCTCACTAG    half   4049      0.015966    jurkat
# 3    half_CGTAACGACGAGAG    half   3443      0.020379    jurkat
# 4    half_ACGCCTTGTTTACC    half   2813      0.024774      t293
# ..                   ...     ...    ...           ...       ...
# 295  t293_TTACGTACGACACT    t293   4152      0.033997      t293
# 296  t293_TAGAATTGTTGGTG    t293   3097      0.021769      t293
# 297  t293_CGGATAACACCACA    t293   3157      0.020411      t293
# 298  t293_GGTACTGAGTCGAT    t293   2685      0.027846      t293
# 299  t293_ACGCTGCTTCTTAC    t293   3513      0.021240      t293

# [300 rows x 5 columns]

# data_mat[:5,:5]
#
# array([[ 0.0071695 , -0.00552724, -0.0036281 , -0.00798025,  0.00028931],
#        [-0.011333  ,  0.00022233, -0.00073589, -0.00192452,  0.0032624 ],
#        [ 0.0091214 , -0.00940727, -0.00106816, -0.0042749 , -0.00029096],
#        [ 0.00866286, -0.00514987, -0.0008989 , -0.00821785, -0.00126997],
#        [-0.00953977,  0.00222714, -0.00374373, -0.00028554,  0.00063737]])

ho = hm.run_harmony(data_mat, meta_data, vars_use)

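# Write the adjusted PCs to a new file.
# Z_corr is stored as components x cells in the harmonypy versions this post was written for,
# so transpose to get one row per cell (check ho.Z_corr.shape on your version).
res = pd.DataFrame(ho.Z_corr.T)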
res.columns = ['X{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/adj.tsv.gz", sep = "\t", index = False)

# Test 2
########################################################################

import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.stats import pearsonr
import harmonypy as hm

meta_data = pd.read_csv("data/pbmc_3500_meta.tsv.gz", sep = "\t")
data_mat = pd.read_csv("data/pbmc_3500_pcs.tsv.gz", sep = "\t")

from time import time

start = time()
ho = hm.run_harmony(data_mat, meta_data, ['donor'])
end = time()
print("elapsed {:.2f} seconds".format(end - start)) # 24 seconds for python, 5 seconds for Rcpp

res = pd.DataFrame(ho.Z_corr).T
res.columns = ['PC{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/pbmc_3500_pcs_harmonized_python.tsv.gz", sep = "\t", index = False)

harm = pd.read_csv("data/pbmc_3500_pcs_harmonized.tsv.gz", sep = "\t")

cors = []
for i in range(res.shape[1]):
    cors.append(pearsonr(res.iloc[:,i].values, harm.iloc[:,i].values))
print([np.round(x[0], 3) for x in cors])
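
The two tests above handle the orientation of `Z_corr` differently, and the final loop compares the Python and R results one component at a time. A NumPy-only sketch of both steps follows. The helper names are hypothetical, the input is random numbers standing in for real Harmony output, and it assumes the number of cells differs from the number of components so the orientation can be inferred from the shape.

```python
import numpy as np

def cells_by_components(z_corr, n_cells):
    """Return z_corr with one row per cell, transposing if it is stored as components x cells.

    Ambiguous (returned unchanged) when the matrix is square.
    """
    z = np.asarray(z_corr)
    if z.shape[0] == n_cells:
        return z
    if z.shape[1] == n_cells:
        return z.T
    raise ValueError(f"neither axis of shape {z.shape} matches n_cells={n_cells}")

def columnwise_pearson(a, b):
    """Pearson correlation between matching columns of two (cells x components) arrays."""
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))

# Stand-in for ho.Z_corr: 3 components x 50 cells of random numbers (not real Harmony output).
rng = np.random.default_rng(0)
fake_z_corr = rng.normal(size=(3, 50))
py_pcs = cells_by_components(fake_z_corr, n_cells=50)
r_pcs = py_pcs + rng.normal(scale=0.01, size=py_pcs.shape)  # pretend R result, nearly identical
cors = np.round(columnwise_pearson(py_pcs, r_pcs), 3)
```

Checking the shape against the number of rows in the metadata before writing the file is a cheap way to avoid saving a table whose columns are cells rather than components.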

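The Python version of harmony was built with scanpy in mind, so it slots into scanpy just as smoothly. The usage closely mirrors the R version above: pass the corresponding scanpy information to harmony.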

import harmonypy as hm
import pandas as pd
import scanpy as sc

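adata = sc.read_10x_mtx(
    'data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True)    

### Or read another format (h5ad, loom, csv, h5, ...) as needed. QC and preprocessing are skipped here; once PCA is done (sc.tl.pca), continue:

# Harmony corrects the PCA embedding, not the expression matrix, so pass adata.obsm['X_pca'] rather than adata.X
ho = hm.run_harmony(adata.obsm['X_pca'], adata.obs, ['Sample'])

# Transpose, assuming Z_corr is components x cells (check ho.Z_corr.shape on your version)
res = pd.DataFrame(ho.Z_corr.T)
res.columns = ['X{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/adj.tsv.gz", sep = "\t", index = False)

### Then, as before, add the result back to the scanpy object, for example:
adata.obsm['X_pca_harmony'] = ho.Z_corr.T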

Last is combining NMF with harmony in Python. I won't go into it; you should be able to write it by now.

Life is good, and better with you
