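Day eight of quarantine. The loneliness is slowly fading: music, friends, family, and games really do lift the mood. A wise man once said that whether we are happy does not depend on owning a house or a car, or on being with the girl we like, but on having good relationships. Anyway, in this post I'll walk through the different ways to use harmony: harmony with Seurat, harmony with NMF, and the Python port, harmonypy. We'll go through them one by one, and I hope you get something out of it.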
1. R version, harmony and Seurat. This is probably the most common usage. Example code:
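library(Seurat)
library(harmony)
library(stringr)  # provides str_replace_all
# ............ earlier preprocessing omitted; adapt it to your own data
ob.list <- list()  # one Seurat object per sample
numsap=1
for (each in samples){
pbmc <- readRDS(paste0(path,'/',each,'_QC.rds'))
# replace only the trailing "-1" so barcodes stay unique across samples
colnames(pbmc@assays$RNA@counts) <- str_replace_all(colnames(pbmc@assays$RNA@counts), '-1$',paste0('-',numsap))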
ob <- CreateSeuratObject(counts =pbmc@assays$RNA@counts,project =each,min.cells = min_cells)
ob$stim <-each
ob <- NormalizeData(ob)
# ob <- FindVariableFeatures(ob, selection.method = "vst",nfeatures = Nfeatures)
numsap=numsap+1
ob.list[[each]] <- ob
}
outdir = myoutdir
seurat.obj = merge(x=ob.list[[1]], y=ob.list[[2]])
if(length(ob.list) > 2){
for (j in 3:length(ob.list)){
seurat.obj = merge(x=seurat.obj, y=ob.list[[j]])
}
}
#Nfeatures=length(rownames(x =seurat.obj))  # try not to run on all genes
seurat.obj <- FindVariableFeatures(seurat.obj, selection.method = "vst",nfeatures = Nfeatures)
# orig.ident is the metadata column that CreateSeuratObject fills from `project`
seurat.obj<-ScaleData(seurat.obj,verbose = FALSE,split.by = 'orig.ident')
seurat.obj<-RunPCA(seurat.obj,features = VariableFeatures(seurat.obj), npcs = cca_use, verbose = FALSE)
sample = compare
# run harmony batch correction
seurat.obj <- RunHarmony(seurat.obj,"stim", plot_convergence = TRUE)
harmony_embeddings <- Embeddings(seurat.obj, 'harmony')
harmony.data<-cbind(rownames(harmony_embeddings),harmony_embeddings)
colnames(harmony.data)[1]<-"Barcode"
write.table(harmony.data,file=paste0(outdir,'/Anchors/',prefix,'_harmony.csv'),row.names=F,col.names=T,sep=',',quote=F)
## run TSNE
seurat.obj = RunTSNE(seurat.obj, reduction = "harmony", dims = 1:cca_use)
# run UMAP
seurat.obj <- RunUMAP(seurat.obj, reduction = "harmony", dims = 1:cca_use,umap.method = "umap-learn",metric = "correlation")
seurat.obj <- FindNeighbors(seurat.obj, reduction = "harmony", dims = 1:cca_use)
# FindClusters works on the graph built by FindNeighbors, so no reduction argument is needed
seurat.obj <- FindClusters(seurat.obj, resolution = resolution)
#saveRDS(seurat.obj,file=paste0(outdir,'/',prefix,'_combined.rds'))
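Things to note
- I read QC.rds files here, but you can just as well read a count matrix, 10X output files, and so on. Don't feel bound to my approach.
- The second argument of RunHarmony is "stim" in my code, which is simply the sample label. This varies from analysis to analysis: pass whichever metadata column holds your sample information.
- Some parameters in the example are replaced by variables. If you have done single-cell analysis before, they should be easy to follow.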
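The merging loop rewrites the `-1` barcode suffix to the sample index so that barcodes from different samples stay unique after `merge`. Here is a minimal Python sketch of that idea. The function name `retag_barcodes` and the barcodes are made up for illustration, and the regex is anchored with `$` on the assumption that only the trailing suffix should change.

```python
import re


def retag_barcodes(barcodes, sample_index):
    """Replace a trailing '-1' on each barcode with '-<sample_index>'."""
    # The '$' anchor limits the substitution to the end of the barcode,
    # so a '-1' that happens to occur earlier in the name is left alone.
    return [re.sub(r"-1$", f"-{sample_index}", bc) for bc in barcodes]


# Hypothetical barcodes: two samples that share the same sequences.
sample_a = retag_barcodes(["AAACCTGAGT-1", "TTTGTCAAGC-1"], 1)
sample_b = retag_barcodes(["AAACCTGAGT-1", "TTTGTCAAGC-1"], 2)
merged = sample_a + sample_b
print(merged)
```

Without the renaming, the two samples would contribute identical cell names and the merged object could no longer tell the cells apart.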
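Now for part two: NMF and harmony
A while ago a friend asked whether NMF and harmony can be used together. The answer is yes, and here is some example code. It is worth knowing which single-cell analysis tools use NMF under the hood, though for our own analyses the basic NMF package is usually enough. First, let's see how NMF plugs into Seurat (this is the part you really need to master):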
library(Seurat)
library(tidyverse)
library(NMF)
rm(list = ls())
## Create the Seurat object
pbmc <- Read10X_h5("pbmc.h5")
pbmc <- CreateSeuratObject(pbmc, project = "pbmc", min.cells = 3, min.features = 500)
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, percent.mt<20)
pbmc <- NormalizeData(pbmc) %>% FindVariableFeatures() %>% ScaleData(do.center = F)
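## Factorize the expression matrix of the highly variable genes
# PBMCs roughly split into T, B, NK, CD14+ Mono, CD16+ Mono, DC, Platelet and so on; allowing for some redundancy, set rank=10
vm <- pbmc@assays$RNA@scale.data
res <- nmf(vm, 10, method = "snmf/r", seed = 'nndsvd')
runtime(res)
#     user   system  elapsed
# 1063.147   78.019 1139.831
## Store the factorization in the Seurat object
pbmc <- RunPCA(pbmc, verbose = FALSE)  # creates the pca reduction that is copied below
pbmc@reductions$nmf <- pbmc@reductions$pca
pbmc@reductions$nmf@cell.embeddings <- t(coef(res))
pbmc@reductions$nmf@feature.loadings <- basis(res)
## Dimensionality reduction and clustering on the NMF result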
set.seed(219)
pbmc.nmf <- RunUMAP(pbmc, reduction = 'nmf', dims = 1:10) %>%
FindNeighbors(reduction = 'nmf', dims = 1:10) %>% FindClusters()
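This raises a question: why use NMF at all?
- Compared with PCA, NMF gives results that are every bit as good, but it takes much longer to run. So why use it? One important reason is that NMF factors are easier to interpret: the genes that contribute most to each factor largely represent the expression pattern of a particular cell type or cell state, and they tend to be more representative than the marker genes found by differential expression analysis.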
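To make the interpretability point concrete, here is a small NumPy sketch that factorizes a toy genes x cells matrix and lists the top-loading genes of each factor. It is only an illustration: it uses plain multiplicative updates rather than the `snmf/r` method used in the R code, and the helper names, gene names, and toy data are all made up.

```python
import numpy as np


def nmf_multiplicative(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Approximate a non-negative matrix V (genes x cells) as W @ H."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps  # gene loadings per factor
    H = rng.random((rank, V.shape[1])) + eps  # factor usage per cell
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def top_genes(W, gene_names, n_top=3):
    """Return the n_top genes with the largest loading for each factor."""
    order = np.argsort(-W, axis=0)[:n_top]  # shape (n_top, rank)
    return [[gene_names[i] for i in order[:, k]] for k in range(W.shape[1])]


# Toy data: genes g1-g3 are high in cells 0-3, genes g4-g6 in cells 4-7.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
V = np.full((6, 8), 0.1)
V[:3, :4] += 5.0
V[3:, 4:] += 5.0

W, H = nmf_multiplicative(V, rank=2)
programs = top_genes(W, genes)
print(programs)
```

Each factor's top genes line up with one of the two built-in expression programs, which is the kind of readout that makes NMF factors easier to annotate than principal components.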
Next comes combining NMF with harmony. You can probably write it yourself by now, but here is the example code:
library(Seurat)
library(tidyverse)
library(NMF)
library(harmony)
rm(list = ls())
## Create the Seurat object
pbmc <- Read10X_h5("pbmc.h5")
pbmc <- CreateSeuratObject(pbmc, project = "pbmc", min.cells = 3, min.features = 500)
pbmc$percent.mt <- PercentageFeatureSet(pbmc, pattern = "^MT-")
pbmc <- subset(pbmc, percent.mt<20)
pbmc <- NormalizeData(pbmc) %>% FindVariableFeatures() %>% ScaleData(do.center = F)
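## Factorize the expression matrix of the highly variable genes
# PBMCs roughly split into T, B, NK, CD14+ Mono, CD16+ Mono, DC, Platelet and so on; allowing for some redundancy, set rank=10
vm <- pbmc@assays$RNA@scale.data
res <- nmf(vm, 10, method = "snmf/r", seed = 'nndsvd')
runtime(res)
#     user   system  elapsed
# 1063.147   78.019 1139.831
## Store the factorization in the Seurat object
pbmc <- RunPCA(pbmc, verbose = FALSE)  # creates the pca reduction that is copied below
pbmc@reductions$nmf <- pbmc@reductions$pca
pbmc@reductions$nmf@cell.embeddings <- t(coef(res))
pbmc@reductions$nmf@feature.loadings <- basis(res)
# harmony reads the "pca" reduction by default, so overwrite it with the NMF result
pbmc@reductions$pca <- pbmc@reductions$nmf
# "stim" must be a metadata column holding the sample label of every cell
pbmc <- RunHarmony(pbmc,"stim", plot_convergence = TRUE)
harmony_embeddings <- Embeddings(pbmc, 'harmony')
harmony.data<-cbind(rownames(harmony_embeddings),harmony_embeddings)
## Dimensionality reduction and clustering on the harmony-corrected NMF factors
set.seed(219)
pbmc.nmf <- RunUMAP(pbmc, reduction = 'harmony', dims = 1:10) %>%
FindNeighbors(reduction = 'harmony', dims = 1:10) %>% FindClusters()
#### Downstream processing is the same as before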
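One thing to watch here: harmony reads the pca reduction by default. If you hit a slot error, overwrite the PCA result with the NMF result.
The Python version, harmonypy
Installation is simple:
pip install harmonypy
Example code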
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.stats import pearsonr
import harmonypy as hm
meta_data = pd.read_csv("data/meta.tsv.gz", sep = "\t")
data_mat = pd.read_csv("data/pcs.tsv.gz", sep = "\t")
data_mat = np.array(data_mat)
vars_use = ['dataset']
# meta_data
#
# cell_id dataset nGene percent_mito cell_type
# 0 half_TGAAATTGGTCTAG half 3664 0.017722 jurkat
# 1 half_GCGATATGCTGATG half 3858 0.029228 t293
# 2 half_ATTTCTCTCACTAG half 4049 0.015966 jurkat
# 3 half_CGTAACGACGAGAG half 3443 0.020379 jurkat
# 4 half_ACGCCTTGTTTACC half 2813 0.024774 t293
# .. ... ... ... ... ...
# 295 t293_TTACGTACGACACT t293 4152 0.033997 t293
# 296 t293_TAGAATTGTTGGTG t293 3097 0.021769 t293
# 297 t293_CGGATAACACCACA t293 3157 0.020411 t293
# 298 t293_GGTACTGAGTCGAT t293 2685 0.027846 t293
# 299 t293_ACGCTGCTTCTTAC t293 3513 0.021240 t293
# [300 rows x 5 columns]
# data_mat[:5,:5]
#
# array([[ 0.0071695 , -0.00552724, -0.0036281 , -0.00798025, 0.00028931],
# [-0.011333 , 0.00022233, -0.00073589, -0.00192452, 0.0032624 ],
# [ 0.0091214 , -0.00940727, -0.00106816, -0.0042749 , -0.00029096],
# [ 0.00866286, -0.00514987, -0.0008989 , -0.00821785, -0.00126997],
# [-0.00953977, 0.00222714, -0.00374373, -0.00028554, 0.00063737]])
ho = hm.run_harmony(data_mat, meta_data, vars_use)
# Write the adjusted PCs to a new file.
res = pd.DataFrame(ho.Z_corr)
res.columns = ['X{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/adj.tsv.gz", sep = "\t", index = False)
# Test 2
########################################################################
import pandas as pd
import numpy as np
from scipy.cluster.vq import kmeans
from scipy.stats import pearsonr
import harmonypy as hm
meta_data = pd.read_csv("data/pbmc_3500_meta.tsv.gz", sep = "\t")
data_mat = pd.read_csv("data/pbmc_3500_pcs.tsv.gz", sep = "\t")
from time import time
start = time()
ho = hm.run_harmony(data_mat, meta_data, ['donor'])
end = time()
print("elapsed {:.2f} seconds".format(end - start)) # 24 seconds for python, 5 seconds for Rcpp
res = pd.DataFrame(ho.Z_corr).T
res.columns = ['PC{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/pbmc_3500_pcs_harmonized_python.tsv.gz", sep = "\t", index = False)
harm = pd.read_csv("data/pbmc_3500_pcs_harmonized.tsv.gz", sep = "\t")
cors = []
for i in range(res.shape[1]):
    cors.append(pearsonr(res.iloc[:,i].values, harm.iloc[:,i].values))
print([np.round(x[0], 3) for x in cors])
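The second test checks that the Python output matches the R output by correlating them one PC at a time. The same check can be sketched in vectorized form as follows. This is only an illustration: `columnwise_pearson` is a made-up helper and the two matrices are random stand-ins rather than real harmony output. The explicit shape check is there because a transposed matrix is an easy mistake to make at this step.

```python
import numpy as np


def columnwise_pearson(a, b):
    """Pearson r between matching columns of two (cells x PCs) arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    a_c = a - a.mean(axis=0)  # center every column
    b_c = b - b.mean(axis=0)
    num = (a_c * b_c).sum(axis=0)
    den = np.sqrt((a_c ** 2).sum(axis=0) * (b_c ** 2).sum(axis=0))
    return num / den


rng = np.random.default_rng(0)
r_result = rng.normal(size=(100, 5))  # stand-in for the R output
py_result = r_result + rng.normal(scale=0.01, size=(100, 5))  # near-identical stand-in
cors = columnwise_pearson(py_result, r_result)
print(np.round(cors, 3))
```

Correlations close to 1 for every column indicate that the two implementations agree up to numerical noise.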
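The Python version of harmony was built with scanpy in mind, so it plugs into scanpy just as smoothly. The code looks much like the R version above: hand harmony the matching pieces of the scanpy object and you are done.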
import harmonypy as hm
import pandas as pd
import scanpy as sc
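adata = sc.read_10x_mtx(
    'data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True)
### Or read another format (h5ad, loom, csv, h5, ...) depending on your needs. The QC steps are skipped here; run everything up to and including PCA, then:
# pass the PCA embedding, not the expression matrix; 'Sample' must be a column of adata.obs
ho = hm.run_harmony(adata.obsm['X_pca'], adata.obs, ['Sample'])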
res = pd.DataFrame(ho.Z_corr)
res.columns = ['X{}'.format(i + 1) for i in range(res.shape[1])]
res.to_csv("data/adj.tsv.gz", sep = "\t", index = False)
### Then, as before, just add the result back to the scanpy object
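Before storing the corrected PCs in the scanpy object, it is worth checking their orientation: the first example above writes `ho.Z_corr` as is, while the second transposes it, and `adata.obsm` expects one row per cell. Here is a minimal NumPy sketch of such a check. The helper `as_cells_by_pcs` is made up for illustration, and the matrix is a random stand-in for `ho.Z_corr`.

```python
import numpy as np


def as_cells_by_pcs(z_corr, n_cells):
    """Return the corrected embedding with one row per cell."""
    z = np.asarray(z_corr)
    if z.ndim != 2:
        raise ValueError("expected a 2-D matrix")
    if z.shape[0] == z.shape[1]:
        raise ValueError("square matrix: orientation is ambiguous")
    if z.shape[0] == n_cells:
        return z      # already cells x PCs
    if z.shape[1] == n_cells:
        return z.T    # PCs x cells, so transpose
    raise ValueError(f"no axis of {z.shape} matches n_cells={n_cells}")


# Stand-in for ho.Z_corr: 20 PCs x 300 cells.
fake_z_corr = np.random.default_rng(1).normal(size=(20, 300))
embedding = as_cells_by_pcs(fake_z_corr, n_cells=300)
print(embedding.shape)
```

With a real object, one option is to assign the result to `adata.obsm['X_pca_harmony']` (the key name is a convention, not a requirement) and then build the neighbor graph from it with `sc.pp.neighbors(adata, use_rep='X_pca_harmony')`.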
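Last comes combining NMF with harmony in Python. I won't go into it here, since you should all be able to write it by now.
Life is good, and it's better with you.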