夏至:炎热的夏天来临。
随着单细胞测序技术的成熟,越来越多的研究者选择应用该技术来阐释手上的生物学问题。同时单细胞也不再是单样本单物种单器官的技术,往往会用到多样本整合分析的技术,这方面Seurat团队是最值得关注的。他们提出了一套用于单细胞样本整合分析的算法,Comprehensive integration of single cell data,并打包成Rpackages可以说是很贴心了。
其整合单细胞数据的核心函数之一就是:FindIntegrationAnchors {Seurat}
如其名称所指,是用来Finds the integration anchors的,他是如何做到的呢?
?FindIntegrationAnchors
但是在整合的时候有时候整合不成功,特别是应用之前的单细胞技术,识别的细胞较少的时候,如可能会报错:
Filtering Anchors
Error in nn2(data = cn.data2[nn.cells2, ], query = cn.data1[nn.cells1, :
Cannot find more nearest neighbours than there are points
github上有这个问题的讨论:https://github.com/satijalab/seurat/issues/997
有用的回答是这样的:
I experienced the same problem when integrating very heterogeneous datasets, and it seems to be related to the size of k.filter
parameter.In my case, lowering the default value of 200 to 150 did the job. However, I am still trying to figure out whether the results are meaningful, and the underlying biological rationale of it, so any explanation would be welcome!
那么k.filter
的影响到底是多大呢?我们来试一下。
我用Unravelling subclonal heterogeneity and aggressive disease states in TNBC through single-cell RNA-seq这篇文章的数据集:GSE118390.
library(R.utils)
library(tidyverse)
library(reticulate)
use_python("E:\\conda\\python.exe")
gunzip("D:\\Users\\Administrator\\Desktop\\RStudio\\con-cluster\\GSE118389_counts_rsem.txt.gz", remove=FALSE)
medata<-read.table("D:\\Users\\Administrator\\Desktop\\RStudio\\con-cluster\\GSE118389_counts_rsem.txt")
dim(medata)
table(str_split(colnames(medata),pattern="_",simplify=T)[,1])
PT039 PT058 PT081 PT084 PT089 PT126
341 96 288 286 333 190
Seurat基本流程:
library(Seurat)
pbmc <- CreateSeuratObject(counts = medata, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
pbmc <- ScaleData(pbmc,features=VariableFeatures(pbmc))
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
pbmc <- RunUMAP(pbmc, dims = 1:10)
pbmc <- RunTSNE(pbmc, dims = 1:10)
?DimPlot
table([email protected]$orig.ident)
PT039 PT058 PT081 PT089
127 58 57 237
DimPlot(pbmc, reduction = "umap",label = TRUE,group.by="orig.ident")
这就是没有整合的时候,四个样本的降维结果。
下面我们用不同的k.filter
参数来试一下,
pancreas.list <- SplitObject(pbmc, split.by = "orig.ident")
pancreas.list <- pancreas.list[c("PT039", "PT058", "PT081", "PT089")]
for (i in 1:length(pancreas.list)) {
pancreas.list[[i]] <- NormalizeData(pancreas.list[[i]], verbose = FALSE)
pancreas.list[[i]] <- FindVariableFeatures(pancreas.list[[i]], selection.method = "vst",
nfeatures = 2000, verbose = FALSE)
}
#reference.list <- pancreas.list[c("PT081", "PT089")]
SelectIntegrationFeatures(object.list = pancreas.list)
#?FindIntegrationAnchors
pancreas.anchors <- FindIntegrationAnchors(object.list = pancreas.list, dims = 1:20,k.anchor = 5,k.filter = 30)
#?IntegrateData
pancreas.integrated <- IntegrateData(anchorset = pancreas.anchors, dims = 1:20)
Merging dataset 3 into 1
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 2 into 1 3
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
Merging dataset 4 into 1 3 2
Extracting anchors for merged samples
Finding integration vectors
Finding integration vector weights
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Integrating data
pancreas.integrated <- ScaleData(pancreas.integrated,features=VariableFeatures(pbmc))
pancreas.integrated <- RunPCA(pancreas.integrated, features = VariableFeatures(object = pbmc))
pancreas.integrated <- FindNeighbors(pancreas.integrated, dims = 1:10)
pancreas.integrated <- FindClusters(pancreas.integrated, resolution = 0.5)
pancreas.integrated <- RunUMAP(pancreas.integrated, dims = 1:10)
pancreas.integrated <- RunTSNE(pancreas.integrated, dims = 1:10)
DimPlot(pancreas.integrated, reduction = "umap",label = TRUE,group.by="orig.ident")+ggtitle("k.filter = 30")
可见,seurat在整合多样本的时候并不会自动为研究者提供合适的参数,我们也不应这样要求他们。需要注意的是default
虽然是用的最多的,并不一定是最优的。
还有一种方式merge()即简单地讲多个数据集放到一起,并不运行整合分析。
pancreas.list <- SplitObject(pbmc, split.by = "orig.ident")
pancreas.list <- pancreas.list[c("PT039", "PT058", "PT081", "PT089")]
mpancreas.list<-pancreas.list[c("PT039", "PT058", "PT081")]
mymerg<-merge(pancreas.list$PT089,mpancreas.list)
mymerg <- NormalizeData(mymerg, normalization.method = "LogNormalize", scale.factor = 10000)
mymerg <- FindVariableFeatures(mymerg, selection.method = "vst", nfeatures = 2000)
mymerg <- ScaleData(mymerg,features=VariableFeatures(mymerg))
mymerg <- RunPCA(mymerg, features = VariableFeatures(object = mymerg))
mymerg <- FindNeighbors(mymerg, dims = 1:10)
mymerg <- FindClusters(mymerg, resolution = 0.5)
mymerg <- RunUMAP(mymerg, dims = 1:10)
mymerg <- RunTSNE(mymerg, dims = 1:10)
pm<-DimPlot(mymerg, reduction = "umap",label = TRUE,group.by="orig.ident")+ggtitle("merge")
CombinePlots(plots = list(p0,pm),legend="bottom")
原文链接:https://www.jianshu.com/p/ebc328f9fb73