捋一遍单细胞测序分群鉴定——基于seurat官网和网上资料~

​在看完seurat官方分群教程后

https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html

感觉又收获很多东西,决定写下笔记,更好地去了解这个包,和单细胞测序数据的分析原理。图码较多~

数据为10x genomics官方测序数据(Illumina NextSeq 500),人外周血单核细胞,经过cell ranger过滤后,载入seurat 包进行分析,最终为13714genes*2638cells,分为9个亚群结合数据库分别鉴定为"Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet"。下面将根据官方教程和网上资料进行解析。

一、创建seurat对象、数据过滤出图

.矩阵中的值表示0(没有检测到分子)。由于scRNA-seq矩阵中的大多数值为0,因此Seurat尽可能使用稀疏矩阵表示。这样可以为Drop-seq / inDrop / 10x数据节省大量内存并节省速度。

```

# 指定(你的)数据所在目录;

setwd("")

data_dir <- ""

#载入seurat包;

library(dplyr)

library(Seurat)

#读入pbmc数据;

pbmc.data <- Read10X(data.dir = data_dir)

#查看稀疏矩阵的维度,即基因数和细胞数;

dim(pbmc.data)

#预览稀疏矩阵(1~10行,1~6列),. 表示0;

pbmc.data[1:10,1:6]

#2.创建Seurat对象与数据过滤

# 在使用CreateSeuratObject()创建对象的同时,过滤数据质量差的细胞。保留在>=3个细胞中表达的基因;保留能检测到>=200个基因的细胞。

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc2700", min.cells = 3, min.features = 200)

pbmc

# Lets examine a few genes in the first thirty cells

#pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]

#看一般矩阵和稀疏矩阵的大小

dense.size <- object.size(as.matrix(pbmc.data))

dense.size

sparse.size <- object.size(pbmc.data)

sparse.size

# 计算每个细胞的线粒体基因转录本数的百分比(%),使用[[ ]] 操作符存放到metadata中;

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

# Show QC metrics for the first 5 cells

head([email protected], 5)

# Visualize QC metrics as a violin plot

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")

plot1

plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

plot2

CombinePlots(plots = list(plot1, plot2))

```

过滤前基因、转录本和线粒体比例分布

过滤前转录本和线粒体基因、基因数散点图(可以发现基因数和转录本数成正相关,系数为0.95)

过滤后分布,可以发现基因数在200-2500,线粒体基因数在5%以下

二、数据标准化,pca降维,爪图肘图确定维度

```

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

# 对过滤后的 QC metrics进行可视化(绘制散点图);

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")+ NoLegend()

plot1

plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")+ NoLegend()

plot2

CombinePlots(plots = list(plot1, plot2))

#表达量数据标准化:LogNormalize的算法:A = log( 1 + ( UMIA ÷ UMITotal ) × 10000 )

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

#pbmc <- NormalizeData(pbmc)

#鉴定表达高变基因(2000个),用于下游分析,如PCA;

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# 提取表达量变变化最高的10个基因;

top10 <- head(VariableFeatures(pbmc), 10)

top10

plot3 <- VariableFeaturePlot(pbmc)

plot4 <- LabelPoints(plot = plot3, points = top10, repel = TRUE,xnudge=0,ynudge=0)

plot4

#3. PCA分析;

#PCA分析数据准备,使用ScaleData()进行数据归一化;

#默认只是标准化高变基因(2000个),速度更快,不影响PCA和分群,但影响热图的绘制;

pbmc <- ScaleData(pbmc,vars.to.regress = "percent.mt")

#而对所有基因进行标准化的方法如下:

#all.genes <- rownames(pbmc)

#pbmc <- ScaleData(pbmc,features = all.genes,vars.to.regress = "percent.mt")

#线性降维(PCA),默认用高变基因集,但也可通过features参数自己指定;

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

# 检查PCA 分群结果, 这里只展示前12个PC,每个PC只显示3个基因;

print(pbmc[["pca"]], dims = 1:12, nfeatures = 3)

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

#绘制pca散点图;

DimPlot(pbmc, reduction = "pca")

#画前2个主成分的热图;

DimHeatmap(pbmc, dims = 1:2, cells = 500, balanced = TRUE)

#4.确定数据集的维度个数;

#方法1:Jackstraw置换检验算法;重复取样(原数据的1%),重跑PCA,鉴定p-value较小的PC;计算‘null distribution’(即零假设成立时)时的基因scores;

pbmc <- JackStraw(pbmc, num.replicate = 100)

pbmc <- ScoreJackStraw(pbmc, dims = 1:20)

JackStrawPlot(pbmc, dims = 1:15)

#方法2:肘部图(碎石图),基于每个主成分对方差解释率的排名;

ElbowPlot(pbmc)

```

高可变基因集,横坐标为基因平均表达水平,纵坐标为基因表达标准差

前两个主成分所含的基因集

前两个主成分基因集热图

爪图以确定分多少个亚群

肘状图(到10时平缓,说明再分主成分意义不大)

三、非线性降维、差异基因热图

```

#基于PCA空间中的欧氏距离计算nearest neighbor graph,优化任意两个细胞间的距离权重(输入上一步得到的PC维数);

pbmc <- FindNeighbors(pbmc, dims = 1:10)

#接着优化模型,resolution参数决定下游聚类分析得到的分群数,对于3K左右的细胞,设为0.4-1.2 能得到较好的结果;如果数据量增大,该参数也应该适当增大;

pbmc <- FindClusters(pbmc, resolution = 0.5)

#使用Idents()函数可查看不同细胞的分群;

head(Idents(pbmc), 7)

#Seurat提供了几种非线性降维的方法进行数据可视化(在低维空间把相似的细胞聚在一起),比如UMAP和t-SNE,;

#运行UMAP需要先安装'umap-learn'包,这里不做介绍;

#umap聚类

pbmc <- RunUMAP(pbmc, dims = 1:10)

DimPlot(pbmc, reduction = "umap")

saveRDS(pbmc, file = "ump.rds")

###tsne聚类

pbmc <- RunTSNE(pbmc, dims = 1:10)

# 用DimPlot()函数绘制散点图, reduction = "tsne",指定绘制类型;如果不指定,默认先从搜索 umap,

#然后 tsne, 再然后 pca;也可以直接使用这3个函数PCAPlot()、TSNEPlot()、UMAPPlot()

# cols,pt.size 分别调整分组颜色和点的大小;

tsneplot<-TSNEPlot(pbmc,label = TRUE, pt.size = 0.5)+ NoLegend()

tsneplot

#绘制Marker 基因的基因映射图;

FeaturePlot(pbmc, features = c("MS4A1", "CD14"))

# 查找差异基因(biomarkers),可以使用FindAllMarkers自动查找差异基因,也可以逐一查找:方法是将当前cluster的细胞作为一组,其他cluster的细胞作为另一组,然后进行差异分析;

library(ggplot2)

# 19s计算1个cluster;min.pct=0.25 (minimum percentage)两个组中至少25%的细胞能检测到这个基因;用来过滤基因,加快运算速度;

###ident.1参数为比较的亚群如,亚群1和所有亚群

cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)

head(cluster1.markers, n = 7)

#查找cluster 5 的差异基因(与clusters 0 和 3相比);

cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)

head(cluster5.markers, n = 5)

#差异分析的方法设置:默认秩和检验,也可设为ROC、t、DESeq2等多种方法;

cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)

###查找全部marker

markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25,return.thresh = 0.01)

markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)

# find markers for every cluster compared to all remaining cells, report only the positive ones

#pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)

#pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)

#取top10的基因,用差异倍数排序; 备注:过滤过程用的是dplyr包的命令;%>%管道函数传输执行,类似于linux|

top10 <- markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)

top10[c("cluster","gene")]

write.table(top10, file ="top10.txt", sep =" ", row.names =TRUE, col.names =TRUE, quote =TRUE)

# 用top10基因绘制热图,不画图例

DoHeatmap(pbmc, features = top10$gene) + NoLegend()

#绘制基因小提琴图

VlnPlot(pbmc, features = c("MS4A1", "CD79A"))

# you can plot raw counts as well

VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = T)

FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",

                              "CD8A"))

#5.为分群重新指定细胞类型;

new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono",

                    "NK", "DC", "Platelet")

#将pbmc的水平属性负值给new.cluster.ids的names属性;

names(new.cluster.ids)

#NULL

levels(pbmc)

#[1] "0" "1" "2" "3" "4" "5" "6" "7" "8"

names(new.cluster.ids) <- levels(pbmc)

names(new.cluster.ids)

#[1] "0" "1" "2" "3" "4" "5" "6" "7" "8"

new.cluster.ids

pbmc <- RenameIdents(pbmc, new.cluster.ids)

#绘制tsne图(修改标签后的);

tsneplot2<-TSNEPlot(pbmc,label = TRUE, pt.size = 0.5)+ NoLegend()

tsneplot2

#画umap图

DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

DimPlot

saveRDS(pbmc, file = "ump_end.rds")

```

umap降维结果图(图放大后点太小了,没调了,这里跟官方有点差异,因为坐标轴的原因,但是问题不大~)

t-sne降维聚类图

标记基因映射图

标记基因小提琴图

top10差异基因热图(这里绘图时有些基因没显示,The following features were omitted as they were not found in the scale.data slot for the RNA assay: CD8A, VPREB3, CD40LG, PIK3IP1, PRKCQ-AS1, NOSIP, LEF1, CD3E, CD3D, CCR7, LDHB, RPS3A,它们在scale.data插槽里没被发现,因此被忽略了)

通过基因热图,结合cellmarker数据库和相关文献,给分群后的图加上细胞名称(此为seurat官方图片)

亚群加名

下次绘图点要调大些~

你可能感兴趣的:(捋一遍单细胞测序分群鉴定——基于seurat官网和网上资料~)