在看完seurat官方分群教程后
https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html
感觉又收获很多东西,决定写下笔记,更好地去了解这个包,和单细胞测序数据的分析原理。图码较多~
数据为10x genomics官方测序数据(Illumina NextSeq 500),人外周血单核细胞,经过cell ranger过滤后,载入seurat 包进行分析,最终为13714genes*2638cells,分为9个亚群结合数据库分别鉴定为"Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet"。下面将根据官方教程和网上资料进行解析。
一、创建seurat对象、数据过滤出图
.矩阵中的值表示0(没有检测到分子)。由于scRNA-seq矩阵中的大多数值为0,因此Seurat尽可能使用稀疏矩阵表示。这样可以为Drop-seq / inDrop / 10x数据节省大量内存并节省速度。
```
# 指定(你的)数据所在目录;
setwd("")
data_dir <- ""
#载入seurat包;
library(dplyr)
library(Seurat)
#读入pbmc数据;
pbmc.data <- Read10X(data.dir = data_dir)
#查看稀疏矩阵的维度,即基因数和细胞数;
dim(pbmc.data)
#预览稀疏矩阵(1~10行,1~6列),. 表示0;
pbmc.data[1:10,1:6]
#2.创建Seurat对象与数据过滤
# 在使用CreateSeuratObject()创建对象的同时,过滤数据质量差的细胞。保留在>=3个细胞中表达的基因;保留能检测到>=200个基因的细胞。
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc2700", min.cells = 3, min.features = 200)
pbmc
# Lets examine a few genes in the first thirty cells
#pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
#看一般矩阵和稀疏矩阵的大小
dense.size <- object.size(as.matrix(pbmc.data))
dense.size
sparse.size <- object.size(pbmc.data)
sparse.size
# 计算每个细胞的线粒体基因转录本数的百分比(%),使用[[ ]] 操作符存放到metadata中;
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
# Show QC metrics for the first 5 cells
head([email protected], 5)
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot1
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot2
CombinePlots(plots = list(plot1, plot2))
```
过滤前基因、转录本和线粒体比例分布
过滤前转录本和线粒体基因、基因数散点图(可以发现基因数和转录本数成正相关,系数为0.95)
过滤后分布,可以发现基因数在200-2500,线粒体基因数在5%以下
二、数据标准化,pca降维,爪图肘图确定维度
```
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
# 对过滤后的 QC metrics进行可视化(绘制散点图);
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")+ NoLegend()
plot1
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")+ NoLegend()
plot2
CombinePlots(plots = list(plot1, plot2))
#表达量数据标准化:LogNormalize的算法:A = log( 1 + ( UMIA ÷ UMITotal ) × 10000 )
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
#pbmc <- NormalizeData(pbmc)
#鉴定表达高变基因(2000个),用于下游分析,如PCA;
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# 提取表达量变变化最高的10个基因;
top10 <- head(VariableFeatures(pbmc), 10)
top10
plot3 <- VariableFeaturePlot(pbmc)
plot4 <- LabelPoints(plot = plot3, points = top10, repel = TRUE,xnudge=0,ynudge=0)
plot4
#3. PCA分析;
#PCA分析数据准备,使用ScaleData()进行数据归一化;
#默认只是标准化高变基因(2000个),速度更快,不影响PCA和分群,但影响热图的绘制;
pbmc <- ScaleData(pbmc,vars.to.regress = "percent.mt")
#而对所有基因进行标准化的方法如下:
#all.genes <- rownames(pbmc)
#pbmc <- ScaleData(pbmc,features = all.genes,vars.to.regress = "percent.mt")
#线性降维(PCA),默认用高变基因集,但也可通过features参数自己指定;
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# 检查PCA 分群结果, 这里只展示前12个PC,每个PC只显示3个基因;
print(pbmc[["pca"]], dims = 1:12, nfeatures = 3)
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
#绘制pca散点图;
DimPlot(pbmc, reduction = "pca")
#画前2个主成分的热图;
DimHeatmap(pbmc, dims = 1:2, cells = 500, balanced = TRUE)
#4.确定数据集的维度个数;
#方法1:Jackstraw置换检验算法;重复取样(原数据的1%),重跑PCA,鉴定p-value较小的PC;计算‘null distribution’(即零假设成立时)时的基因scores;
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
#方法2:肘部图(碎石图),基于每个主成分对方差解释率的排名;
ElbowPlot(pbmc)
```
高可变基因集,横坐标为基因平均表达水平,纵坐标为基因表达标准差
前两个主成分所含的基因集
前两个主成分基因集热图
爪图以确定分多少个亚群
肘状图(到10时平缓,说明再分主成分意义不大)
三、非线性降维、差异基因热图
```
#基于PCA空间中的欧氏距离计算nearest neighbor graph,优化任意两个细胞间的距离权重(输入上一步得到的PC维数);
pbmc <- FindNeighbors(pbmc, dims = 1:10)
#接着优化模型,resolution参数决定下游聚类分析得到的分群数,对于3K左右的细胞,设为0.4-1.2 能得到较好的结果;如果数据量增大,该参数也应该适当增大;
pbmc <- FindClusters(pbmc, resolution = 0.5)
#使用Idents()函数可查看不同细胞的分群;
head(Idents(pbmc), 7)
#Seurat提供了几种非线性降维的方法进行数据可视化(在低维空间把相似的细胞聚在一起),比如UMAP和t-SNE,;
#运行UMAP需要先安装'umap-learn'包,这里不做介绍;
#umap聚类
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
saveRDS(pbmc, file = "ump.rds")
###tsne聚类
pbmc <- RunTSNE(pbmc, dims = 1:10)
# 用DimPlot()函数绘制散点图, reduction = "tsne",指定绘制类型;如果不指定,默认先从搜索 umap,
#然后 tsne, 再然后 pca;也可以直接使用这3个函数PCAPlot()、TSNEPlot()、UMAPPlot()
# cols,pt.size 分别调整分组颜色和点的大小;
tsneplot<-TSNEPlot(pbmc,label = TRUE, pt.size = 0.5)+ NoLegend()
tsneplot
#绘制Marker 基因的基因映射图;
FeaturePlot(pbmc, features = c("MS4A1", "CD14"))
# 查找差异基因(biomarkers),可以使用FindAllMarkers自动查找差异基因,也可以逐一查找:方法是将当前cluster的细胞作为一组,其他cluster的细胞作为另一组,然后进行差异分析;
library(ggplot2)
# 19s计算1个cluster;min.pct=0.25 (minimum percentage)两个组中至少25%的细胞能检测到这个基因;用来过滤基因,加快运算速度;
###ident.1参数为比较的亚群如,亚群1和所有亚群
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 7)
#查找cluster 5 的差异基因(与clusters 0 和 3相比);
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
#差异分析的方法设置:默认秩和检验,也可设为ROC、t、DESeq2等多种方法;
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
###查找全部marker
markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25,return.thresh = 0.01)
markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
# find markers for every cluster compared to all remaining cells, report only the positive ones
#pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
#pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
#取top10的基因,用差异倍数排序; 备注:过滤过程用的是dplyr包的命令;%>%管道函数传输执行,类似于linux|
top10 <- markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
top10[c("cluster","gene")]
write.table(top10, file ="top10.txt", sep =" ", row.names =TRUE, col.names =TRUE, quote =TRUE)
# 用top10基因绘制热图,不画图例
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
#绘制基因小提琴图
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
# you can plot raw counts as well
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = T)
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
#5.为分群重新指定细胞类型;
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
#将pbmc的水平属性负值给new.cluster.ids的names属性;
names(new.cluster.ids)
#NULL
levels(pbmc)
#[1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
names(new.cluster.ids) <- levels(pbmc)
names(new.cluster.ids)
#[1] "0" "1" "2" "3" "4" "5" "6" "7" "8"
new.cluster.ids
pbmc <- RenameIdents(pbmc, new.cluster.ids)
#绘制tsne图(修改标签后的);
tsneplot2<-TSNEPlot(pbmc,label = TRUE, pt.size = 0.5)+ NoLegend()
tsneplot2
#画umap图
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
DimPlot
saveRDS(pbmc, file = "ump_end.rds")
```
umap降维结果图(图放大后点太小了,没调了,这里跟官方有点差异,因为坐标轴的原因,但是问题不大~)
t-sne降维聚类图
标记基因映射图
标记基因小提琴图
top10差异基因热图(这里绘图时有些基因没显示,The following features were omitted as they were not found in the scale.data slot for the RNA assay: CD8A, VPREB3, CD40LG, PIK3IP1, PRKCQ-AS1, NOSIP, LEF1, CD3E, CD3D, CCR7, LDHB, RPS3A,它们在scale.data插槽里没被发现,因此被忽略了)
通过基因热图,结合cellmarker数据库和相关文献,给分群后的图加上细胞名称(此为seurat官方图片)
亚群加名
下次绘图点要调大些~