Seurat安装
想要安装旧版本(目前是3.X,与2.X有较大差别)
install.packages('devtools')
# 替换成2.3.0版本
>devtools::install_version(package = 'Seurat', version = package_version('2.3.0'))
> library(Seurat)
Seurat2.X与Seurat3.X妄图切换(凑字数警告,并没有成功,我安装了个寂寞)
原路径
.libPaths()
[1] "C:/Users/kongrui/Documents/R/win-library/4.0"
[2] "C:/Program Files/R/R-4.0.0/library"
#安装Seurat2在D盘
.libPaths("D:/Program Files/R/R-4.0.0/library2")
install.packages("devtools")
require(devtools)
.libPaths("D:/Program Files/R/R-4.0.0/library2")
require(devtools)
devtools::install_version(package = 'Seurat', version = package_version('2.3.0'))
几个概念
- UMI (Unique Molecular Identifier)是一段固定长度的随机小片段,用以区别不同的PCR扩增模板
- barcode标记不同的样品或者细胞
- sparse-matrix representation:稀疏矩阵的概念,简单理解在Seurat中基因表达量为0者以
.
表示,可节省空间
Seurat2.X
-
counts
矩阵进来后被包装为对象,方便操作。 - 然后一定要经过
NormalizeData
和ScaleData
的操作 - 函数
FindVariableGenes
可以挑选适合进行下游分析的基因集。 - 函数
RunPCA
和RunTSNE
进行降维 - 函数
FindClusters
直接就分群了,非常方便 函数FindAllMarkers
可以对分群后各个亚群找标志基因。 - 函数
FeaturePlot
可以展示不同基因在所有细胞的表达量 - 函数
VlnPlot
可以展示不同基因在不同分群的表达量差异情况 函数DoHeatmap
可以选定基因集后绘制热图
Seurat3.X
- 与V2版本区别初了解
- 函数名称的变化
- genes —>features
- v2的nUMI和nGene分别改为了nCount_RNA、nFeature_RNA
- SubsetData函数 —> subset函数
- Seurat采用的是graph-based聚类方法,k-means方法在V3中不存在了
- 可视化降维 - UMAP(V3新出)
非线性降维——这个目的是为了可视化,而不是特征提取(PCA),虽然它也可以用来做特征提取;UMAP方法更加紧凑——在降维图上,同一cluster离得更近,不同cluster离得更远,作为一种后来的算法有一定的优点,但是t-SNE结构也能很好地反映cluster的空间结构
V3.X Workflow
pbmc.counts <- Read10X(data.dir = "~/Downloads/pbmc3k/filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.counts)
pbmc <- NormalizeData(object = pbmc)
pbmc <- FindVariableFeatures(object = pbmc)
pbmc <- ScaleData(object = pbmc)
pbmc <- RunPCA(object = pbmc)
pbmc <- FindNeighbors(object = pbmc)
pbmc <- FindClusters(object = pbmc)
pbmc <- RunTSNE(object = pbmc)
DimPlot(object = pbmc, reduction = "tsne")
3.X版本
- CreateSeuratObject() Setup the Seurat Object
Arguments
counts:Unnormalized data such as raw counts or TPMs未归一化的原始数据
project:Sets the project name for the Seurat object
assay:Name of the assay corresponding to the initial input data
min.cells:Include features detected in at least this many cells. Will subset the counts matrix as well. To reintroduce excluded features, create a new object with a lower cutoff. (设定阈值/门槛标准)每个基因至少要在三个细胞中有表达
min.features:Include cells where at least this many features are detected.每个细胞中至少可检测到200个基因(示例数据中的过滤条件,具体可见下文或官网教程)
meta.data:Additional cell-level metadata to add to the Seurat object. Should be a data frame where the rows are cell names and the columns are additional metadata fields.(可为临床数据或其他数据,信息的信息,数据的数据---属性等,要求行为细胞、列为信息的数据框,示例数据中为默认meta.data = NULL) - Standard pre-processing workflow:selection and filtration of cells data—— normalization and scaling—— detection of highly variable features
2.1 QC and selecting cells for further analysis
2.11PercentageFeatureSet():
Calculate the percentage of all counts that belong to a given set of features
相当于增加一个亚组,示例数据中为线粒体基因(scRNA-seq常用指标),即计算每个细胞中线粒体基因所占的百分比
PercentageFeatureSet(object, pattern = NULL, features = NULL, col.name = NULL, assay = NULL)
eg:pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
注意MT为大写,通配符“^”表示查找行首符合指定内容的行,这里为MT。
2.12可视化相关待过滤的指标
VlnPlot()
nCount_RNA
列:每个细胞的UMI数量,详细数据存贮于metadata中
nFeature_RNA
列:每个细胞的基因数,详细数据存贮于metadata中
FeatureScatter()
展示相关特征之间的关系,如UMI和基因数的关系,UMI和线粒体百分比
subset() (过滤函数,上文说过不同于2.X版本的)
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
将基因count数超过2500 & 小于200 & 线粒体占比5%以上的细胞过滤掉
参数选择,由上图 QC1可见,nFeature_RNA基本在1000-2000之间,2500及以上可能是测序过程中将2个细胞算为1个细胞,所以Feature_RNA偏离主体,所以选用2500(可根据自己数据真实情况调整摸索)。同样的,MT基本集中在5%以下,将偏离主体的删除掉。
***以上操作均为标准化前的简单准备工作,去掉偏离的细胞。
2.2 Normalizing the data
默认方法: “LogNormalize” ,结果存储在pbmc[["RNA"]]@data
log(表达量 X “scale factor (10,000 by default)”)
2.3 Identification of highly variable features (feature selection)
寻找cell to cell varaition 基因(Features that are highly expressed in some cells, and lowly expressed in others)
FindVariableFeatures()函数
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
By default, we return 2,000 features per dataset.默认每个dataset返回2000个基因
2.4 Scaling the data
The results of this are stored in pbmc[["RNA"]]@scale.data
线性转化使得基因在细胞的平均表达值为0,细胞间的差异值为1,主要为了降低高表达基因在下游分析中所占权重过大,掩盖其他基因间的差异(例如热图,有一个基因表达值极高,导致绘制的热图出现除这个基因为红色外其他全部为低表达颜色蓝色,掩盖差异,无法进行进一步分析)
pbmc <- ScaleData(pbmc, features = all.genes)
features = Vector of features names to scale/center. Default is variable features.
pbmc <- ScaleData(pbmc, vars.to.regress = c("percent.mt",nCount_RNA))
删除不必要技术噪音影响:vars.to.regress
:Variables to regress out (previously latent.vars in RegressOut). For example, nUMI, or percent.mito. - Perform linear dimensional reduction线性降维
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
3.1 PCA后可视化
VizDimLoadings()
Visualize top genes associated with reduction components
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
VizDimLoadings(object, dims = 1:5, nfeatures = 30, col = "blue", reduction = "pca", projected = FALSE, balanced = FALSE, ncol = NULL, combine = TRUE)
dims:Number of dimensions to display 展示维度的数量
nfeatures:Number of genes to display
VizDimLoadings(pbmc, dims = 1:3, nfeatures = 10, col = "pink", reduction = "pca")
改变参数后绘图
DimPlot(pbmc, reduction = "pca")
DimHeatmap()
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
dims = 1 只画第一个主成分
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
DimHeatmap's Arguments
object Seurat object
dims Dimensions to plot
nfeatures Number of genes to plot(default 30)
cells A list of cells to plot. If numeric, just plots the top cells.
3.2 Determine the ‘dimensionality’ of the dataset如何根据PCA值确定维度对细胞进行分类
方法1:JackStraw()函数
In this case it appears that there is a sharp drop-off in significance after the first 10-12 PCs.(PC13较之前的PC,p-value值有明显波动)
方法2:ElbowPlot()函数
此处一般可选10,即到PC10基本可以涵盖数据全貌,类似于WGCNA的β值 - Cluster the cells
细胞聚类
Seurat v3应用graph-based clustering方法,官方称其“diatance metric”距离度量方法较之前有较大进步,被很多人应用巴拉。。。(https://satijalab.org/seurat/v3.1/pbmc3k_tutorial.html)官网曰这种方法将细胞嵌入到一种图形结构中(例如 K-nearest neighbor (KNN) graph),这种图形会将具有相似的基因表达属性的细胞们绘制到一个社区内。
graph的绘制其实都是基于一定的统计学算法,例如KNNgraph是基于euclidean distance算法,并且重新定义细胞与细胞直接的连线(想象一下如果可视化的话联系的紧密程度可以简单画个线,用线的粗细表示相应距离的远近,或者社区内的细胞之间的重叠度大小表示紧密程度),PS:各种算法解释其实一定程度上可以参考dist()函数,如果统计及数学足够好的话可以深入研究下;上述工作可用FindNeighbors()函数实现
FindClusters()函数:完成细胞聚类
应用模块优化方法更好的完成细胞聚类,其中一个重要的参数resolution一般3K左右细胞resolution参数值可选择0.4-1.2之间(根据数据调整,不断调试)。应用Idents函数可以查看细胞被分到哪个组。 - Run non-linear dimensional reduction (UMAP/tSNE)
结果好像比tSNE聚的更紧凑 - 下游分析:寻找差异表达基因
6.1 FindMarkers()函数
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
ident.1第一群细胞的marker基因,并没有给出ident.2参数,说明是与其他所有组相比
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
ident.2 = c(0,3)说明是将第五群与第3及0群相比
将已分类好的群再次进行分类,要灵活掌握该函数的参数了或者强行进行亚分群后再走一遍Seurat流程。参考大神讲解https://cloud.tencent.com/developer/article/1621989
6.2 FindAllMarkers()
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
差异检验:
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE) ; head(cluster1.markers)
应用ROC方法检验,返回值“classification power”(范围0~1之间,越接近于1效果越好)
可视化
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"))
整体代码
rm(list = ls())
packageVersion("Seurat")
library(Seurat)
library(dplyr)
library(patchwork)
library(ggsci)###改变一下配色
#导入PBMC数据,用Read10×这个参数
pbmc.data <- Read10X(data.dir = "D:pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
?CreateSeuratObject
pbmc
#查看pbmc.data某个基因在前三十个细胞中的表达
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
# 把表达矩阵里的线粒体QC单独存放在Seurat对象里,等于添加一列到metadata的对象里
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head([email protected])
#展示数据中每个细胞UMI,基因数即线粒体比例
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
#展示相关特征之间的关系
plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
#将unique基因count数超过2500,或者小于200的细胞过滤掉,把线粒体count数占5%以上的细胞过滤掉
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
#标准化:表达量*10000后取log,结果存储在pbmc[["RNA"]]@data
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
#选取高变异基因
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
top10 <- head(VariableFeatures(pbmc), 10)
?FindVariableFeatures
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
#scale线性转换
all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
?ScaleData
#降维 scale后PCA
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
?VizDimLoadings()
DimPlot(pbmc, reduction = "pca")
?DimPlot
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
?DimHeatmap
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
JackStrawPlot(pbmc, dims = 1:15)
?JackStraw
?FindClusters
pbmc <- FindNeighbors(pbmc, dims = 1:10)
pbmc <- FindClusters(pbmc, resolution = 0.5)
head(Idents(pbmc), 5)
#降维可视化
pbmc <- RunUMAP(pbmc, dims = 1:10)
DimPlot(pbmc, reduction = "umap")
saveRDS(pbmc, file = "pbmc_tutorial.rds")
#差异表达基因
# find all markers of cluster 1
cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
?FindMarkers
markers <- FindMarkers(pbmc_small, ident.1 = "g1", group.by = 'groups', subset.ident = "2")
head(x = markers)
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
cluster1.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
head(cluster1.markers)
VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
"CD8A"))
top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono",
"NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc)
pbmc <- RenameIdents(pbmc, new.cluster.ids)
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()
saveRDS(pbmc, file = "pbmc3k_final.rds")