[Package] Seurat 1: Cell classification

1. Guided Clustering Tutorial

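Read10X reads 10X Genomics output and returns a count matrix with features (genes) as rows and cells as columns.
CreateSeuratObject builds the Seurat object used in all later analysis; the counts are stored as a sparse matrix to save memory.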

library(dplyr)
library(Seurat)
library(patchwork)

# Load the PBMC dataset
pbmc.data <- Read10X(data.dir = "../data/pbmc3k/filtered_gene_bc_matrices/hg19/")
# Initialize the Seurat object with the raw (non-normalized data).
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
pbmc

1.1. Preprocessing

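Filtering cells based on QC metrics (see: Classification of low quality cells from single-cell RNA-seq data - PMC (nih.gov)). These thresholds are mostly set by hand, for example the number of genes detected per cell and the percentage of mitochondrial reads.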

# The [[ operator can add columns to object **metadata**. This is a great place to stash QC stats. The metadata is a per-cell table whose rows are named by cell barcode, as shown below:
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

The number of unique genes and total molecules are automatically calculated during CreateSeuratObject
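
To make the three QC columns concrete, here is a minimal base-R sketch on a toy count matrix. The matrix values and the variable names (counts, n_count, n_feature, percent_mt) are made up for illustration; this mirrors the idea behind nCount_RNA, nFeature_RNA and PercentageFeatureSet() rather than Seurat's actual implementation.

```r
# Toy count matrix: rows are genes, columns are cells (hypothetical values).
counts <- matrix(
  c(5, 0, 2,
    0, 3, 1,
    1, 1, 0,
    4, 0, 7),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-CO1", "CD3D", "MT-ND1", "LYZ"),
                  c("cell1", "cell2", "cell3"))
)

n_count   <- colSums(counts)      # total molecules per cell (cf. nCount_RNA)
n_feature <- colSums(counts > 0)  # genes detected per cell (cf. nFeature_RNA)

# Percentage of counts coming from genes whose name starts with "MT-"
mt_genes   <- grepl("^MT-", rownames(counts))
percent_mt <- 100 * colSums(counts[mt_genes, , drop = FALSE]) / n_count

qc <- data.frame(n_count, n_feature, percent_mt)
print(qc)
```

The same pattern argument ("^MT-") is what the PercentageFeatureSet() call above passes, so a dataset with differently named mitochondrial genes (for example lowercase "mt-" in mouse) would need a different pattern.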

head(pbmc@meta.data, 5)
##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898

VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
# Choose the thresholds from the violin plots
#We filter cells that have unique feature counts over 2,500 or less than 200
#We filter cells that have >5% mitochondrial counts

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
#subset?
[Figure: violin plots of the QC metrics]

1.2. Data normalization and scaling

# LogNormalize divides each cell's feature counts by the cell's total counts, multiplies by a scale factor (10,000 by default), and log-transforms the result

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
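
The arithmetic behind LogNormalize can be sketched in a few lines of base R. The toy matrix and the names counts, scale_factor and normalized are assumptions for illustration; the natural-log log1p form is how I understand the method, not code taken from Seurat.

```r
# Toy counts: 4 genes x 2 cells (hypothetical values).
counts <- matrix(
  c(1,  9,
    3,  1,
    0, 10,
    6,  0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("cellA", "cellB"))
)
scale_factor <- 10000

# Divide each column (cell) by its total, multiply by the scale factor,
# then take log(1 + x) so that zeros stay zero.
normalized <- log1p(sweep(counts, 2, colSums(counts), "/") * scale_factor)
print(round(normalized, 3))
```

After this step every cell has the same total on the un-logged scale, which is what makes cells with different sequencing depth comparable.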

1.3. Detection of highly variable features

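Downstream analysis focuses on the most variable genes, where variability is judged from the mean-variance relationship. FindVariableFeatures() returns 2,000 genes by default.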
Accounting for technical noise in single-cell RNA-seq experiments | Nature Methods

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
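
As a rough illustration of ranking genes by variability relative to their mean, the sketch below uses a plain variance-to-mean ratio on simulated counts. This is a deliberate simplification and not the "vst" method used above; the gene names and simulation settings are hypothetical.

```r
set.seed(11)
n_cells <- 200
group <- rep(c(1, 2), each = n_cells / 2)

# Hypothetical counts: three genes with stable expression and two genes
# whose expression differs between two groups of cells.
counts <- rbind(
  stable1   = rpois(n_cells, 5),
  stable2   = rpois(n_cells, 5),
  stable3   = rpois(n_cells, 5),
  variable1 = rpois(n_cells, ifelse(group == 1, 1, 9)),
  variable2 = rpois(n_cells, ifelse(group == 1, 9, 1))
)

gene_mean  <- rowMeans(counts)
gene_var   <- apply(counts, 1, var)
dispersion <- gene_var / gene_mean   # about 1 for pure Poisson noise

# Keep the genes whose variance most exceeds what the mean alone predicts
top2 <- names(sort(dispersion, decreasing = TRUE))[1:2]
print(top2)
```

All five genes have a similar mean, so it is the excess variance that separates the two group-dependent genes from the stable ones.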

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
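# Combining plots with + (patchwork) is very convenient; worth writing a reusable set of plotting helpers for later work
[Figure: variable feature plot, without and with labels]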

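2. Scaling

A required step before dimensional reduction: the expression of each gene is shifted and scaled across cells to mean 0 and variance 1, so that every gene carries equal weight and highly expressed genes do not dominate. The result is stored in pbmc[["RNA"]]@scale.data; the raw counts remain in pbmc[["RNA"]]@counts.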

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
# Scaling every gene as above is slow, though
#the default in ScaleData() is only to perform scaling on the previously identified variable features (2,000 by default).
pbmc <- ScaleData(pbmc)

# ScaleData can also regress out unwanted sources of variation, such as mitochondrial percentage or cell cycle scores
pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")
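
Here is a small base-R sketch of what per-gene scaling and regressing out a covariate amount to. The data are random, the names (expr, scaled, percent_mt, resid_gene1) are hypothetical, and the cap at 10 reflects my understanding of the ScaleData default (scale.max = 10), so treat the details as assumptions.

```r
set.seed(1)
# Hypothetical normalized expression: 4 genes x 5 cells
expr <- matrix(rnorm(20, mean = 5, sd = 2), nrow = 4,
               dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

# z-score each gene (row) across cells: mean 0, standard deviation 1
scaled <- t(scale(t(expr)))
scaled[scaled > 10] <- 10   # cap extreme values (assumed default cap of 10)

print(round(rowMeans(scaled), 10))
print(apply(scaled, 1, sd))

# Regressing out a covariate: fit a linear model per gene and keep the
# residuals, which are uncorrelated with the covariate by construction.
percent_mt  <- c(1, 2, 3, 4, 5)   # hypothetical per-cell covariate
resid_gene1 <- residuals(lm(expr["gene1", ] ~ percent_mt))
print(cor(resid_gene1, percent_mt))
```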

Seurat v3 added SCTransform(), which is now the recommended approach.
Introduction: Using sctransform in Seurat • Seurat (satijalab.org)
Method details: Normalization and variance stabilization of single-cell RNA-seq data using regularized negative binomial regression | Genome Biology | Full Text (biomedcentral.com)

3. Linear dimensional reduction

PCA
By default, only the previously determined variable features are used as input, but can be defined using features argument if you wish to choose a different subset.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

# Visualize cells and features with VizDimLoadings(), DimPlot() and DimHeatmap()
# Having this many visualization options is very useful
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7
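# Genes are rows and cells are columns, so each PC is a weighted combination of genes (the loadings listed above), and each cell gets a score on every PC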

VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
[Figure: PC loadings dot plot]
DimPlot(pbmc, reduction = "pca")
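[Figure: cells plotted on PC 1 and PC 2]
# The cells argument is shared by many plotting functions; given a number, it plots the most extreme cells at both ends of the PC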
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
[Figure: DimHeatmap of PC 1]

4. Determining the dimensionality (how many PCs to keep)

Seurat clusters cells based on their PCA scores, with each PC essentially representing a ‘metafeature’ that combines information across a correlated feature set.
4.1. JackStraw procedure
Permute a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of feature scores, and repeat this procedure. ‘Significant’ PCs are those with a strong enrichment of low p-value features.

# NOTE: This process can take a long time for big datasets, comment out for expediency. More
# approximate techniques such as those implemented in ElbowPlot() can be used to reduce
# computation time
pbmc <- JackStraw(pbmc, num.replicate = 100)
pbmc <- ScoreJackStraw(pbmc, dims = 1:20)
[Figure: JackStraw plot]

4.2. heuristic method
The standard PCA approach: look at the variance explained.
A ranking of principal components based on the percentage of variance explained by each one.

ElbowPlot(pbmc)
[Figure: elbow plot]
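
The quantity behind an elbow plot can be reproduced with base R's prcomp() on simulated data. Everything here (the two latent signals, the six features f1 to f6, the noise level) is a hypothetical setup chosen so that two PCs carry almost all of the structure; it is a sketch of the idea, not of Seurat's PCA code.

```r
set.seed(42)
n_cells <- 100

# Two independent latent signals, each driving three noisy features
signal1 <- rnorm(n_cells)
signal2 <- rnorm(n_cells)
x <- cbind(
  f1 = signal1 + rnorm(n_cells, sd = 0.3),
  f2 = signal1 + rnorm(n_cells, sd = 0.3),
  f3 = signal1 + rnorm(n_cells, sd = 0.3),
  f4 = signal2 + rnorm(n_cells, sd = 0.3),
  f5 = signal2 + rnorm(n_cells, sd = 0.3),
  f6 = signal2 + rnorm(n_cells, sd = 0.3)
)

pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Standard deviation per PC and the share of total variance it explains
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(data.frame(PC = seq_along(pca$sdev),
                 sdev = round(pca$sdev, 3),
                 pct_var = round(100 * var_explained, 1)))
```

In the printed table the values drop sharply after the second PC, which is the "elbow" one looks for when deciding how many PCs to pass on as dims.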

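4.3. Choosing the number of PCs is fairly subjective; there is no hard threshold. Other methods can help narrow it down. In my view the best guide is to think it through and check the downstream results, such as cluster annotation and GSEA.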

5. Clustering

Distance metric: the distance metric which drives the clustering analysis (based on previously identified PCs) remains the same.
Clustering algorithm: a graph-based approach. Embed cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar feature expression patterns, and then attempt to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’.
For the underlying method, see:
http://bioinformatics.oxfordjournals.org/content/early/2015/02/10/bioinformatics.btv088.abstract
http://www.ncbi.nlm.nih.gov/pubmed/26095251
Implementation in Seurat:

  1. As in PhenoGraph, we first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity).
  2. modularity optimization techniques such as the Louvain algorithm (default) or SLM, to iteratively group cells together, with the goal of optimizing the standard modularity function.
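
Step 1 above can be sketched in base R on a handful of simulated cells: find each cell's nearest neighbors in PC space, then weight each pair of cells by the Jaccard overlap of their neighborhoods. The data, the choice k = 2, and the decision to count a cell as its own neighbor are assumptions for illustration. Step 2 (Louvain or SLM modularity optimization) is not implemented here.

```r
set.seed(7)
# Hypothetical PCA scores: 6 cells x 2 PCs, two well-separated groups of 3
pcs <- rbind(
  matrix(rnorm(6, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(6, mean = 5, sd = 0.2), ncol = 2)
)
rownames(pcs) <- paste0("cell", 1:6)

k <- 2
d <- as.matrix(dist(pcs))   # Euclidean distance in PC space

# Neighborhood of each cell: itself plus its k nearest other cells
neighbors <- lapply(seq_len(nrow(d)), function(i) order(d[i, ])[1:(k + 1)])

jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Shared-nearest-neighbor weights: Jaccard overlap of two neighborhoods
n <- nrow(pcs)
snn <- matrix(0, n, n, dimnames = list(rownames(pcs), rownames(pcs)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    snn[i, j] <- jaccard(neighbors[[i]], neighbors[[j]])
  }
}
print(snn)
```

Cells in the same group share their whole neighborhood and get weight 1, while cells in different groups share nothing and get weight 0; a community detection algorithm then only has to cut the graph where the weights vanish.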
# Build the KNN graph and the shared-neighbor edge weights
pbmc <- FindNeighbors(pbmc, dims = 1:10)
# Unsupervised clustering; the resolution parameter sets the ‘granularity’ of the downstream clustering
#We find that setting this parameter between 0.4-1.2 typically returns good results for single-cell datasets of around 3K cells. Optimal resolution often increases for larger datasets. 
pbmc <- FindClusters(pbmc, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 95927
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8728
## Number of communities: 9
## Elapsed time: 0 seconds

#The clusters can be found using the Idents() function.
# Look at cluster IDs of the first 5 cells
head(Idents(pbmc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                2                3                2                1 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

6. Non-linear dimensional reduction for visualization (UMAP/tSNE)

Cells within the graph-based clusters determined above should co-localize on these dimension reduction plots. As input to the UMAP and tSNE, we suggest using the same PCs as input to the clustering analysis.
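Cell populations in high-dimensional data are generally not linearly separable once projected to two dimensions, so a non-linear method is applied after PCA for display.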

# If you haven't installed UMAP, you can do so via reticulate::py_install(packages =
# 'umap-learn')
pbmc <- RunUMAP(pbmc, dims = 1:10)

# note that you can set `label = TRUE` or use the LabelClusters function to help label
# individual clusters
DimPlot(pbmc, reduction = "umap")
[Figure: UMAP colored by cluster]

You can save the object at this point so that it can easily be loaded back in without having to rerun the computationally intensive steps performed above, or easily shared with collaborators.

saveRDS(pbmc, file = "../output/pbmc_tutorial.rds")

7. cluster biomarkers

Seurat can help you find markers that define clusters via differential expression
By default, it identifies positive and negative markers of a single cluster (specified in ident.1), compared to all other cells.
You can also test groups of clusters against each other, or against all cells.

# FindMarkers(), FindAllMarkers()
# min.pct: requires a feature to be detected at a minimum percentage in either of the two groups of cells
# logfc.threshold (called thresh.test in the tutorial text): requires a feature to be differentially expressed (on average) by some amount between the two groups
# max.cells.per.ident: downsample each identity class to have no more cells than whatever this is set to
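
For a single gene, the pct.1 / pct.2 columns, the min.pct filter and the p-values can be sketched as below. The simulated expression values, the variable names, and the count of 2,000 tested genes are hypothetical; the sketch assumes the default Wilcoxon rank sum test with Bonferroni adjustment and leaves out the fold-change filter.

```r
set.seed(3)
# Hypothetical normalized expression of one gene in two groups of 50 cells
group1 <- c(rexp(40, rate = 0.5), rep(0, 10))   # detected in 40 of 50 cells
group2 <- c(rexp(10, rate = 2),   rep(0, 40))   # detected in 10 of 50 cells

# Fraction of cells in each group where the gene is detected
pct1 <- mean(group1 > 0)
pct2 <- mean(group2 > 0)

# min.pct: keep the gene if it is detected often enough in either group
min_pct <- 0.25
passes_min_pct <- max(pct1, pct2) >= min_pct

# Wilcoxon rank sum test between the two groups
p_val <- wilcox.test(group1, group2, exact = FALSE)$p.value

# Bonferroni correction over an assumed number of genes in the dataset
n_genes_tested <- 2000
p_val_adj <- p.adjust(p_val, method = "bonferroni", n = n_genes_tested)

print(c(pct.1 = pct1, pct.2 = pct2))
print(c(p_val = p_val, p_val_adj = p_val_adj))
```

Raising min.pct mainly saves time by skipping rarely detected genes before any test is run, at the cost of possibly missing markers of small populations.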

# find all markers of cluster 2
cluster2.markers <- FindMarkers(pbmc, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
##             p_val avg_log2FC pct.1 pct.2    p_val_adj
## IL32 2.892340e-90  1.2013522 0.947 0.465 3.966555e-86
## LTB  1.060121e-86  1.2695776 0.981 0.643 1.453850e-82
## CD3D 8.794641e-71  0.9389621 0.922 0.432 1.206097e-66
## IL7R 3.516098e-68  1.1873213 0.750 0.326 4.821977e-64
## LDHB 1.642480e-67  0.8969774 0.954 0.614 2.252497e-63


# find all markers distinguishing cluster 5 from clusters 0 and 3
cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
##                       p_val avg_log2FC pct.1 pct.2     p_val_adj
## FCGR3A        8.246578e-205   4.261495 0.975 0.040 1.130936e-200
## IFITM3        1.677613e-195   3.879339 0.975 0.049 2.300678e-191
## CFD           2.401156e-193   3.405492 0.938 0.038 3.292945e-189
## CD68          2.900384e-191   3.020484 0.926 0.035 3.977587e-187
## RP11-290F20.3 2.513244e-186   2.720057 0.840 0.017 3.446663e-182


# find markers for every cluster compared to all remaining cells, report only the positive ones
pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
pbmc.markers %>%
    group_by(cluster) %>%
    slice_max(n = 2, order_by = avg_log2FC)
## # A tibble: 18 × 7
## # Groups:   cluster [9]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene    
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
##  1 9.57e- 88       1.36 0.447 0.108 1.31e- 83 0       CCR7    
##  2 3.75e-112       1.09 0.912 0.592 5.14e-108 0       LDHB    
##  3 0               5.57 0.996 0.215 0         1       S100A9  
##  4 0               5.48 0.975 0.121 0         1       S100A8  
##  5 1.06e- 86       1.27 0.981 0.643 1.45e- 82 2       LTB     
##  6 2.97e- 58       1.23 0.42  0.111 4.07e- 54 2       AQP3    
##  7 0               4.31 0.936 0.041 0         3       CD79A   
##  8 9.48e-271       3.59 0.622 0.022 1.30e-266 3       TCL1A   
##  9 5.61e-202       3.10 0.983 0.234 7.70e-198 4       CCL5    
## 10 7.25e-165       3.00 0.577 0.055 9.95e-161 4       GZMK    
## 11 3.51e-184       3.31 0.975 0.134 4.82e-180 5       FCGR3A  
## 12 2.03e-125       3.09 1     0.315 2.78e-121 5       LST1    
## 13 3.13e-191       5.32 0.961 0.131 4.30e-187 6       GNLY    
## 14 7.95e-269       4.83 0.961 0.068 1.09e-264 6       GZMB    
## 15 1.48e-220       3.87 0.812 0.011 2.03e-216 7       FCER1A  
## 16 1.67e- 21       2.87 1     0.513 2.28e- 17 7       HLA-DPB1
## 17 1.92e-102       8.59 1     0.024 2.63e- 98 8       PPBP    
## 18 9.25e-186       7.29 1     0.011 1.27e-181 8       PF4
#slice_max

Several alternative differential expression tests are available:
Differential expression testing • Seurat (satijalab.org)
For example, ROC:

#test.use parameter
cluster0.markers <- FindMarkers(pbmc, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)

Visualizing marker expression:
VlnPlot(): shows expression probability distributions across clusters
FeaturePlot(): visualizes feature expression on a tSNE or PCA plot (our most commonly used visualizations)
RidgePlot(), CellScatter(), and DotPlot(): additional methods to view your dataset
DoHeatmap(): generates an expression heatmap for given cells and features.

VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
[Figure: violin plots of MS4A1 and CD79A]
# you can plot raw counts as well, via the slot argument
VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)
[Figure: violin plots of raw counts]
FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
    "CD8A"))
[Figure: feature plots]
pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc, features = top10$gene) + NoLegend()
[Figure: heatmap of top markers]

8. Assigning cell type identity to clusters

use canonical markers to easily match the unbiased clustering to known cell types


[Figure: cluster annotation]
new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
    "NK", "DC", "Platelet")
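names(new.cluster.ids) <- levels(pbmc)  # build a named vector: old cluster ID -> new label
pbmc <- RenameIdents(pbmc, new.cluster.ids)  # RenameIdents; Seurat has a lot of small helper functions
DimPlot(pbmc, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()  # NoLegend() drops the legend; the plotting API is convenient and direct
[Figure: UMAP with annotated cluster labels]
# Finally, save the object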
saveRDS(pbmc, file = "../output/pbmc3k_final.rds")
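
The named vector handed to RenameIdents above is ordinary R: names are the old cluster IDs, values are the new labels. The base-R sketch below applies the same lookup to a plain factor; the five toy cluster assignments and the names clusters, new_ids and renamed are hypothetical.

```r
# Hypothetical cluster assignments for five cells
clusters <- factor(c("0", "2", "1", "0", "2"))

# New labels, named by the old cluster IDs (same pattern as new.cluster.ids)
new_ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T")
names(new_ids) <- levels(clusters)

# Look up each cell's old ID in the named vector to get its new label
renamed <- factor(new_ids[as.character(clusters)], levels = unname(new_ids))
print(table(renamed))
```

Because the lookup goes by name, the labels must be listed in the same order as levels(pbmc); a shuffled vector would silently mislabel clusters.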
