This tutorial is based on ANALYSIS OF SINGLE CELL RNA-SEQ DATA: https://broadinstitute.github.io/KrumlovSingleCellWorkshop2020/index.html
This post shares notes on some of the harder-to-follow parts of the official Seurat tutorial.
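Two key differences between scRNA-Seq and bulk RNA-Seq are drop-out and the potential for quality control (QC) metrics. Drop-out refers to the many zero values that result from the limited amount of RNA captured from a single cell. For that reason the counts are stored as a sparse matrix rather than a dense matrix. This saves storage space, because a sparse matrix assumes most entries are 0 (printed as .) and stores only the non-zero values.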
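To make the sparse versus dense distinction concrete, here is a minimal sketch using the Matrix package. The toy values and the names dense, sparse and n_nonzero are made up for illustration and are not part of the tutorial data.

```r
library(Matrix)

# Toy count matrix: 4 genes x 5 cells, mostly zeros (values are made up)
dense <- matrix(c(0, 0, 3, 0, 0,
                  0, 1, 0, 0, 0,
                  0, 0, 0, 0, 0,
                  2, 0, 0, 0, 5),
                nrow = 4, byrow = TRUE,
                dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

# Convert to a column-compressed sparse matrix
sparse <- Matrix(dense, sparse = TRUE)

print(sparse)                  # zeros are printed as "."
class(sparse)                  # class of the sparse representation
n_nonzero <- length(sparse@x)  # only the non-zero values are stored in @x
n_nonzero
```

On real 10x data the share of zeros is usually very high, so storing only the non-zero entries is where the space saving comes from.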
Quality control is very important before identifying and analyzing cell subpopulations.
counts <- Read10X(data.dir = counts_matrix_filename) # Seurat function to read in 10x count data
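Reading the files returns a sparse Matrix of class "dgCMatrix".
The rownames are the gene names (features) and the colnames are the 16 nt barcode sequences (cells). Functions such as dim(counts) show the number of rows and columns of the matrix, i.e. the number of genes and the total number of cells.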
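Matrix::colSums(counts) computes the sum of counts in each column, i.e. counts per cell.
Matrix::rowSums(counts) computes the sum of counts in each row, i.e. counts per gene.
Matrix::colSums(counts > 0) counts, for each column, how many rows have a count greater than 0, i.e. how many genes are expressed in each cell. colSums and rowSums both return a vector.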
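The following small sketch applies these summaries to a toy matrix so the results can be checked by eye. The matrix counts_toy and its values are made up for illustration.

```r
library(Matrix)

# Toy counts: 3 genes (rows) x 4 cells (columns); values are made up
counts_toy <- Matrix(c(0, 2, 0, 1,
                       5, 0, 0, 0,
                       1, 1, 0, 3),
                     nrow = 3, byrow = TRUE, sparse = TRUE,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     paste0("cell", 1:4)))

counts_per_cell <- Matrix::colSums(counts_toy)      # total counts in each cell
counts_per_gene <- Matrix::rowSums(counts_toy)      # total counts for each gene
genes_per_cell  <- Matrix::colSums(counts_toy > 0)  # genes detected in each cell
cells_per_gene  <- Matrix::rowSums(counts_toy > 0)  # cells in which each gene is detected

counts_per_cell
genes_per_cell
```

Note that cell3 has no counts at all, so both its total and its number of detected genes are 0.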
A plot makes failed libraries (lower end outliers) and cell doublets (higher end outliers) visible.
genes_per_cell <- Matrix::colSums(counts > 0) # number of genes detected in each cell
plot(sort(genes_per_cell), xlab='cell', log='y', main='genes per cell (ordered)')
seurat <- CreateSeuratObject(counts = counts, min.cells = 3, min.features = 350, project = "10X_NSCLC")
This keeps genes that are expressed in 3 or more cells, and cells that express 350 or more genes.
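The effect of min.cells and min.features can be sketched on a toy matrix. The helper filter_counts below is hypothetical and is not a Seurat function. It assumes cells are filtered first and genes second, which is my understanding of what CreateSeuratObject does, and it uses small thresholds so the toy data stays readable.

```r
library(Matrix)

# Hypothetical helper, not part of Seurat: mimics min.features / min.cells
filter_counts <- function(counts, min.cells = 0, min.features = 0) {
  # keep cells that express at least min.features genes
  genes_per_cell <- Matrix::colSums(counts > 0)
  counts <- counts[, genes_per_cell >= min.features, drop = FALSE]
  # keep genes detected in at least min.cells of the remaining cells
  cells_per_gene <- Matrix::rowSums(counts > 0)
  counts[cells_per_gene >= min.cells, , drop = FALSE]
}

# Toy counts: 4 genes x 4 cells; values are made up
toy <- Matrix(c(1, 0, 2, 0,
                0, 0, 1, 0,
                3, 1, 1, 0,
                0, 0, 4, 0),
              nrow = 4, byrow = TRUE, sparse = TRUE,
              dimnames = list(paste0("gene", 1:4), paste0("cell", 1:4)))

filtered <- filter_counts(toy, min.cells = 2, min.features = 2)
filtered
```

With these toy thresholds only cell1 and cell3 pass the cell filter, and only gene1 and gene3 are detected in both of them.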
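The seurat object contains multiple slots. They store not only the raw count data but also computed results, and they are updated as the analysis proceeds. For example, meta.data is a data frame whose initial columns are orig.ident, nCount_RNA and nFeature_RNA; a column can be accessed with $ or [[ (e.g. seurat$nCount_RNA).
PS: orig.ident holds the original identity of each cell, which at this point is the project name 10X_NSCLC. After clustering, the active identity of a cell (see Idents(seurat)) is the cluster it belongs to, while orig.ident itself stays unchanged.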
GetAssayData(object = seurat, slot = 'data')[1:10, 1:10]
utils::methods(class = 'Seurat')
ls("package:Seurat")
# The [[ operator can add columns to object metadata. This is a great place to stash QC stats
seurat[["percent.mt"]] <- PercentageFeatureSet(object = seurat, pattern = "^MT-")
VlnPlot(object = seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
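To show what percent.mt measures, here is a by-hand sketch of the same idea on a toy matrix: the share of each cell's counts that comes from genes whose names start with MT-, expressed as a percentage. The gene names and values are made up, and this is an illustration rather than the code Seurat runs internally.

```r
library(Matrix)

# Toy counts: 2 mitochondrial genes and 2 other genes x 3 cells (made up)
toy <- Matrix(c(10,  0,  5,
                 0,  4,  5,
                30,  6,  0,
                60, 10, 10),
              nrow = 4, byrow = TRUE, sparse = TRUE,
              dimnames = list(c("MT-CO1", "MT-ND1", "ACTB", "GAPDH"),
                              paste0("cell", 1:3)))

# Genes matching the same pattern as in the PercentageFeatureSet call
mt_genes <- grep("^MT-", rownames(toy), value = TRUE)

# Percentage (0 to 100) of each cell's counts that are mitochondrial
percent_mt <- 100 * Matrix::colSums(toy[mt_genes, , drop = FALSE]) /
  Matrix::colSums(toy)
percent_mt
```

Because the value is on a 0 to 100 scale, a cutoff such as percent.mt < 10 means "less than 10% mitochondrial counts".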
# Load the list of housekeeping genes
hkgenes <- read.table(paste0(dirname, "housekeepers.txt"), skip = 2)
hkgenes <- as.vector(hkgenes$V1)
# remove hkgenes that were not found
hkgenes.found <- which(toupper(rownames(seurat@assays$RNA@data)) %in% hkgenes)
n.expressed.hkgenes <- Matrix::colSums(seurat@assays$RNA@data[hkgenes.found, ] > 0)
seurat <- AddMetaData(object = seurat, metadata = n.expressed.hkgenes, col.name = "n.exp.hkgenes")
VlnPlot(object = seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt", "n.exp.hkgenes"), ncol = 2)
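# Older Seurat call, kept for reference only. It refers to a percent.mito column stored as a fraction (hence 0.1), which does not exist in this object:
# seurat <- SubsetData(object = seurat, subset.names = c("nFeature_RNA", "percent.mito","n.exp.hkgenes"), low.thresholds = c(350, -Inf,55), high.thresholds = c(5000, 0.1, Inf))
# Equivalent filtering with subset(); percent.mt is a percentage, so the cutoff is 10
seurat <- subset(seurat, nFeature_RNA < 5000)
seurat <- subset(seurat, nFeature_RNA > 350)
seurat <- subset(seurat, n.exp.hkgenes > 55)
seurat <- subset(seurat, percent.mt < 10)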
seurat <- NormalizeData(object = seurat, normalization.method = "LogNormalize", scale.factor = 1e4)
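As a sketch of what "LogNormalize" does: each cell's counts are divided by that cell's total, multiplied by scale.factor, and then transformed with log1p. The helper log_normalize below is hypothetical (not a Seurat function) and the toy values are made up; it assumes every cell has a non-zero total.

```r
library(Matrix)

# Hypothetical helper, not part of Seurat: by-hand log-normalization
log_normalize <- function(counts, scale.factor = 1e4) {
  totals <- Matrix::colSums(counts)                 # total counts per cell
  # t(counts) has one row per cell, so dividing by totals scales each cell
  scaled <- t(t(counts) / totals) * scale.factor
  log1p(scaled)                                     # natural log of (1 + x)
}

# Toy counts: 3 genes x 2 cells; values are made up
toy <- Matrix(c(1, 0,
                3, 5,
                0, 5),
              nrow = 3, byrow = TRUE, sparse = TRUE,
              dimnames = list(paste0("gene", 1:3), paste0("cell", 1:2)))

norm <- log_normalize(toy)
norm
```

After undoing the log with expm1, every cell sums to scale.factor, which is what makes cells with different sequencing depth comparable.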
seurat <- FindVariableFeatures(object = seurat, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(seurat), 10)
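# Same call with more parameters spelled out. Note: mean.function, dispersion.function, num.bin,
# mean.cutoff and dispersion.cutoff are described for the mean.var.plot method, so with "vst" they likely have no effect.
seurat <- FindVariableFeatures(object = seurat, selection.method = "vst", nfeatures = 2000,
                               mean.function = ExpMean, dispersion.function = LogVMR,
                               num.bin = 40,
                               mean.cutoff = c(0.0125, 1),
                               dispersion.cutoff = c(0, 0.5))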
## Check number of variable genes
length(seurat@assays$RNA@var.features)
selection.method
How to choose top variable features. Choose one of:
vst: First, fits a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess). Then standardizes the feature values using the observed mean and expected variance (given by the fitted line). Feature variance is then calculated on the standardized values after clipping to a maximum (see clip.max parameter).
mean.var.plot (mvp): First, uses a function to calculate average expression (mean.function) and dispersion (dispersion.function) for each feature. Next, divides features into num.bin (default 20) bins based on their average expression, and calculates z-scores for dispersion within each bin. The purpose of this is to identify variable features while controlling for the strong relationship between variability and average expression.
dispersion (disp): selects the genes with the highest dispersion values
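As a rough illustration of the dispersion idea only, the sketch below ranks genes by a log variance-to-mean ratio on a toy matrix and picks the top two. This is a simplified assumption of how such a ranking can work, written in base R with made-up values; Seurat's own implementation differs in detail.

```r
# Toy expression matrix: 4 genes x 5 cells; values are made up
expr <- matrix(c(5,  5, 5,  5, 5.5,
                 0, 10, 0, 12, 1,
                 2,  3, 2,  3, 2,
                 1,  0, 9,  0, 0),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

gene_mean  <- rowMeans(expr)              # average expression per gene
gene_var   <- apply(expr, 1, var)         # variance per gene
dispersion <- log(gene_var / gene_mean)   # log variance-to-mean ratio

# Genes with the highest dispersion are treated as the most variable
top2 <- names(sort(dispersion, decreasing = TRUE))[1:2]
top2
```

gene1 has a high mean but barely varies between cells, so it ranks last, while gene2 and gene4 switch between low and high values and rank first.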
More on the three selection methods in the next post!