自学Seurat的一点记录

教程来自ANALYSIS OF SINGLE CELL RNA-SEQ DATAANALYSIS OF SINGLE CELL RNA-SEQ DATAhttps://broadinstitute.github.io/KrumlovSingleCellWorkshop2020/index.html

本文主要针对Seurat官网教程上一些难懂的部分进行分享。

 scRNA-Seq和bulk RNA-Seq两个关键的不同是drop-out和potential for quality control (QC) metrics,前者是指单细胞中有限的RNA探测导致很多的0值,也正因如此,使用sparse matrix而不是dense matrix进行存储,这样节省存储空间,因为sparse matrix默认大多数为0(用 . 表示),只存储非零值。

识别分析细胞亚群之前,质量控制非常重要

counts <- Read10X(data.dir = counts_matrix_filename)  # Seurat function to read in 10x count data

 将文件读入后得到sparse Matrix of class "dgCMatrix"

rownames是不同的gene名(feature),colnames是不同的16 nt 长barcode序列(细胞),可以通过dim(counts)等函数查看matrix的行列数,即总细胞数和基因数。

过滤低质量的细胞

通过 Matrix::colSums(counts) 计算每一列的count和,即counts per cell

通过 Matrix::rowSums(counts) 计算每一行的count和,即counts per gene

而 Matrix::colSums(counts > 0)则是分别计算每一列中counts大于0的有多少行,即细胞中有多少个基因被表达了,colSums 和 rowSums函数返回的是向量vector形式。

通过探测的基因数量(文库复杂性)对细胞排序

通过plot图可以观察到failed libraries (lower end outliers) 或cell doublets (higher end outliers).

plot(sort(genes_per_cell), xlab='cell', log='y', main='genes per cell (ordered)')

使用Seurat beginning

  • 创建Seurat对象
    seurat <- CreateSeuratObject(counts = counts, min.cells = 3, min.features = 350, project = "10X_NSCLC")

    留下在3个或更多细胞中表达的基因,表达350个或更多基因的细胞。

seurat对象包括多个slots ,不仅存储原始的count数据,也存储计算后的结果,会随着分析不断更新。如meta.data,是一个数据框,原始变量包括orig.ident, nCount_RNA, nFeature_RNA,可以用$或[引用。

ps:orig.ident指cell identity,细胞身份,此时它的identity是指10X_NSCLC,但在细胞聚类后,则指细胞所属的cluster

  • 获取对象信息
GetAssayData(object = seurat, slot = 'data')[1:10, 1:10]
  • 获取seurat包含的method  列出seurat的所有函数
    utils::methods(class = 'Seurat')
    ls("package:Seurat")
    

    预处理1:过滤掉低质量细胞

  • 除去被破坏的受伤细胞,当细胞应急凋亡,会导致线粒体泄露和RNA降解,因此会有线粒体来源基因的富集,计算每个细胞中线粒体转录本的比例percent.mt,作小提琴图。
# The [[ operator can add columns to object metadata. This is a great place to stash QC

seurat[["percent.mt"]] <- PercentageFeatureSet(object = seurat, pattern = "^MT-")
VlnPlot(object = seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
  • 管家基因:细胞中管家基因的表达数量可以反映细胞常规过程的普遍活跃水平,风度高,稳定表达,对dropout不敏感。计算每个细胞中有多少管家基因表达。
# Load the the list of house keeping genes
hkgenes <- read.table(paste0(dirname, "housekeepers.txt"), skip = 2)
hkgenes <- as.vector(hkgenes$V1)

# remove hkgenes that were not found
hkgenes.found <- which(toupper(rownames(seurat@assays$RNA@data)) %in% hkgenes)

n.expressed.hkgenes <- Matrix::colSums(seurat@assays$RNA@data[hkgenes.found, ] > 0)
seurat <- AddMetaData(object = seurat, metadata = n.expressed.hkgenes, col.name = "n.exp.hkgenes")
VlnPlot(object = seurat, features = c("nFeature_RNA", "nCount_RNA", "percent.mt", "n.exp.hkgenes"), ncol = 2)
  • 过滤
    seurat <- SubsetData(object = seurat, subset.names = c("nFeature_RNA", "percent.mito","n.exp.hkgenes"), low.thresholds = c(350, -Inf,55), high.thresholds = c(5000, 0.1, Inf))
    
    seurat <- subset(seurat, nFeature_RNA < 5000)
    seurat <- subset(seurat, nFeature_RNA > 350)
    seuart <- subset(seurat, n.exp.hkgenes > 55)
    seuart <- subset(seurat, percent.mt < 10)

    预处理2:表达标准化

  • LogNormalize:用细胞中该基因的counts数除以细胞中的counts总数,再乘scale因子,默认10的4次方,再进行对数转换。可以稳定方差
    seurat <- NormalizeData(object = seurat, normalization.method = "LogNormalize", scale.factor = 1e4)

    差异基因的检测

  • 计算每个基因在细胞中的平均表达量和细胞间离散程度z-score
    seurat <- FindVariableFeatures(object = seurat, selection.method = "vst", nfeatures = 2000)
    
    # Identify the 10 most highly variable genes
    top10 <- head(VariableFeatures(seurat), 10)
    
    seurat <- FindVariableFeatures(object = seurat, selection.method = "vst", nfeatures = 2000, 
                                   mean.function = ExpMean, dispersion.function = LogVMR, 
                                   num.bin = 40, 
                                   mean.cutoff = c(0.0125, 1),
                                   dispersion.cutoff = c(0, 0.5))
    
    ## Check number of variable genes
    length(seurat@[email protected])

    selection.method    
    How to choose top variable features. Choose one of :

    vst: First, fits a line to the relationship of log(variance) and log(mean) using local polynomial regression (loess). Then standardizes the feature values using the observed mean and expected variance (given by the fitted line). Feature variance is then calculated on the standardized values after clipping to a maximum (see clip.max parameter).

    mean.var.plot (mvp): First, uses a function to calculate average expression (mean.function) and dispersion (dispersion.function) for each feature. Next, divides features into num.bin (deafult 20) bins based on their average expression, and calculates z-scores for dispersion within each bin. The purpose of this is to identify variable features while controlling for the strong relationship between variability and average expression.

    dispersion (disp): selects the genes with the highest dispersion values

关于selection的3种方法,下期再叙! 

你可能感兴趣的:(深度学习,人工智能)