Analysis of single cell RNA-seq data: study notes (part 4)

Next, let's look at how to analyze single-cell data.
The example dataset used here comes from Tung et al. (2017).

Quality control

First, read in the data.

# Load packages
library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
# Read the data
molecules <- read.table("tung/molecules.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
molecules

This table gives the expression of every gene in every cell (one column per cell ID).


anno

This table holds the batch, the cell ID and other annotations.
Next we build the single-cell object:

umi <- SingleCellExperiment(
  assays = list(counts = as.matrix(molecules)),
  colData = anno
)
# Remove genes that are not expressed in any cell
keep_feature <- rowSums(counts(umi) > 0) > 0
umi <- umi[keep_feature, ]
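To make the filter concrete, here is a minimal base R sketch on a made-up 4 x 3 count matrix. The names `toy_counts`, `keep_toy` and `filtered_toy` are illustrative only and are not part of the Tung dataset.

```r
# Toy count matrix: 4 genes x 3 cells (made-up values, not real data)
toy_counts <- matrix(
  c(0, 0, 0,
    5, 0, 2,
    0, 1, 0,
    3, 4, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("cell1", "cell2", "cell3"))
)

# toy_counts > 0 marks detected entries; rowSums counts, for each gene,
# the cells in which it was detected
keep_toy <- rowSums(toy_counts > 0) > 0

# Keep only genes detected in at least one cell
filtered_toy <- toy_counts[keep_toy, , drop = FALSE]
print(rownames(filtered_toy))
```

Only geneA, which is zero in every cell, is dropped.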

Next, define the spike-in controls (ERCC) and the mitochondrial genes:

isSpike(umi, "ERCC") <- grepl("^ERCC-", rownames(umi)) 
isSpike(umi, "MT") <- rownames(umi) %in% c("ENSG00000198899", "ENSG00000198727", "ENSG00000198888", "ENSG00000198886", "ENSG00000212907", "ENSG00000198786", "ENSG00000198695", "ENSG00000198712", "ENSG00000198804", "ENSG00000198763", "ENSG00000228253", "ENSG00000198938", "ENSG00000198840")

Then compute the QC metrics:

umi <- calculateQCMetrics( umi, feature_controls = list( ERCC = isSpike(umi, "ERCC"), MT = isSpike(umi, "MT") ) )
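As a rough illustration of what the per-cell metrics used below mean, the following base R sketch computes library size, number of detected genes and the ERCC percentage by hand on a made-up matrix. It is a simplified approximation, not scater's implementation, and the gene names and values are invented.

```r
# Toy counts: 2 endogenous genes and 2 spike-ins x 3 cells (made-up values)
toy_counts <- matrix(
  c(10, 0,  5,
     0, 3,  5,
     5, 2,  0,
     5, 5, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("ENSG_toy1", "ENSG_toy2", "ERCC-toy1", "ERCC-toy2"),
                  c("cell1", "cell2", "cell3"))
)

# Flag spike-ins by name, as done for the real data above
is_ercc <- grepl("^ERCC-", rownames(toy_counts))

# Library size: total counts per cell
total_counts <- colSums(toy_counts)

# Detected features: number of rows with a non-zero count per cell
total_features <- colSums(toy_counts > 0)

# Percentage of each cell's counts that come from the spike-ins
pct_counts_ercc <- 100 * colSums(toy_counts[is_ercc, , drop = FALSE]) / total_counts

print(rbind(total_counts, total_features, pct_counts_ercc))
```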

Cell QC

1. Library size

hist( umi$total_counts, breaks = 100 ) 
abline(v = 25000, col = "red")

This plot shows the distribution of the total RNA counts detected per cell. We plan to remove cells with fewer than 25000 counts.

2. Detected genes

We count the unique genes detected in each sample, that is, how many distinct genes were detected (no duplicate gene names).

hist( umi$total_features, breaks = 100 ) 
abline(v = 7000, col = "red")

The plot shows that most cells have roughly 7000-10000 detected genes, so we consider removing cells with fewer than 7000.
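Before committing to a cutoff it can help to check how many cells it would remove. A minimal base R sketch with invented per-cell values (`toy_total_features` and `feature_cutoff` are illustrative names; the 7000 comes from the text above):

```r
# Made-up numbers of detected genes for five cells
toy_total_features <- c(cellA = 4200, cellB = 7600, cellC = 9100,
                        cellD = 6900, cellE = 8800)

# Cutoff taken from the histogram discussion above
feature_cutoff <- 7000

# Cells that would be removed by the cutoff
below_cutoff <- toy_total_features < feature_cutoff

n_removed <- sum(below_cutoff)       # number of cells below the cutoff
frac_removed <- mean(below_cutoff)   # fraction of cells below the cutoff
removed_cells <- names(toy_total_features)[below_cutoff]

print(removed_cells)
```

The same pattern applies to the library-size cutoff by swapping in the total counts per cell.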

3. ERCCs and MTs

The ratio of these two is another QC measure: ERCCs/MTs, i.e. spike-in expression relative to mitochondrial gene expression.

plotPhenoData( umi, aes_string( x = "total_features", y = "pct_counts_MT", colour = "batch" ) ) 

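In this plot the x-axis is the number of unique genes detected in each cell, and the y-axis is the percentage of each cell's counts that come from mitochondrial genes (pct_counts_MT).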

plotPhenoData( umi, aes_string( x = "total_features", y = "pct_counts_ERCC", colour = "batch" ) ) 

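In this plot the x-axis is again the number of unique genes detected in each cell, and the y-axis is the percentage of each cell's counts that come from the ERCC spike-ins (pct_counts_ERCC).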

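For each cell the number of unique detected genes is fixed, so the two plots can be combined to compute ERCCs/MTs. A high value suggests that too little of the cell's own RNA was captured, so the cell may need to be removed: low mitochondrial expression together with high spike-in expression means that most of the captured RNA is spike-in RNA rather than RNA from the cell.
In the plot, cells from batch NA19098.r2 generally have low RNA content.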
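The ratio described above could be computed per cell along these lines. This is only a sketch of the reading given in these notes, on invented percentages; scater does not report such a ratio itself, and `toy_pct` and `ercc_mt_ratio` are hypothetical names.

```r
# Made-up per-cell percentages (not taken from the Tung data)
toy_pct <- data.frame(
  cell = c("c1", "c2", "c3"),
  pct_counts_ERCC = c(2, 30, 5),
  pct_counts_MT = c(4, 3, 10)
)

# Spike-in percentage relative to mitochondrial percentage
toy_pct$ercc_mt_ratio <- toy_pct$pct_counts_ERCC / toy_pct$pct_counts_MT

# The cell with the highest ratio is the first candidate for removal
suspect_cell <- toy_pct$cell[which.max(toy_pct$ercc_mt_ratio)]

print(toy_pct)
print(suspect_cell)
```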

Cells that fail the criteria above can then be removed.

A convenient way to do this is to create filter_by_expr_features, filter_by_total_counts, filter_by_ERCC and filter_by_MT, one logical (TRUE/FALSE) vector per criterion, and then keep the cells for which all four are TRUE:

# QC metrics
umi <- calculateQCMetrics(
  umi,
  feature_controls = list(
    ERCC = isSpike(umi, "ERCC"),
    MT = isSpike(umi, "MT")
  )
)
# filter_by_total_counts (cutoff from the library-size histogram)
filter_by_total_counts <- (umi$total_counts > 25000)
table(filter_by_total_counts)
# filter_by_expr_features
filter_by_expr_features <- (umi$total_features > 7000)
table(filter_by_expr_features)
# The percentage cutoffs are missing from these notes: max_pct_ERCC and
# max_pct_MT are placeholder names, to be set from the two scatter plots above
# filter_by_ERCC
filter_by_ERCC <- umi$batch != "NA19098.r2" & umi$pct_counts_ERCC < max_pct_ERCC
# filter_by_MT
filter_by_MT <- umi$pct_counts_MT < max_pct_MT

# Combine the four conditions
umi$use <- (
  # sufficient features (genes)
  filter_by_expr_features &
  # sufficient molecules counted
  filter_by_total_counts &
  # sufficient endogenous RNA
  filter_by_ERCC &
  # remove cells with unusual number of reads in MT genes
  filter_by_MT
)

Here filter_by_MT concerns the mitochondrial genes: cells with abnormal mitochondrial gene expression should be removed.
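The way the four logical vectors combine can be sketched in base R on an invented table of per-cell metrics. The percentage cutoff of 10 in `toy_max_pct` is a hypothetical value chosen only for this toy data, and every `toy_*` and `f_*` name is illustrative.

```r
# Made-up per-cell QC metrics for four cells
toy_cells <- data.frame(
  batch = c("NA19098.r1", "NA19098.r2", "NA19101.r1", "NA19101.r1"),
  total_counts = c(30000, 40000, 20000, 35000),
  total_features = c(8000, 9000, 7500, 8500),
  pct_counts_ERCC = c(3, 4, 2, 20),
  pct_counts_MT = c(5, 6, 4, 5)
)

# Hypothetical percentage cutoff, for this illustration only
toy_max_pct <- 10

# One logical vector per criterion
f_counts <- toy_cells$total_counts > 25000
f_features <- toy_cells$total_features > 7000
f_ercc <- toy_cells$batch != "NA19098.r2" & toy_cells$pct_counts_ERCC < toy_max_pct
f_mt <- toy_cells$pct_counts_MT < toy_max_pct

# A cell is kept only if it passes every criterion
toy_cells$use <- f_counts & f_features & f_ercc & f_mt

print(table(toy_cells$use))
```

In this toy table only the first cell passes: the second is in the excluded batch, the third has too few counts, and the fourth has too high a spike-in percentage.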

Gene filtering

We keep genes that are detected with more than 1 count in at least 2 cells.

filter_genes <- apply(
  counts(umi[, colData(umi)$use]),
  1,
  function(x) length(x[x > 1]) >= 2
)

rowData(umi)$use <- filter_genes

table(filter_genes)
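The row-wise `apply` call can be checked on a small made-up matrix, alongside a vectorised form that should give the same result. Both variable names below are illustrative.

```r
# Toy count matrix: 4 genes x 4 cells (made-up values)
toy_counts <- matrix(
  c(0, 2, 3, 0,
    1, 1, 1, 1,
    5, 0, 0, 0,
    2, 2, 0, 7),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:4))
)

# Same rule as above: more than 1 count in at least 2 cells
by_apply <- apply(toy_counts, 1, function(x) length(x[x > 1]) >= 2)

# Vectorised form: count cells with > 1 count per gene, then compare with 2
by_rowsums <- rowSums(toy_counts > 1) >= 2

print(by_apply)
```

gene2 is present in every cell but never exceeds 1 count, and gene3 is high in only one cell, so both are dropped.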

Data visualization

1. Dimensionality reduction

Let's start with PCA.
Before QC filtering

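# endog_genes (the endogenous, non-control genes) and the logcounts_raw assay
# are assumed to have been defined beforehand; they are not created in these notes
plotPCA(
  umi[endog_genes, ],
  exprs_values = "logcounts_raw",
  colour_by = "batch",
  size_by = "total_features",
  shape_by = "individual"
)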


After QC filtering

plotPCA( 
  umi.qc[endog_genes, ], 
  exprs_values = "logcounts_raw", 
  colour_by = "batch", 
  size_by = "total_features", 
  shape_by = "individual" ) 
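For intuition about what the PCA plots show, here is a base R sketch using `prcomp` on log-transformed toy counts. It is not the exact procedure `plotPCA` follows (scater applies its own feature selection and scaling), and the matrix is invented.

```r
# Toy count matrix: 5 genes x 4 cells (made-up values)
toy_counts <- matrix(
  c( 0, 10,  3, 40,
    12,  0,  8,  1,
     5,  5,  5,  6,
    30,  2,  0, 15,
     1,  9, 20,  0),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)

# Log-transform with a pseudocount of 1
log_toy <- log2(toy_counts + 1)

# prcomp expects observations in rows, so transpose to cells x genes
pca_toy <- prcomp(t(log_toy), center = TRUE, scale. = FALSE)

# Coordinates of each cell on the first two principal components
pc_scores <- pca_toy$x[, 1:2]

# Fraction of the total variance carried by each component
var_explained <- pca_toy$sdev^2 / sum(pca_toy$sdev^2)

print(pc_scores)
```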

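In single-cell data, however, linear dimensionality reduction (PCA) is usually not the method of choice; t-SNE, which is non-linear and based on a probabilistic model, is used instead.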

Now let's look at t-SNE.
Before QC filtering

plotTSNE( umi[endog_genes, ], 
  exprs_values = "logcounts_raw", 
  perplexity = 130, colour_by = "batch", 
  size_by = "total_features", 
  shape_by = "individual",
  rand_seed = 123456
)


After QC filtering

plotTSNE( umi.qc[endog_genes, ], 
  exprs_values = "logcounts_raw", 
  perplexity = 130, 
  colour_by = "batch", 
  size_by = "total_features", 
  shape_by = "individual", 
  rand_seed = 123456 ) 
