Next, let's look at how to analyze single-cell data.
Our example data set comes from Tung et al. (2017).
Quality control
First, read in the data.
# Load packages
library(SingleCellExperiment)
library(scater)
options(stringsAsFactors = FALSE)
# Read the data
molecules <- read.table("tung/molecules.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
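The molecules table gives the expression (molecule count) of every gene in every cell, with one column per cell ID.
The annotation table contains the batch, the cell ID, and other per-cell information.
Next we build the single-cell object: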
umi <- SingleCellExperiment(
    assays = list(counts = as.matrix(molecules)),
    colData = anno
)
# Remove genes that are not expressed in any cell
keep_feature <- rowSums(counts(umi) > 0) > 0
umi <- umi[keep_feature, ]
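The filter above keeps a gene if at least one cell has a nonzero count for it. A minimal sketch of the same logic on a plain matrix, assuming a made-up table of 3 genes by 4 cells (`toy_counts` and its values are hypothetical, not taken from the Tung data):

```r
# Toy count matrix: rows are genes, columns are cells (hypothetical values)
toy_counts <- matrix(
  c(0, 0, 0, 0,   # geneA: never detected
    3, 0, 1, 0,   # geneB: detected in two cells
    0, 0, 0, 5),  # geneC: detected in one cell
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:4))
)

# Number of cells in which each gene has a nonzero count
cells_detected <- rowSums(toy_counts > 0)

# Keep genes detected in at least one cell
keep_toy <- cells_detected > 0
toy_filtered <- toy_counts[keep_toy, , drop = FALSE]
print(rownames(toy_filtered))
```

Only geneA is dropped here, because it has no counts in any cell.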
Next, define the spike-in control (ERCC) genes and the mitochondrial (MT) genes:
isSpike(umi, "ERCC") <- grepl("^ERCC-", rownames(umi))
isSpike(umi, "MT") <- rownames(umi) %in% c(
    "ENSG00000198899", "ENSG00000198727", "ENSG00000198888",
    "ENSG00000198886", "ENSG00000212907", "ENSG00000198786",
    "ENSG00000198695", "ENSG00000198712", "ENSG00000198804",
    "ENSG00000198763", "ENSG00000228253", "ENSG00000198938",
    "ENSG00000198840"
)
Then compute the QC metrics:
umi <- calculateQCMetrics(
    umi,
    feature_controls = list(ERCC = isSpike(umi, "ERCC"), MT = isSpike(umi, "MT"))
)
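To make the QC columns used below less opaque, here is a rough sketch of what total_counts, total_features, pct_counts_ERCC and pct_counts_MT represent, computed by hand in base R. This is an illustration on a hypothetical toy matrix, not scater's implementation; it assumes total_features counts every row with a nonzero count, controls included.

```r
# Toy counts: two endogenous genes, one ERCC spike-in, one MT gene (hypothetical)
toy_counts <- matrix(
  c(10, 0,  5,
     0, 4,  5,
     5, 4,  0,
     5, 2, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("gene1", "gene2", "ERCC-00002", "MT-gene"),
                  c("cell1", "cell2", "cell3"))
)

# Flag the control features (the "MT-gene" name is made up for this sketch)
is_ercc <- grepl("^ERCC-", rownames(toy_counts))
is_mt <- rownames(toy_counts) == "MT-gene"

# Library size: total molecules counted per cell
total_counts <- colSums(toy_counts)

# Number of features with at least one count per cell
total_features <- colSums(toy_counts > 0)

# Percentage of each cell's counts that fall in each control set
pct_counts_ERCC <- 100 * colSums(toy_counts[is_ercc, , drop = FALSE]) / total_counts
pct_counts_MT <- 100 * colSums(toy_counts[is_mt, , drop = FALSE]) / total_counts

print(data.frame(total_counts, total_features, pct_counts_ERCC, pct_counts_MT))
```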
Cell QC
1. Library size
hist( umi$total_counts, breaks = 100 )
abline(v = 25000, col = "red")
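This histogram shows the distribution of the total count of RNA molecules detected in each cell; we plan to remove cells with fewer than 25,000.
2. Detected genes
For each sample (cell) we count the number of distinct genes detected, that is, how many genes have at least one count (each gene is counted once, with no duplicate gene names).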
hist( umi$total_features, breaks = 100 )
abline(v = 7000, col = "red")
The plot shows that most cells have roughly 7,000 to 10,000 detected genes, so we consider removing cells with fewer than 7,000.
3. ERCCs and MTs
The ratio of the two, ERCCs/MTs (spike-in expression relative to mitochondrial gene expression), is another way to assess quality.
plotPhenoData( umi, aes_string( x = "total_features", y = "pct_counts_MT", colour = "batch" ) )
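In this plot the x-axis is the number of distinct genes detected in each cell, and the y-axis is the percentage of each cell's counts that come from mitochondrial genes.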
plotPhenoData( umi, aes_string( x = "total_features", y = "pct_counts_ERCC", colour = "batch" ) )
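In this plot the x-axis is the number of distinct genes detected in each cell, and the y-axis is the percentage of each cell's counts that come from the ERCC spike-ins.
For any given cell the number of detected genes is fixed, so the two plots together give ERCCs/MTs. A high value suggests that too little RNA was captured from that cell, and the cell may need to be removed (low mitochondrial expression combined with high spike-in expression means that most of the captured molecules are spike-in RNA rather than RNA from the cell itself).
In the plots, cells from batch NA19098.r2 generally have low RNA content.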
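The ERCCs/MTs ratio described above can be computed directly from the two percentage columns. A minimal sketch, assuming a small made-up table (the `qc` data frame and its values are hypothetical; with real data the two columns would come from umi$pct_counts_ERCC and umi$pct_counts_MT):

```r
# Hypothetical per-cell QC percentages
qc <- data.frame(
  cell = c("cell1", "cell2", "cell3"),
  pct_counts_ERCC = c(2, 30, 5),
  pct_counts_MT = c(4, 3, 5)
)

# Spike-in share relative to mitochondrial share
qc$ercc_mt_ratio <- qc$pct_counts_ERCC / qc$pct_counts_MT

# Rank cells from highest to lowest ratio; the top ones are removal candidates
qc_ranked <- qc[order(-qc$ercc_mt_ratio), ]
print(qc_ranked)
```

Here cell2 stands out: most of its captured molecules are spike-ins, which is the pattern the text associates with too little endogenous RNA.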
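We can now remove the cells that fail the criteria above.
A convenient way to do this is to build four logical filters, filter_by_expr_features, filter_by_total_counts, filter_by_ERCC and filter_by_MT, one per criterion (TRUE or FALSE), and then keep only the cells for which all four are TRUE: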
# QC metrics (same call as above, on the umi object)
umi <- calculateQCMetrics(
    umi,
    feature_controls = list(ERCC = isSpike(umi, "ERCC"), MT = isSpike(umi, "MT"))
)
# filter_by_total_counts (cutoff from the library-size histogram)
filter_by_total_counts <- (umi$total_counts > 25000)
table(filter_by_total_counts)
#filter_by_expr_features
filter_by_expr_features <- (umi$total_features > 7000)
table(filter_by_expr_features)
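# filter_by_ERCC
# NOTE: the cutoff is missing in the source; 10 (%) is an assumed placeholder, choose it from the plot
filter_by_ERCC <- umi$batch != "NA19098.r2" & umi$pct_counts_ERCC < 10
# filter_by_MT
# NOTE: the cutoff is missing in the source; 10 (%) is an assumed placeholder, choose it from the plot
filter_by_MT <- umi$pct_counts_MT < 10
# Combine the four filters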
umi$use <- (
    # sufficient features (genes)
    filter_by_expr_features &
    # sufficient molecules counted
    filter_by_total_counts &
    # sufficient endogenous RNA
    filter_by_ERCC &
    # remove cells with unusual number of reads in MT genes
    filter_by_MT
)
Here filter_by_MT refers to the mitochondrial genes: cells with abnormal mitochondrial gene expression should be removed.
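Before dropping cells it can help to see how many each criterion removes on its own. A small sketch of combining logical filters and tallying failures, using a hypothetical `filters` table of five cells (the values are made up; with real data each column would be one of the filter_by_* vectors):

```r
# Hypothetical pass/fail flags for five cells, one column per criterion
filters <- data.frame(
  expr_features = c(TRUE, TRUE,  FALSE, TRUE,  TRUE),
  total_counts  = c(TRUE, FALSE, FALSE, TRUE,  TRUE),
  ERCC          = c(TRUE, TRUE,  TRUE,  FALSE, TRUE),
  MT            = c(TRUE, TRUE,  TRUE,  TRUE,  TRUE)
)

# A cell is kept only if it passes every filter (element-wise AND)
use <- Reduce("&", filters)

# Number of cells failing each individual filter
failed_per_filter <- colSums(!filters)

print(failed_per_filter)
print(table(use))
```

A filter that fails far more cells than the others is worth a second look at its threshold.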
Gene filtering
We keep the genes that are detected with more than 1 count in at least 2 cells.
filter_genes <- apply(
    counts(umi[, colData(umi)$use]),
    1,
    function(x) length(x[x > 1]) >= 2
)
rowData(umi)$use <- filter_genes
table(filter_genes)
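The apply() call above counts, for each gene, the cells with more than 1 count and keeps the gene if there are at least 2 such cells. The same rule can be written with rowSums; the sketch below checks the two forms against each other on a hypothetical toy matrix (an equivalent formulation for illustration, not a change to the code above):

```r
# Toy count matrix: rows are genes, columns are cells (hypothetical values)
toy_counts <- matrix(
  c(0, 1, 1, 1,   # geneA: never more than 1 count, so it fails
    2, 0, 3, 0,   # geneB: more than 1 count in two cells, so it passes
    5, 0, 0, 0),  # geneC: more than 1 count in only one cell, so it fails
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), paste0("cell", 1:4))
)

# Row-by-row version, as in the text
by_apply <- apply(toy_counts, 1, function(x) length(x[x > 1]) >= 2)

# Vectorised version: count cells with > 1 count, then compare with 2
by_rowsums <- rowSums(toy_counts > 1) >= 2

print(by_apply)
print(all(by_apply == by_rowsums))
```

On a large count matrix the rowSums form is usually faster, since it avoids calling an R function once per gene.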
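Data visualization
1. Dimensionality reduction
Let's look at PCA first.
Before QC filtering
# umi.qc (the QC-filtered object), endog_genes and the "logcounts_raw" assay are assumed to have been created beforehand
plotPCA(
umi[endog_genes, ],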
exprs_values = "logcounts_raw",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual" )
After QC filtering
plotPCA(
umi.qc[endog_genes, ],
exprs_values = "logcounts_raw",
colour_by = "batch",
size_by = "total_features",
shape_by = "individual" )
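To show what a PCA plot of this kind summarizes, here is a sketch using base R's prcomp on simulated, log-transformed counts. prcomp is used as a stand-in; it is not claimed to match what plotPCA does internally (gene selection and scaling defaults may differ), and all names and values below are hypothetical.

```r
set.seed(1)

# Simulated counts: 50 genes x 20 cells, two groups of 10 cells (hypothetical)
n_genes <- 50
group <- rep(c("A", "B"), each = 10)
base_mean <- ifelse(group == "A", 5, 20)
sim_counts <- sapply(base_mean, function(m) rpois(n_genes, lambda = m))
rownames(sim_counts) <- paste0("gene", seq_len(n_genes))
colnames(sim_counts) <- paste0("cell", seq_along(group))

# Log-transform, analogous to a log2(counts + 1) assay
log_expr <- log2(sim_counts + 1)

# prcomp expects observations (cells) in rows, so transpose
pca <- prcomp(t(log_expr), center = TRUE, scale. = FALSE)

# Fraction of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Coordinates of each cell on the first two components
pc_scores <- pca$x[, 1:2]
print(round(var_explained[1:2], 3))
print(head(pc_scores))
```

Plotting pc_scores coloured by group gives the same kind of picture as the plots above, with the simulated group difference dominating PC1.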
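For visualizing single-cell data, however, a nonlinear method such as t-SNE, which is based on a probabilistic model, is often preferred over linear dimensionality reduction (PCA).
Let's look at t-SNE.
Before QC filtering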
plotTSNE( umi[endog_genes, ],
exprs_values = "logcounts_raw",
perplexity = 130, colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
rand_seed = 123456
)
After QC filtering
plotTSNE( umi.qc[endog_genes, ],
exprs_values = "logcounts_raw",
perplexity = 130,
colour_by = "batch",
size_by = "total_features",
shape_by = "individual",
rand_seed = 123456 )