网上发现一个single-cell sequencing analysis教程,出自莱顿大学计算生物学中心(Leiden Computational Biology Center),有代码、有数据、有注释,有讲解,真是入门单细胞的极佳材料。所谓“上士闻道,勤而习之”,那就赶紧学习吧!
源码链接: https://github.com/LeidenCBC/MGC-BioSB-SingleCellAnalysis2020
Quality Control
Created by: Ahmed Mahfouz
Overview
In this practical, we will walk through a pipeline to analyze single cell RNA-sequencing (scRNA-seq) data. Starting from a count matrix, we will cover the following steps of the analysis: 1. Quality control 2. Normalization 3. Feature selection
Datasets
For this tutorial we will use 3 different PBMC datasets from the 10x Genomics website (https://support.10xgenomics.com/single-cell-gene-expression/datasets).
- 1k PBMCs using 10x v2 chemistry
- 1k PBMCs using 10x v3 chemistry
- 1k PBMCs using 10x v3 chemistry in combination with cell surface proteins, but disregarding the protein data and only looking at gene expression.
The datasets are available in this repository.
Load required packages:
suppressMessages(require(Seurat))
suppressMessages(require(scater))
suppressMessages(require(scran))
suppressMessages(require(Matrix))
Read the data and create a Seurat object
Here, we use the function Read10X_h5 to read in the expression matrices in R.
v3.1k <- Read10X_h5("pbmc_1k_v3_filtered_feature_bc_matrix.h5", use.names = T)
v2.1k <- Read10X_h5("pbmc_1k_v2_filtered_feature_bc_matrix.h5", use.names = T)
p3.1k <- Read10X_h5("pbmc_1k_protein_v3_filtered_feature_bc_matrix.h5", use.names = T)
## Genome matrix has multiple modalities, returning a list of matrices for this genome
# select only gene expression data from the CITE-seq data.
p3.1k <- p3.1k$`Gene Expression`
First, create Seurat objects for each of the datasets, and then merge into one large seurat object.
sdata.v2.1k <- CreateSeuratObject(v2.1k, project = "v2.1k")
sdata.v3.1k <- CreateSeuratObject(v3.1k, project = "v3.1k")
sdata.p3.1k <- CreateSeuratObject(p3.1k, project = "p3.1k")
# merge into one single seurat object. Add cell ids just in case you have overlapping barcodes between the datasets.
alldata <- merge(sdata.v2.1k, c(sdata.v3.1k,sdata.p3.1k), add.cell.ids=c("v2.1k","v3.1k","p3.1k"))
# also add in a metadata column that indicates v2 vs v3 chemistry
chemistry <- rep("v3",ncol(alldata))
chemistry[Idents(alldata) == "v2.1k"] <- "v2"
alldata <- AddMetaData(alldata, chemistry, col.name = "Chemistry")
alldata
## An object of class Seurat
## 33538 features across 2931 samples within 1 assay
## Active assay: RNA (33538 features, 0 variable features)
Check number of cells from each sample, is stored in the orig.ident slot of metadata and is autmatically set as active ident.
table(Idents(alldata))
##
## p3.1k v2.1k v3.1k
## 713 996 1222
1. Quality control
Seurat automatically calculates some QC-stats, like number of UMIs and features per cell. Stored in columns nCount_RNA & nFeature_RNA of the metadata.
head([email protected])
## orig.ident nCount_RNA nFeature_RNA Chemistry
## v2.1k_AAACCTGAGCGCTCCA-1 v2.1k 6631 2029 v2
## v2.1k_AAACCTGGTGATAAAC-1 v2.1k 2196 881 v2
## v2.1k_AAACGGGGTTTGTGTG-1 v2.1k 2700 791 v2
## v2.1k_AAAGATGAGTACTTGC-1 v2.1k 3551 1183 v2
## v2.1k_AAAGCAAGTCTCTTAT-1 v2.1k 3080 1333 v2
## v2.1k_AAAGCAATCCACGAAT-1 v2.1k 5769 1556 v2
Calculate mitochondrial proportion
We will manually calculate the proportion of mitochondrial reads and add to the metadata table.
percent.mito <- PercentageFeatureSet(alldata, pattern = "^MT-")
alldata <- AddMetaData(alldata, percent.mito, col.name = "percent.mito")
Calculate ribosomal proportion
In the same manner we will calculate the proportion gene expression that comes from ribosomal proteins.
percent.ribo <- PercentageFeatureSet(alldata, pattern = "^RP[SL]")
alldata <- AddMetaData(alldata, percent.ribo, col.name = "percent.ribo")
Now have another look at the metadata table
head([email protected])
## orig.ident nCount_RNA nFeature_RNA Chemistry
## v2.1k_AAACCTGAGCGCTCCA-1 v2.1k 6631 2029 v2
## v2.1k_AAACCTGGTGATAAAC-1 v2.1k 2196 881 v2
## v2.1k_AAACGGGGTTTGTGTG-1 v2.1k 2700 791 v2
## v2.1k_AAAGATGAGTACTTGC-1 v2.1k 3551 1183 v2
## v2.1k_AAAGCAAGTCTCTTAT-1 v2.1k 3080 1333 v2
## v2.1k_AAAGCAATCCACGAAT-1 v2.1k 5769 1556 v2
## percent.mito percent.ribo
## v2.1k_AAACCTGAGCGCTCCA-1 5.172674 25.84829
## v2.1k_AAACCTGGTGATAAAC-1 4.143898 20.81056
## v2.1k_AAACGGGGTTTGTGTG-1 3.296296 51.55556
## v2.1k_AAAGATGAGTACTTGC-1 5.885666 29.25936
## v2.1k_AAAGCAAGTCTCTTAT-1 2.987013 17.53247
## v2.1k_AAAGCAATCCACGAAT-1 2.010747 45.69249
Plot QC
Now we can plot some of the QC-features as violin plots
VlnPlot(alldata, features = "nFeature_RNA", pt.size = 0.1) + NoLegend()
VlnPlot(alldata, features = "nCount_RNA", pt.size = 0.1) + NoLegend()
VlnPlot(alldata, features = "percent.mito", pt.size = 0.1) + NoLegend()
VlnPlot(alldata, features = "percent.ribo", pt.size = 0.1) + NoLegend()
As you can see, the v2 chemistry gives lower gene detection, but higher detection of ribosomal proteins. As the ribosomal proteins are highly expressed they will make up a larger proportion of the transcriptional landscape when fewer of the lowly expressed genes are detected.
We can also plot the different QC-measures as scatter plots.
FeatureScatter(alldata, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
FeatureScatter(alldata, feature1 = "nFeature_RNA", feature2 = "percent.mito")
FeatureScatter(alldata, feature1="percent.ribo", feature2="nFeature_RNA")
We can also subset the data to only plot one sample.
FeatureScatter(alldata, feature1 = "nCount_RNA", feature2 = "nFeature_RNA",
cells = WhichCells(alldata, expression = orig.ident == "v3.1k") )
Filtering
Mitochondrial filtering
We have quite a lot of cells with high proportion of mitochondrial reads. It could be wise to remove those cells, if we have enough cells left after filtering. Another option would be to either remove all mitochondrial reads from the dataset and hope that the remaining genes still have enough biological signal. A third option would be to just regress out the percent.mito variable during scaling.
In this case we have as much as 99.7% mitochondrial reads in some of the cells, so it is quite unlikely that there is much celltype signature left in those.
Looking at the plots, make resonable decisions on where to draw the cutoff. In this case, the bulk of the cells are below 25% mitochondrial reads and that will be used as a cutoff.
#select cells with percent.mito < 25
idx <- which(alldata$percent.mito < 25)
selected <- WhichCells(alldata, cells = idx)
length(selected)
## [1] 2703
# and subset the object to only keep those cells
data.filt <- subset(alldata, cells = selected)
# plot violins for new data
VlnPlot(data.filt, features = "percent.mito")
As you can see, there is still quite a lot of variation in percent mito, so it will have to be dealt with in the data analysis step.
Gene detection filtering
Extremely high number of detected genes could indicate doublets. However, depending on the celltype composition in your sample, you may have cells with higher number of genes (and also higher counts) from one celltype.
In these datasets, there is also a clear difference between the v2 vs v3 10x chemistry with regards to gene detection, so it may not be fair to apply the same cutoffs to all of them.
Also, in the protein assay data there is a lot of cells with few detected genes giving a bimodal distribution. This type of distribution is not seen in the other 2 datasets. Considering that they are all pbmc datasets it makes sense to regard this distribution as low quality libraries.
Filter the cells with high gene detection (putative doublets) with cutoffs 4100 for v3 chemistry and 2000 for v2.
#start with cells with many genes detected.
high.det.v3 <- WhichCells(data.filt, expression = nFeature_RNA > 4100)
high.det.v2 <- WhichCells(data.filt, expression = nFeature_RNA > 2000 & orig.ident == "v2.1k")
# remove these cells
data.filt <- subset(data.filt, cells=setdiff(WhichCells(data.filt),c(high.det.v2,high.det.v3)))
# check number of cells
ncol(data.filt)
## [1] 2631
Filter the cells with low gene detection (low quality libraries) with less than 1000 genes for v2 and < 500 for v2.
#start with cells with many genes detected.
low.det.v3 <- WhichCells(data.filt, expression = nFeature_RNA < 1000 & orig.ident != "v2.1k")
low.det.v2 <- WhichCells(data.filt, expression = nFeature_RNA < 500 & orig.ident == "v2.1k")
# remove these cells
data.filt <- subset(data.filt, cells=setdiff(WhichCells(data.filt),c(low.det.v2,low.det.v3)))
# check number of cells
ncol(data.filt)
## [1] 2531
Plot QC-stats again
Lets plot the same qc-stats another time.
VlnPlot(data.filt, features = "nFeature_RNA", pt.size = 0.1) + NoLegend()
VlnPlot(data.filt, features = "nCount_RNA", pt.size = 0.1) + NoLegend()
VlnPlot(data.filt, features = "percent.mito", pt.size = 0.1) + NoLegend()
VlnPlot(data.filt, features = "percent.ribo", pt.size = 0.1) + NoLegend()
# and check the number of cells per sample before and after filtering
table(Idents(alldata))
##
## p3.1k v2.1k v3.1k
## 713 996 1222
table(Idents(data.filt))
##
## p3.1k v2.1k v3.1k
## 526 933 1072
Calculate cell-cycle scores
Seurat has a function for calculating cell cycle scores based on a list of know S-phase and G2/M-phase genes.
data.filt <- CellCycleScoring(
object = data.filt,
g2m.features = cc.genes$g2m.genes,
s.features = cc.genes$s.genes
)
## Warning: The following features are not present in the object: MLF1IP, not
## searching for symbol synonyms
## Warning: The following features are not present in the object: FAM64A, HN1, not
## searching for symbol synonyms
VlnPlot(data.filt, features = c("S.Score","G2M.Score"))
In this case it looks like we only have a few cycling cells in the datasets.