【scRW】[4]Single-cell RNA-seq: Quality control set-up

Learning Objectives:

Understand how to bring in data from single-cell RNA-seq experiments

Quality control set-up

1. Exploring the example dataset

在本次研讨会上，我们将使用单细胞RNA-seq数据集，该数据集是Kang等人（2017年）进行的一项较大研究的一部分。在本文中，作者提出了一种利用遗传变异（eQTL）来确定基因的计算算法。每个包含单个细胞的液滴的遗传同一性（单一），并鉴定包含来自不同个体的两个细胞的液滴（双重）。
用于测试其算法的数据包括从八名狼疮患者中提取的合并的外周血单个核细胞（PBMC），分为对照和干扰素β治疗（刺激）条件。

[图片上传失败...(image-e2393b-1590137860500)]*
](https://upload-images.jianshu.io/upload_images/11904209-6ba372a42250af69.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

2. Raw data

This dataset is available on GEO (GSE96583), however the available counts matrix lacked mitochondrial reads, so we downloaded the BAM files from the SRA (SRP102802). These BAM files were converted back to FASTQ files, then run through Cell Ranger to obtain the count data that we will be using.

NOTE: The counts for this dataset is also freely available from 10X Genomics and is used as part of the Seurat tutorial.

3. Metadata

除了原始数据，我们还需要收集有关数据的信息；这就是所谓的 metadata元数据 。通常有一种 temptation ，就是只是开始探索数据，但如果我们不知道该数据所源自的样本，那并不是很有意义。

以下提供了我们数据集的一些相关metadata元数据：

The libraries were prepared using 10X Genomics version 2 chemistry;
The samples were sequenced on the Illumina NextSeq 500;
将来自八名狼疮患者的PBMC样本分别分成两等份;
= 用100 U / mL重组IFN-β激活一份PBMC，持续6小时。
= The second aliquot分试样 was left untreated.
= 6小时后，将每种条件的8个样品一起收集到两个最终的收集池中（受激细胞和对照细胞）。我们将使用这两个合并的样本。
分别鉴定了12138和12167个细胞（去除双峰后）作为对照样品和刺激后的合并样品。
由于样本是PBMC，因此我们期望有免疫细胞，例如：
= B cells
= T cells
= NK cells
= monocytes
= macrophages
= possibly megakaryocytes

建议您在执行QC之前最好知道预期的细胞类型。因为这样当任何细胞类型的低复杂度（只有部分基因的转录本）或线粒体表达水平较高的细胞可以帮助我们提出剔除QC不合格的细胞群体。

预期上述细胞类型都不应该具有低复杂度或具有高线粒体含量。

4. Setting up the R environment

研究涉及大量数据的最重要部分之一就是如何最好地对其进行管理。我们倾向于对分析进行优先级划分，但是数据管理中还有许多其他重要方面经常被人们忽略，因此他们对新数据没有先见之明。 HMS Data Management Working Group
深入讨论了除数据创建和分析之外要考虑的一些事项。

数据管理的一个重要方面是组织。对于您进行和分析数据的每个实验，通过 a planned storage space (directory structure) 来进行组织是最佳实践。我们将对单细胞分析进行此操作。

打开 RStudio 并创建一个名为single_cell_rnaseq的新R项目。然后，创建以下目录：

single_cell_rnaseq/
├── data
├── results
└── figures

5. Download data & New script

Control sample
Stimulated sample

# February 2020
# HBC single-cell RNA-seq workshop

# Single-cell RNA-seq analysis - QC

将Rscript保存为quality_control.R。您的工作目录应如下所示：

New script

6.Loading libraries

Now, we can load the necessary libraries:

# Load libraries
library(SingleCellExperiment)
library(Seurat)
library(tidyverse)
library(Matrix)
library(scales)
library(cowplot)
library(RCurl)

7.Loading single-cell RNA-seq count data

Regardless of the technology or pipeline used to process your single-cell RNA-seq sequence data, the output will generally be the same. That is, for each individual sample you will have the following three files:

1.a file with the cell IDs, representing all cells quantified
2.a file with the gene IDs, representing all genes quantified
3.a matrix of counts per gene for every cell

We can explore these files by clicking on the data/ctrl_raw_feature_bc_matrix folder:

7.1 `barcodes.tsv`

这是一个文本文件，其中包含该样品存在的所有细胞条形码。条形码以矩阵文件中显示的数据顺序列出（即，这些是列名; i.e. these are the column names）。

barcodes.tsv

7.2. `features.tsv`

This is a text file which contains the identifiers of the quantified genes. The source of the identifier can vary depending on what reference (i.e. Ensembl, NCBI, UCSC) you use in the quantification methods, but most often these are official gene symbols. The order of these genes corresponds to the order of the rows in the matrix file (i.e. these are the row names).

image.png

7.3 `matrix.mtx`

这是一个文本文件，其中包含计数值矩阵。行与上面的基因ID相关联，列与细胞条形码对应。请注意，此矩阵中有许多零值。

matrix.mtx

Loading this data into R requires us to use functions that allow us to efficiently combine these three files into a single count matrix. However, instead of creating a regular matrix data structure, the functions we will use create a sparse matrix to improve the amount of space, memory and CPU required to work with our huge count matrix.

Different methods for reading in data include:

readMM(): This function is from the Matrix package and will turn our standard matrix into a sparse matrix. The features.tsv file and barcodes.tsv must first be individually loaded into R and then they are combined. For specific code and instructions on how to do this please see our additional material.
Read10X(): This function is from the Seurat package and will use the Cell Ranger output directory as input. In this way individual files do not need to be loaded in, instead the function will load and combine them into a sparse matrix for you. We will be using this function to load in our data!

7.4 Reading in a single sample (`read10X()`)

When working with 10X data and its proprietary software Cell Ranger, you will always have an outs directory. Within this directory you will find a number of different files including:

web_summary.html: report that explores different QC metrics, including the mapping metrics, filtering thresholds, estimated number of cells after filtering, and information on the number of reads and genes per cell after filtering.
BAM alignment files: files used for visualization of the mapped reads and for re-creation of FASTQ files, if needed
filtered_feature_bc_matrix: folder containing all files needed to construct the count matrix using data filtered by Cell Ranger
raw_feature_bc_matrix: folder containing all files needed to construct the count matrix using the raw unfiltered data

We are mainly interested in the raw_feature_bc_matrix as we wish to perform our own QC and filtering while accounting for the biology of our experiment/biological system.

If we had a single sample, we could generate the count matrix and then subsequently create a Seurat object:

# How to read in 10X data for a single sample (output is a sparse matrix)
ctrl_counts <- Read10X(data.dir = "data/ctrl_raw_feature_bc_matrix")

# Turn count matrix into a Seurat object (output is a Seurat object)
ctrl <- CreateSeuratObject(counts = ctrl_counts,
                           min.features = 100)

min.features 参数指定每个细胞需要检测的最小基因数量。此参数将过滤掉质量较差的单元，这些单元可能只是封装了随机条形码 random barcodes 而没有任何单元any cell present。通常，检测不到100个基因的细胞不考虑进行分析。

Seurat automatically creates some metadata for each of the cells when you use the Read10X() function to read in data. This information is stored in the meta.data slot within the Seurat object (see more in the note below).

The Seurat object is a custom list-like object 自定义列表状对象 that has well-defined spaces to store specific information/data. You can find more information about the slots in the Seurat object at this link.

# Explore the metadata
head([email protected])

What do the columns of metadata mean?

orig.ident: this often contains the sample identity if known, but will default to “SeuratProject”
nCount_RNA: number of UMIs per cell
nFeature_RNA: number of genes detected per cell

7.5 Reading in multiple samples with a `for loop`

In practice, you will likely have several samples that you will need to read in data for using one of the 2 functions we discussed earlier (Read10X() or readMM()). So, to make the data import into R more efficient we can use a for loop, that will interate over a series of commands for each of the inputs given.

In R, it has the following structure/syntax:

## DO NOT RUN

for (variable in input){
    command1
    command2
    command3
}

The for loop we will be using today will iterate over the two sample “files” and execute two commands for each sample - (1) read in the count data (Read10X()) and (2) create the Seurat objects from the read in data (CreateSeuratObject()):

# Create each individual Seurat object for every sample
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
        seurat_data <- Read10X(data.dir = paste0("data/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file)
        assign(file, seurat_obj)
}

We can break down the for loop to describe the different lines of code:

Step1: specify input

For this dataset, we have two samples that we would like to create a Seurat object for:
ctrl_raw_feature_bc_matrix
stim_raw_feature_bc_matrix

We can specify these samples in the input part for our for loop as elements of a vector using c(). We are assigning these to a variable and we can call that variable anything we would like (try to give it a name that makes sense). In this example, we called the variable file.

During the execution of the above loop, file will first contain the value “ctrl_raw_feature_bc_matrix”, run through the commands all the way through to assign(). Next, it will contain the value “stim_raw_feature_bc_matrix” and once again run through all the commands. If you had 15 folders as input, instead of 2, the above code will run through 15 times, for each of your data folders.

## DO NOT RUN

# Create each individual Seurat object
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
}

Step 2: Read in data for the input

We can continue our for loop by adding a line to read in data with Read10X():

Here, we need to specify the path to the file, so we will prepend the data/ directory to our sample folder name using the paste0() function.

Step 3: Create Seurat object from the 10X count data

Now, we can create the Seurat object by using the CreateSeuratObject() function, adding in the argument project, where we can add the sample name.

## DO NOT RUN

        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file)

Step 4: Assign Seurat object to a new variable based on sample

The last command assigns the Seurat object created (seurat_obj) to a new variable. In this way, when we iterate and move on to the next sample in our input we will not overwrite the Seurat object created in the previous iteration:

## DO NOT RUN
  
        assign(file, seurat_obj)
}

Now that we have created both of these objects, let’s take a quick look at the metadata to see how it looks:

# Check the metadata in the new Seurat objects
head([email protected])
head([email protected])

Next, we need to merge these objects together into a single Seurat object. This will make it easier to run the QC steps for both sample groups together and enable us to easily compare the data quality for all the samples.

We can use the merge() function from the Seurat package to do this:

# Create a merged Seurat object
merged_seurat <- merge(x = ctrl_raw_feature_bc_matrix, 
                       y = stim_raw_feature_bc_matrix, 
                       add.cell.id = c("ctrl", "stim"))

Because the same cell IDs can be used for different samples, we add a sample-specific prefix to each of our cell IDs using the add.cell.id argument. If we look at the metadata of the merged object we should be able to see the prefixes in the rownames:

# Check that the merged object has the appropriate sample-specific prefixes
head([email protected])
tail([email protected])