Single cell RNA-seq data analysis with R: video study notes (Part 2)

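Continuing from the previous notes, this lecture covers quality control for single-cell sequencing. The video is long, so this post is long too. I took most of the notes in English and wrote down most of what was said in the video (where the speaker was long-winded, I summarized instead). I did this because this part is very important.
Video: https://www.youtube.com/watch?v=rOm6UIPhHnc&list=PLjiXAZO27elC_xnk7gVNM85I2IQl5BEJN&index=2
Course website: https://www.csc.fi/web/training/-/scrnaseq
There is also a hands-on quality control exercise, with practice data to download and the full code: https://github.com/NBISweden/excelerate-scRNAseq/blob/master/session-qc/Quality_control.md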

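The course website only has the YouTube links and the slides, and the slides carry no narration, so they feel rather empty on their own. I therefore typed up the explanations (not all of them) for later reference.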

Without further ado, let's start:
Lecture 2: Quality control for single-cell sequencing

Now I will talk about one of the most important steps in single-cell analysis. Every time you do single-cell RNA-seq, whether it's SMART-seq2 or 10x, you will have failed libraries, some half-dead cells, and doublets. You really need to look at your data and filter it before you start clustering. So I think it's really important not to rush through quality control.

I will first go through the different steps of a single-cell experiment, talk about the issues that arise at each point, and then say a bit about the filtering itself.

But before we start, I want to talk about transcriptional bursting, because it is really important for understanding single-cell data. In each single cell, a gene is not constantly "on". Expression happens in bursts: the transcriptional machinery binds to a gene and starts producing mRNA, circles around the gene locus producing a lot of mRNA, and then falls off. (Panel A, upper left) Here is an illustration from an experiment: in black, the gene is "on" and mRNA is produced, and later the mRNA is degraded. Once the mRNA is produced, protein production starts too. So when you measure protein abundance in a single cell, it varies over time. (Panel A, right) But the mRNA is much, much more spiky: in this experiment, each blue line marks the gene being turned on (the blue peaks in the lower right).

So if you have a population of cells of the same cell type, shown here with only 4 genes, they vary in size. In individual cells, some genes are absent and some are quite abundant. When you do bulk RNA-seq, you get the average of these. But if you do single-cell sequencing, you of course get some bias. When you do reverse transcription, you also lose a lot of transcripts: you will detect only about 10% or 40% of the transcripts, depending on the technique you are using. Then you amplify, which may introduce further bias, and this selection is not completely random: some transcripts are reverse transcribed more easily than others. So there are many sources of dropout, or missing genes, which produce all the zeros in your data frame in the end.
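To make the link between capture efficiency and zeros concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each mRNA molecule is captured independently with a fixed probability; the real process is biased, as noted above. The function names and the molecule counts are hypothetical.

```python
import random

def simulate_capture(true_counts, capture_rate, rng):
    """Keep each mRNA molecule independently with probability capture_rate."""
    return [sum(1 for _ in range(n) if rng.random() < capture_rate)
            for n in true_counts]

def zero_fraction(true_count, capture_rate, n_cells, rng):
    """Fraction of simulated cells in which a gene is not detected at all."""
    zeros = 0
    for _ in range(n_cells):
        if simulate_capture([true_count], capture_rate, rng)[0] == 0:
            zeros += 1
    return zeros / n_cells

rng = random.Random(0)  # fixed seed so the sketch is reproducible
true_counts = [0, 2, 20, 200]  # hypothetical molecules per gene in one cell
observed = simulate_capture(true_counts, capture_rate=0.1, rng=rng)
print("true:", true_counts, "observed:", observed)
print("zero fraction for a 2-molecule gene:", zero_fraction(2, 0.1, 1000, rng))
```

Under this simplified model, lowly expressed genes are missed in most cells even though they are truly present, which is one way to read the many zeros in a single-cell count matrix.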

And then, of course, you end up with these datasets. If you are used to working with bulk RNA-seq, you will find that single-cell RNA-seq data looks quite crappy. We have a lot of batch effects in single-cell RNA-seq data.

Here I try to visualize the path from raw data to the gene expression matrix. You do quality control before you go on to normalization and removing batch effects. Then you do clustering, visualization and so on. QC is always the first step once you have the gene expression matrix.

The first step is cell dissociation and single-cell capture.

I would say this is the most critical step in single-cell RNA-seq. It is the biggest contributor to batch effects, and it is really important to have whole, healthy cells. If the cells are hard to dissociate, you can do laser capture. If your conditions are too harsh, you damage the cells, and RNA leaking from the damaged cells gives you a background signal that will always bias your data. If you take samples from a tissue for both bulk RNA-seq and single-cell sequencing, you may get different cell-type proportions, because single-cell sequencing favors cells that are in good condition.

This is a study showing the induction of some genes during dissociation. They basically stain cells before and after dissociation. You can see that a lot of genes are upregulated, and they show that you can get artificial clusters of cells that are a dissociation-induced effect.

As I said, you might get ambient RNA. This work is on immune cells, with three different time points and different cell types. As you can see, all the samples from time point three have a background of neutrophil genes. We think that at day 3 we probably have more neutrophils, but a lot of them are broken, so we get a lot of RNA from them.

For single-cell capture, we can do FACS sorting or use droplet-based methods. You will always get doublets, and you will get empty wells. (That is, whatever the method, you may get doublets and you may get empty wells.)

Here we have human cells and mouse cells, and some cells have reads from both species. The estimate is that the number of doublets you get increases linearly with the number of cells you load. (So if you load only about 870 cells, roughly 0.4% of them will be doublets. If you use more than 10,000 cells, you will have about 8% doublets.)
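The linear relationship can be sketched as a small calculation. This is only an illustration: it assumes the line passes through the origin and is scaled from the single reference point quoted above (about 0.4% doublets at about 870 loaded cells). The function name and default parameters are hypothetical, and real rates depend on the platform and protocol.

```python
def expected_doublet_rate(cells_loaded, ref_cells=870, ref_rate=0.004):
    """Linear estimate through the origin, scaled from one reference point.

    ref_cells and ref_rate come from the figures quoted in the notes;
    the through-the-origin line is an assumption of this sketch.
    """
    return ref_rate * cells_loaded / ref_cells

for n in (870, 5000, 17000):
    print(n, f"{expected_doublet_rate(n):.1%}")
```

Under this assumption, loading on the order of 17,000 cells gives a rate just under 8%, which is in line with the figure quoted for more than 10,000 cells.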

Another problem is that if you have damaged cells, a lot of the doublets you get are actually cell debris from one cell sticking to another cell. This is a dataset I worked with, where I did clustering, found cells with signatures from two clusters, and called them doublets. As you can see, it had a very bad experimental design: some samples have only around a thousand cells, and some up here have 17,000 cells. That is not advised, because normalization and everything else will differ between the datasets. Even though we sequenced this one much deeper, we don't reach the same level of saturation. But we also see that the more cells, the more doublets.

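In the left-hand figure above, the cells connecting the two groups (in blue) appear to carry the signatures of both groups. In a differentiation experiment, the group on the left may be differentiating into the group on the right, and then you cannot tell whether these cells are intermediates of the differentiation process or doublets.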

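So, how do we find doublets? This is the hardest thing. (At this point someone asked whether fixing the cells before sequencing improves sequencing quality. The speaker said there seems to be no paper showing that this works, but if your tissue is very hard to dissociate, it may be a good idea.)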

Here are some methods for detecting doublets. It is very tricky when you are looking into cellular differentiation. Hopefully, as cells go from one state to another, some genes come up in the middle of the differentiation time course, so you can distinguish cells that are differentiating from doublets that are just a mix of the two signatures. But if that signal is weak, you don't really know what you are looking at, and it can be hard. So plan to have fewer doublets if you are looking into differentiation.
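The idea of calling a cell a doublet when it carries the signatures of two clusters can be sketched as follows. This is a toy illustration, not one of the detection methods from the slide: the marker genes, expression values and threshold are all hypothetical, and, as discussed above, a cell flagged this way could equally be a differentiation intermediate.

```python
def signature_score(expr, markers):
    """Mean expression of the marker genes in one cell (missing genes count as 0)."""
    return sum(expr.get(g, 0.0) for g in markers) / len(markers)

def candidate_doublets(cells, markers_a, markers_b, threshold):
    """Return the cells whose score is at least `threshold` for BOTH signatures."""
    return [name for name, expr in cells.items()
            if signature_score(expr, markers_a) >= threshold
            and signature_score(expr, markers_b) >= threshold]

# Hypothetical marker genes for two clusters and three toy cells
markers_a = ["GeneA1", "GeneA2"]
markers_b = ["GeneB1", "GeneB2"]
cells = {
    "cell1": {"GeneA1": 5.0, "GeneA2": 4.0, "GeneB1": 0.0, "GeneB2": 0.5},
    "cell2": {"GeneA1": 0.0, "GeneA2": 0.2, "GeneB1": 6.0, "GeneB2": 5.0},
    "cell3": {"GeneA1": 4.0, "GeneA2": 3.5, "GeneB1": 5.0, "GeneB2": 4.5},
}
print(candidate_doublets(cells, markers_a, markers_b, threshold=2.0))
```

Only the cell that scores highly for both marker sets is flagged; whether it is a true doublet still has to be judged in the context of the experiment.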

Once we have sorted the cells, we need to lyse them. (Different cell types may need different lysis methods; plant cells, for example, also need their cell walls broken.) What I've also seen is that, depending on the cell, you may or may not get nuclear lysis. That gives a clearly different transcriptional landscape, because the nuclear transcripts are detected or not depending on whether you have nuclear lysis.

And then reverse transcription is, of course, a major limiting step.

Likewise, the amplification step also introduces bias.

Sequencing machines are not perfect either, and they also introduce bias when they run, for example through air bubbles or contamination.

You cannot use spike-ins with the 10x method. As was said, most of the droplets are empty, so if you add spike-ins to everything, you mainly sequence empty droplets containing spike-ins, and that costs a lot before you actually manage to find the cells. There are two kits of artificial sequences that we add to the cells. You should add them to the lysis buffer so that they go through all the same processing steps as the cell's RNA. The idea is that these two cells look like they have identical transcriptional landscapes, but if you add spike-ins, you can actually see that one has twice as much RNA as the other.

Spike-ins are not perfect either. There is a difference in amplification bias, because the ERCCs are shorter transcripts and amplify better. But we can still roughly back-estimate the number of RNA molecules using the spike-ins. We can use them to model technical noise and drop-out rates, so for each experiment you can see how efficient the reverse transcription was in each single well.

Different papers may use different criteria for filtering cells and genes.

In general, if you sequence with SMART-seq2, you should look at the four aspects above to assess your sequencing quality.

When it comes to spike-ins, if they are not even amplified and go undetected in your library prep, you know you can't use that library and can throw it away. The proportion of spike-in to RNA tells you something about how much RNA you have in the cells, so you can use it for filtering. (If your cells are large, you can add more spike-in. How much to add depends on your cell type, so it is best to check beforehand how much others have used.)

Bigger cells have a higher ratio of RNA to spike-in. They also have a higher number of detected genes.
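The spike-in reasoning above can be sketched as a ratio of endogenous counts to spike-in counts per cell. This sketch assumes that the same amount of spike-in was added to every cell and that spike-in features can be recognized by an `ERCC-` name prefix; the prefix, the function name and the toy counts are assumptions for illustration.

```python
def endogenous_to_spike_ratio(counts, spike_prefix="ERCC-"):
    """Ratio of endogenous counts to spike-in counts for one cell.

    Returns None when no spike-in is detected, in which case the
    library cannot be assessed this way.
    """
    spike = sum(c for g, c in counts.items() if g.startswith(spike_prefix))
    endo = sum(c for g, c in counts.items() if not g.startswith(spike_prefix))
    if spike == 0:
        return None
    return endo / spike

# Two toy cells with identical endogenous profiles but different spike-in counts
cell_a = {"GeneA": 100, "GeneB": 300, "ERCC-00002": 100}
cell_b = {"GeneA": 100, "GeneB": 300, "ERCC-00002": 50}
cell_c = {"GeneA": 100, "GeneB": 300}  # spike-in not detected

for name, cell in (("a", cell_a), ("b", cell_b), ("c", cell_c)):
    print(name, endogenous_to_spike_ratio(cell))
```

The endogenous profiles of the first two cells look identical, but the second has half as many spike-in counts, so under the stated assumption it contained about twice as much RNA. The third cell has no detectable spike-in and would be a candidate for removal.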

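Many people also filter out mitochondrial genes when filtering genes. If a cell is damaged, its RNA leaks out, but the mitochondria stay in place, so you get a lot of mitochondrial reads. In general, a high proportion of mitochondrial reads indicates that something is wrong with the cell.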
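The mitochondrial check can be sketched as a per-cell fraction. This assumes mitochondrial genes are named with an `MT-` prefix (compared case-insensitively here, so a lowercase `mt-` prefix also matches); the 0.2 cutoff and the toy counts are hypothetical and should be chosen from your own data.

```python
def mito_fraction(counts, mito_prefix="MT-"):
    """Fraction of a cell's counts that come from mitochondrial genes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    mito = sum(c for g, c in counts.items()
               if g.upper().startswith(mito_prefix))
    return mito / total

def high_mito_cells(cells, max_fraction):
    """Names of cells whose mitochondrial fraction exceeds max_fraction."""
    return [name for name, counts in cells.items()
            if mito_fraction(counts) > max_fraction]

cells = {
    "healthy": {"MT-CO1": 5, "ACTB": 60, "GAPDH": 35},
    "damaged": {"MT-CO1": 40, "MT-ND1": 20, "ACTB": 40},
}
print({name: mito_fraction(c) for name, c in cells.items()})
print(high_mito_cells(cells, max_fraction=0.2))  # 0.2 is a hypothetical cutoff
```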

Ribosomal RNA is also related to data quality, but note that its proportion differs between cell types, so be careful when filtering on it.

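If you use SMART-seq2, you can also look at 3' bias. If your RNA is degraded, it looks like the figure on the right: the reads pile up at the 3' end of the genes. Such cells must be filtered out.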

There is a lot to watch for in quality control, but the three items in bold above deserve particular attention.

So how do you filter? I say look at the distributions.
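Looking at a distribution can be as simple as binning a QC metric and printing the counts. The sketch below does this for a hypothetical list of detected genes per cell; the function names and the values are made up for illustration, and in practice you would use your plotting tool of choice.

```python
def histogram_bins(values, n_bins=5):
    """Count values in n_bins equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins or 1  # avoid a zero bin width if all values are equal
    bins = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / step), n_bins - 1)  # the maximum goes in the last bin
        bins[idx] += 1
    return lo, step, bins

def print_histogram(values, n_bins=5):
    """Print one line per bin with its range and a bar of '#' characters."""
    lo, step, bins = histogram_bins(values, n_bins)
    for i, count in enumerate(bins):
        start = lo + i * step
        print(f"{start:8.0f} - {start + step:8.0f} | {'#' * count}")

# Hypothetical number of detected genes per cell
genes_per_cell = [150, 2100, 2300, 2500, 2600, 2700, 2900, 3100, 3300, 6500]
print_histogram(genes_per_cell)
```

Cells sitting far from the bulk of the distribution, at either end, are the ones to inspect before deciding on a cutoff.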


You can also take all the different quality control metrics you have, put them into one big matrix, do PCA, and look for outliers in the PCA. That is actually quite a good way to find outliers. But the outliers you get in the PCA might also be your small cluster.

In Drop-seq or 10x data, a lot of people use the number of molecules, cutting at both ends.
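Cutting at both ends of the molecule count can be sketched as a simple range filter. The thresholds and cell names below are hypothetical; the low end is meant to catch empty or broken droplets and the high end possible doublets, and both cutoffs should be read off your own distribution.

```python
def filter_by_total_counts(totals, lower, upper):
    """Keep cells whose total molecule count lies within [lower, upper]."""
    return {cell: n for cell, n in totals.items() if lower <= n <= upper}

# Hypothetical total molecule (UMI) counts per cell
totals = {"cell1": 120, "cell2": 4500, "cell3": 5200, "cell4": 48000}
kept = filter_by_total_counts(totals, lower=500, upper=20000)
print(sorted(kept))
```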

As I said, filtering is not easy. You have to know your data: how heterogeneous it is, and what kinds of cell types and sizes you expect in your dataset when you apply your filtering.

Another thing is quality control of genes.

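Here is a plot of the contribution of each gene to the total counts, and here you clearly see a case with too much spike-in (ERCC). This may be caused by broken cell membranes, which leads to poor sequencing quality. In addition, some genes account for 20% or more of the total reads, and it is best to remove them before clustering.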
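The per-gene contribution described above can be computed directly from a count matrix. This sketch stores the matrix as a dictionary of cells, each mapping gene names to counts; the data, the function names and the `ERCC-` feature name are hypothetical, and the 20% cutoff is taken from the figure mentioned in the notes.

```python
def gene_share_of_counts(matrix):
    """Share of the total counts contributed by each gene across all cells."""
    per_gene = {}
    for counts in matrix.values():
        for gene, c in counts.items():
            per_gene[gene] = per_gene.get(gene, 0) + c
    total = sum(per_gene.values())
    if total == 0:
        return {gene: 0.0 for gene in per_gene}
    return {gene: c / total for gene, c in per_gene.items()}

def dominant_genes(matrix, max_share=0.2):
    """Genes (or spike-ins) whose share of total counts exceeds max_share."""
    shares = gene_share_of_counts(matrix)
    return sorted(g for g, s in shares.items() if s > max_share)

matrix = {
    "cell1": {"ERCC-00002": 60, "GeneA": 10, "GeneB": 14, "GeneC": 16},
    "cell2": {"ERCC-00002": 40, "GeneA": 20, "GeneB": 25, "GeneC": 15},
}
print(gene_share_of_counts(matrix))
print(dominant_genes(matrix))
```

In this toy matrix the spike-in takes half of all counts, which is the kind of pattern the plot is meant to reveal.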

As we said, batch effects will happen. I think it is quite important to take all your quality control metrics and plot them per batch.
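Before plotting, a per-batch summary of each QC metric already shows whether batches differ. The sketch below groups hypothetical per-cell records by batch and takes the median of one metric; the record layout and field names are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median

def median_by_batch(records, metric):
    """Median of one QC metric for each batch."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["batch"]].append(rec[metric])
    return {batch: median(values) for batch, values in grouped.items()}

# Hypothetical per-cell QC records
records = [
    {"batch": "A", "n_genes": 2500, "mito_frac": 0.04},
    {"batch": "A", "n_genes": 2700, "mito_frac": 0.05},
    {"batch": "A", "n_genes": 2900, "mito_frac": 0.06},
    {"batch": "B", "n_genes": 1200, "mito_frac": 0.15},
    {"batch": "B", "n_genes": 1400, "mito_frac": 0.18},
    {"batch": "B", "n_genes": 1300, "mito_frac": 0.21},
]
for metric in ("n_genes", "mito_frac"):
    print(metric, median_by_batch(records, metric))
```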

So you can look at, for example, the uniquely mapped reads in your different samples to understand what the batch effects are.

I think PCA is really important because it gives you a good understanding of the main variability in your data.

You can also use PCA to check for batch effects. You can look at how well your different batches correlate with the different principal components to start understanding how much batch effect you have and what it is. You can also go into the loadings, of PC1 for instance, and see which genes are driving the separation of these two datasets.
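Checking how strongly a batch correlates with a principal component can be sketched as a Pearson correlation between the PC scores and a 0/1 batch indicator. The sketch assumes the PC scores have already been computed by whatever PCA tool you use; the scores and batch labels below are made up for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical PC scores for six cells and their batch labels (0 or 1)
batch = [0, 0, 0, 1, 1, 1]
pc1 = [-2.1, -1.9, -2.0, 2.0, 1.8, 2.2]
pc2 = [0.5, -0.4, 0.1, -0.3, 0.6, -0.5]

print("PC1 vs batch:", round(pearson(pc1, batch), 3))
print("PC2 vs batch:", round(pearson(pc2, batch), 3))
```

A component whose scores track the batch label this closely is largely explained by the batch, and its loadings are the place to look for the genes responsible.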

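The speaker mentioned that after finishing the whole analysis she usually goes back to the QC step and checks again, because you cannot know whether your clustering has found two truly different cell populations or whether the split is caused by poor QC.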
