"Bioinformatics: Introduction and Methods" -- Next-Generation Sequencing (NGS): Transcriptome Analysis with RNA-Seq -- Lecture Notes (17)

Chapter 8 Next-Generation Sequencing (NGS): Transcriptome Analysis with RNA-Seq

8.10 Student presentation ----- Normalization methods for Illumina high-throughput RNA sequencing data analysis

1 Introduction

  • Terms in NGS: Flow cell; Lane; Library; Reads
  • Variation:
  1.  Intra-sample: gene length; GC content
  2.  Inter-sample: Library size(depth); Library composition
  • Normalization is Necessary
  • Hypothetical scenario: The hypothetical example above highlights the notion that the proportion of reads attributed to a given gene in a library depends on the expression properties of the whole sample rather than just the expression level of that gene.
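The example the slide refers to is not reproduced in these notes. As a stand-in, here is a minimal sketch with invented numbers (the gene names and expression values are hypothetical), assuming reads are sampled in proportion to expression and ignoring gene length. Genes A to C have the same expression in both samples, yet their share of the library drops in sample 2 because gene D is much more highly expressed there.

```python
def read_proportions(expression):
    """Return each gene's share of the library, assuming reads are sampled
    in proportion to expression (gene length is ignored for simplicity)."""
    total = sum(expression.values())
    return {gene: level / total for gene, level in expression.items()}

# Hypothetical samples: A, B and C are expressed identically in both,
# but sample 2 expresses gene D at a much higher level.
sample1 = {"A": 100, "B": 100, "C": 100, "D": 100}
sample2 = {"A": 100, "B": 100, "C": 100, "D": 700}

p1 = read_proportions(sample1)
p2 = read_proportions(sample2)

# Gene A is unchanged biologically, yet its read share falls from 0.25 to 0.1.
print(p1["A"], p2["A"])
```

At equal sequencing depth, gene A would therefore receive fewer reads in sample 2, which is why library composition has to be accounted for in addition to library size.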

2 Biological Assumptions

  • Gene counts are divided by the total number of mapped reads (or library size) associated with their lane and multiplied by the mean total count across all the samples of the dataset.
  • TC=\frac{Y_{gk}}{N_{k}}\times \bar{N_{k}} , Total count(TC) (see the papers in the reading list for details)
  • UQ=\frac{Y_{gk}}{N_{0.75k}}\times \bar{N_{k}}, Upper Quartile(UQ)
  • the total counts are replaced by the upper quartile of counts different from 0 in the computation of the normalization factors
  • Med=\frac{Y_{gk}}{N_{0.5k}}\times \bar{N_{k}}, Median(Med)
  • the total counts are replaced by the median counts different from 0 in the computation of the normalization factors
  • Quantile(Q): consists in matching distributions of gene counts across lanes. It is implemented in the Bioconductor package limma by calling the normalizeQuantiles function.
  • RPKM=10^{9}\times \frac{C}{NL}, Reads Per Kilobase per Million(RPKM), C: the number of reads mapped onto the gene's exons; N:total number of reads in the experiment; L: the sum of the exons in base pairs.
  • Principle: facilitates comparisons between genes within a sample, and combines between- and within-sample normalization
  • reference: Differential expression analysis for sequence count data
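The formulas in this list can be sketched in a few lines of Python. This is a minimal illustration that follows the formulas as written in these notes, not the code used in the referenced papers. The function names, the input layout (a dict of per-sample count lists) and the choice of the "inclusive" quartile definition are assumptions. The quantile sketch breaks ties by position rather than averaging them, which may differ from what limma does.

```python
from statistics import mean, median, quantiles

def scale_normalize(counts, method="TC"):
    """Scale raw counts Y_gk by a per-sample factor, then by the mean library size.

    counts: dict mapping sample name -> list of raw gene counts (same gene order).
    method: "TC" (total count), "UQ" (upper quartile) or "Med" (median);
            UQ and Med are computed over non-zero counts only.
    """
    n_bar = mean(sum(y) for y in counts.values())  # mean total count across samples
    normalized = {}
    for sample, y in counts.items():
        nonzero = [v for v in y if v > 0]
        if method == "TC":
            denom = sum(y)
        elif method == "UQ":
            # Quartile definitions differ between tools; "inclusive" is an assumption.
            denom = quantiles(nonzero, n=4, method="inclusive")[2]
        elif method == "Med":
            denom = median(nonzero)
        else:
            raise ValueError(f"unknown method: {method}")
        normalized[sample] = [v * n_bar / denom for v in y]
    return normalized

def quantile_normalize(counts):
    """Give every sample the same distribution: the r-th smallest count in each
    sample is replaced by the mean of the r-th smallest counts across samples."""
    names = list(counts)
    n_genes = len(counts[names[0]])
    sorted_cols = [sorted(counts[k]) for k in names]
    reference = [mean(col[r] for col in sorted_cols) for r in range(n_genes)]
    out = {}
    for k in names:
        order = sorted(range(n_genes), key=lambda g: counts[k][g])
        normalized = [0.0] * n_genes
        for rank, g in enumerate(order):
            normalized[g] = reference[rank]
        out[k] = normalized
    return out

def rpkm(c, n, l):
    """RPKM = 10^9 * C / (N * L), with C reads on the gene's exons,
    N total reads in the experiment, and L the summed exon length in bp."""
    return 1e9 * c / (n * l)

# Hypothetical data: sample s2 was sequenced twice as deeply as s1.
raw = {"s1": [10, 20, 70], "s2": [20, 40, 140]}
tc = scale_normalize(raw, "TC")
uq = scale_normalize(raw, "UQ")
med = scale_normalize(raw, "Med")
qn = quantile_normalize(raw)

print(tc["s1"], tc["s2"])            # both [15.0, 30.0, 105.0]
print(rpkm(500, 20_000_000, 2_000))  # 12.5
```

In this toy case the two samples differ only in depth, so every method maps them onto the same values. The methods diverge when library composition differs, for example when a few highly expressed genes dominate the total count.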

3 DE Comparison

  • TC, UQ, Med, DESeq, TMM, Q, RPKM, RawCount
  • Distribution comparison
  • Intra-variance comparison
  • Housekeeping gene comparison
  • Clustering comparison
  • False-positive rate
  • Summary of normalization efficiency
  • reference: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
  • The figures in this reference are all well worth learning from.

8.14 Student presentation ----- Differential gene expression analysis

  • High-throughput sequencing technology is rapidly becoming the standard method for measuring RNA expression levels (aka RNA-seq)
  • One of the main goals of these experiments is to identify the differentially expressed genes in two or more conditions.
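  • To analyze RNA at the transcript level, first run RNA-seq on the samples to obtain a set of datasets, then identify from those datasets the genes that are differentially expressed between conditions. There are three steps:
  1. Normalization of counts: normalize the read counts obtained.
  2. Parameter estimation of the statistical model: estimate the relevant parameters and choose a suitable analysis model.
  3. Test for differential gene expression: run the differential expression analysis under the chosen model.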
  • reference: Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data
  • This paper compares different analysis methods for RNA-seq data from several angles: Cuffdiff, edgeR, DESeq, PoissonSeq, baySeq and limma.
  • So this is also a review-style paper; it is worth seeing how others present such a paper.
  • The paper mainly uses the following two datasets:
  1. The first is the Sequencing Quality Control (SEQC) dataset, which includes replicated samples of the human whole body reference RNA and human brain reference RNA along with RNA spike-in controls.
  2. The second dataset is RNA-seq data from biological replicates of three cell lines that were characterized as part of the ENCODE project.
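  • Indeed, if you need to compare many tools, you need a common baseline dataset: every tool is run on the same data, and only then can the results be compared.
  • Next come the angles from which the tools are compared, i.e. the means of their analysis: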
  • The analysis in this paper focused on a number of measures that are most relevant for detection of differential gene expression from RNA-seq data:
  1. normalization of count data
  2. sensitivity and specificity of DE detection
  3. performance on the subset of genes that are expressed in one condition but have no detectable expression in the other condition.
  4. the effects of reduced sequencing depth and number of replicates on the detection of differential expression.
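  • The talk then explains what each figure means. Indeed, the figures are the most important part of any paper; reviewers also mostly look at the figures and tables and may not read the main text closely.
  • Real-time PCR results
  • Two identical samples are compared, so in principle no differences should be detected; each tool is run on them and the results are compared.
  • The deeper the sequencing and the more replicate samples, the lower the false-positive rate and the higher the detection sensitivity.
  • Between increasing sequencing depth and adding replicate samples, the latter helps more.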
  • Conclusion
  1. In most benchmarks Cuffdiff performed less favorably: with a higher number of false positives; without any increase in sensitivity (Cuffdiff does not do particularly well: its false-positive rate is relatively high, and its detection sensitivity is not especially high)
  2. Our results conclusively demonstrate that the addition of replicate samples provides substantially greater detection power of DE than increased sequence depth.
  3. Hence, including more replicate samples in RNA-seq experiments is always to be preferred over increasing the number of sequenced reads.
  • So the conclusion slide is very clear: just those two key conclusions, with colored text for emphasis, and the figures shown are closely tied to the conclusions.
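The three-step workflow listed at the start of this section (normalization, parameter estimation, testing) can be sketched end to end in plain Python. This is a toy illustration under stated assumptions, not the method of any tool compared in the paper. It swaps in total-count normalization, a Welch t statistic on log2-transformed values, and an exact permutation test. All function names and count values are hypothetical.

```python
from itertools import combinations
from math import log2, sqrt
from statistics import mean, variance

def tc_normalize(samples):
    """Step 1: scale each sample (a list of gene counts) to the mean library size."""
    n_bar = mean(sum(s) for s in samples)
    return [[y * n_bar / sum(s) for y in s] for s in samples]

def welch_t(a, b):
    """Step 2: estimate per-group means and variances and combine them
    into Welch's t statistic."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def permutation_pvalue(a, b):
    """Step 3: two-sided exact permutation test on |t| over all relabelings."""
    pooled = a + b
    observed = abs(welch_t(a, b))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(welch_t(g1, g2)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical counts for 3 genes in 4 replicates per condition.
# One replicate in each condition was sequenced twice as deeply.
condition_a = [[100, 50, 850], [220, 90, 1690], [95, 55, 850], [105, 52, 843]]
condition_b = [[300, 48, 652], [310, 53, 637], [580, 100, 1320], [305, 47, 648]]

normalized = tc_normalize(condition_a + condition_b)
norm_a, norm_b = normalized[:4], normalized[4:]

pvalues = []
for g in range(3):
    a = [log2(s[g] + 1) for s in norm_a]  # log transform to stabilize variance
    b = [log2(s[g] + 1) for s in norm_b]
    pvalues.append(permutation_pvalue(a, b))

for g, p in enumerate(pvalues):
    print(f"gene {g}: p = {p:.3f}")
```

With four replicates per condition there are only 70 possible relabelings, so the smallest attainable two-sided p-value is 2/70. This ties in with the paper's conclusion: adding replicates increases the resolution and power of the test in a way that deeper sequencing of the same few samples cannot.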
