Bioinformatics Background Notes

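There are currently three main approaches to studying the transcriptome:
(1) hybridization-based cDNA microarrays and oligonucleotide microarrays;
(2) tag-based methods: SAGE (serial analysis of gene expression) and LongSAGE, which rely on Sanger sequencing, and MPSS (massively parallel signature sequencing);
(3) transcriptome sequencing based on second-generation sequencing, also known as RNA-Seq.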

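Data from the Sanger-based methods are rarely seen; most GEO datasets fall into two categories, microarray data and second-generation sequencing data (sequencing data for short).
The microarray data commonly found on GEO are usually in situ oligonucleotide and spotted oligonucleotide arrays (both oligonucleotide arrays) and spotted DNA/cDNA arrays (cDNA arrays). Many Affymetrix arrays are in situ oligonucleotide arrays. There are many other array vendors, such as Agilent and Applied Biosystems (AB).
Microarray data are signal intensities (non-integer values), which differ from counts, so the analysis workflow differs as well. In addition, microarrays use probes to represent genes, and probe IDs differ from platform to platform, so the matching GPL file is needed for annotation (for many platforms the annotation database is also packaged and can be installed from Bioconductor).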
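
The probe-to-gene annotation step described above can be sketched in plain Python. This is a minimal illustration, not a standard pipeline: the probe IDs, gene symbols, and the `annotate_probes` helper are hypothetical, the mapping is assumed to have already been parsed from the platform's GPL table, and averaging probes that hit the same gene is only one of several common collapsing strategies.

```python
from collections import defaultdict
from statistics import mean


def annotate_probes(expression, probe_to_gene):
    """Collapse probe-level intensities to gene-level values.

    expression: dict of probe ID -> list of intensities (one per sample)
    probe_to_gene: dict of probe ID -> gene symbol (e.g. parsed from a GPL table)
    Probes without an annotation are dropped; probes that map to the same
    gene are averaged sample by sample.
    """
    grouped = defaultdict(list)
    for probe, values in expression.items():
        gene = probe_to_gene.get(probe)
        if gene:  # skip probes with no (or empty) annotation
            grouped[gene].append(values)
    return {
        gene: [mean(sample_values) for sample_values in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical toy data: four probes measured in two samples.
expression = {
    "probe_1": [8.0, 6.0],
    "probe_2": [10.0, 8.0],
    "probe_3": [5.0, 5.5],
    "probe_4": [1.0, 2.0],  # has no entry in the annotation table
}
probe_to_gene = {"probe_1": "GENE_A", "probe_2": "GENE_A", "probe_3": "GENE_B"}

gene_level = annotate_probes(expression, probe_to_gene)
print(gene_level)
```

Because the values are continuous intensities rather than counts, the gene-level table produced this way is normally passed to methods designed for continuous data, not to count-based tools.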

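Second-generation sequencing data also come in different workflows, depending on the processing software used. TCGA, for example, offers the following:

[Figure: workflow types available on TCGA]

HTSeq is the most common of these. Another common workflow is RSEM data (expression levels estimated with the RSEM algorithm).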

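RSEM deserves a closer look; some explanations of it are quoted below.
RSEM (RNA-Seq by Expectation-Maximization) estimates expression levels with the EM algorithm. The main problem it addresses is that, because of alternative splicing and similar causes, some reads can map to more than one transcript, which makes count-based quantification ambiguous. The classic ALEXA-seq approach computes expression only from reads that map to a single reference location, whereas RSEM uses the EM algorithm to estimate abundances (RSEM was published in 2010).
Data produced by the RSEM workflow are best analyzed for differential expression with the EBSeq package; DESeq, edgeR, and similar tools are not recommended for them.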
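
The EM idea behind RSEM can be illustrated with a toy example. The sketch below is not RSEM's actual model, which also accounts for transcript length, fragment position, and sequencing errors; it only assumes that each read comes with a list of transcripts it is compatible with. The function name `em_abundance` and the read data are hypothetical.

```python
def em_abundance(read_compat, n_transcripts, n_iter=200):
    """Toy EM for reads that may map to several transcripts.

    read_compat: list of lists; each inner list holds the indices of the
                 transcripts one read is compatible with.
    Returns (theta, expected_counts), where theta[t] is the estimated
    fraction of reads originating from transcript t.
    """
    theta = [1.0 / n_transcripts] * n_transcripts  # start from a uniform guess
    expected = [0.0] * n_transcripts
    n_reads = len(read_compat)
    for _ in range(n_iter):
        # E-step: split each read across its compatible transcripts
        # in proportion to the current abundance estimates.
        expected = [0.0] * n_transcripts
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        theta = [c / n_reads for c in expected]
    return theta, expected


# Hypothetical data: 6 reads unique to transcript 0, 2 unique to
# transcript 1, and 4 reads compatible with both.
reads = [[0]] * 6 + [[1]] * 2 + [[0, 1]] * 4
theta, expected_counts = em_abundance(reads, n_transcripts=2)
print(theta, expected_counts)
```

In this toy case the four ambiguous reads end up split 3:1 in favor of transcript 0, following the ratio suggested by the uniquely mapped reads, instead of being discarded. In general the expected counts are not integers.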


RSEM [1,2] is an RNA-Seq transcript quantification program developed in 2009. You need a server with Linux/Mac OS. To run RSEM, your server should have C++, Perl and R installed. In addition, you need at least one aligner to align RNA-Seq reads for you. RSEM can call Bowtie, Bowtie 2 or STAR for you if you have them installed. Last but not least, you need to install the latest version of RSEM.

The paper (RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome) also contains this passage:
The primary output of RSEM consists of two files, one for isoform-level estimates, and the other for gene-level estimates. Abundance estimates are given in terms of two measures. The first is an estimate of the number of fragments that are derived from a given isoform or gene. We can only estimate this quantity because reads often do not map uniquely to a single transcript. This count is generally a non-integer value and is the expectation of the number of alignable and unfiltered fragments that are derived from a isoform or gene given the ML abundances. These (possibly rounded) counts may be used by a differential expression method such as edgeR [9] or DESeq [8]. The second measure of abundance is the estimated fraction of transcripts made up by a given isoform or gene. This measure can be used directly as a value between zero and one or can be multiplied by 10^6 to obtain a measure in terms of transcripts per million (TPM). The transcript fraction measure is preferred over the popular RPKM [18] and FPKM [6] measures because it is independent of the mean expressed transcript length and is thus more comparable across samples and species [7].
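
The relationship between counts, TPM, and FPKM mentioned in the passage can be made concrete with a small calculation. This is a sketch under stated assumptions: the counts and effective lengths are made-up numbers, and the helper names `tpm` and `fpkm` are hypothetical, not part of RSEM.

```python
def tpm(counts, eff_lengths):
    """Transcripts per million from (expected) counts and effective lengths."""
    rates = [c / l for c, l in zip(counts, eff_lengths)]  # reads per base
    total = sum(rates)
    return [r / total * 1e6 for r in rates]


def fpkm(counts, eff_lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    n_fragments = sum(counts)
    return [
        c / (l / 1e3) / (n_fragments / 1e6)
        for c, l in zip(counts, eff_lengths)
    ]


# Hypothetical sample: three transcripts with expected counts and
# effective lengths in bases.
counts = [100.0, 200.0, 300.0]
eff_lengths = [1000.0, 2000.0, 1000.0]

tpm_values = tpm(counts, eff_lengths)
fpkm_values = fpkm(counts, eff_lengths)

# Rescaling FPKM so that it sums to one million recovers TPM.
tpm_from_fpkm = [f / sum(fpkm_values) * 1e6 for f in fpkm_values]

print(tpm_values, sum(tpm_values))
print(fpkm_values, sum(fpkm_values))
```

TPM values always sum to one million within a sample, whereas the sum of FPKM values depends on the length distribution of the expressed transcripts. This is the reason the passage gives for TPM being more comparable across samples.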

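Recommended RSEM references
RSEM documentation
Differential expression analysis with RSEM
Alignment-based transcript quantification: RSEM
Transcriptome analysis study notes
Discussion of RSEM issues in TCGA
Transcriptome analysis pipeline: STAR + RSEM + DESeq2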

Other recommended RNA-Seq resources
RNA_seq_Biotrainee
