Basic Molecular Structures (1): exon, intron, CDS, ORF, 3′-UTR, 5′-UTR

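Molecular biology is the branch of biology that studies life at the molecular level. In other words, it focuses mainly on DNA, RNA, and proteins and on the interactions among them. A basic grasp of molecular biology is essential for using bioinformatics tools correctly, so I want to study its key concepts. In this post I briefly go through several important structural elements of biological molecules: exon, intron, CDS, ORF, 3′-UTR, and 5′-UTR. For each one I first quote a reference definition and then restate it in my own words. Comments and discussion are welcome.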

DNA, pre-mRNA & mature RNA

(Image source: https://www.ibric.org/myboard/read.php?Board=exp_qna&id=536491)

1. Exon

An exon is any part of a gene that will form a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing. The term exon refers to both the DNA sequence within a gene and to the corresponding sequence in RNA transcripts. In RNA splicing, introns are removed and exons are covalently joined to one another as part of generating the mature RNA. Just as the entire set of genes for a species constitutes the genome, the entire set of exons constitutes the exome.

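Exons and introns, discussed below, are mainly features of eukaryotic genes. An exon is a region of a gene that is transcribed into the precursor RNA (pre-mRNA) and retained in the mature RNA after RNA splicing. The term therefore also applies to the corresponding sequence in the mature RNA. Note that exons are not purely coding: the untranslated regions described later are also part of exons.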
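The splicing step can be illustrated with a minimal sketch. The sequence and the exon coordinates below are made up for illustration (uppercase marks exon bases, lowercase marks intron bases), and `splice` is a hypothetical helper, not a function from any library.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon slices (0-based, end-exclusive) in order, dropping everything else."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy pre-mRNA (assumed, not a real gene): three exons separated by two introns.
pre_mrna = "AUGGCC" + "guaaguuuuag" + "UUUGGA" + "guccgcag" + "UAAGCU"
exons = [(0, 6), (17, 23), (31, 37)]

mature = splice(pre_mrna, exons)
print(mature)
```

Only the uppercase exon bases survive in `mature`, which mirrors the statement that exons are the parts retained after splicing.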

2. CDS

The coding region of a gene, also known as the coding sequence (CDS), is the portion of a gene's DNA or RNA that codes for protein.[1] Studying the length, composition, regulation, splicing, structures, and functions of coding regions compared to non-coding regions over different species and time periods can provide a significant amount of important information regarding gene organization and evolution of prokaryotes and eukaryotes.[2] This can further assist in mapping the human genome and developing gene therapy.[3]

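CDS stands for coding sequence, the region that is actually translated into protein. For eukaryotes this may raise a question: isn't that what exons are? The two differ. Because of alternative splicing, not every exon is retained in a given transcript, and even the retained exons include untranslated regions. The CDS is the part of the retained exon sequence that directs protein synthesis, running from the start codon to the stop codon. The concept applies at both the DNA level and the RNA level.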
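To make the exon versus CDS distinction concrete, here is a small sketch that clips exon intervals to the coding window. The coordinates are invented, and `cds_segments` is a hypothetical helper written for this post. It assumes 0-based, end-exclusive genomic coordinates on the forward strand.

```python
def cds_segments(exons, cds_start, cds_end):
    """Clip each exon interval to the [cds_start, cds_end) coding window."""
    segments = []
    for start, end in sorted(exons):
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo < hi:  # keep only exons that overlap the coding window
            segments.append((lo, hi))
    return segments


# Assumed toy gene: the first exon starts with 5'-UTR, the last ends with 3'-UTR.
exons = [(100, 200), (300, 400), (500, 650)]
segments = cds_segments(exons, cds_start=150, cds_end=560)
cds_length = sum(hi - lo for lo, hi in segments)

print(segments)
print(cds_length, cds_length % 3 == 0)
```

The exons span 350 bases in total, but only 210 of them fall inside the CDS; the rest of the exon sequence is untranslated.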

3. Intron

An intron is any nucleotide sequence within a gene that is not expressed or operative in the final RNA product. The word intron is derived from the term intragenic region, i.e. a region inside a gene.[1] The term intron refers to both the DNA sequence within a gene and the corresponding RNA sequence in RNA transcripts.[2] The non-intron sequences that become joined by this RNA processing to form the mature RNA are called exons.[3]

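An intron is a non-coding sequence within a gene that is transcribed into pre-mRNA but removed during RNA splicing. As with exons, the corresponding sequence in the pre-mRNA is also called an intron. Note that the mature mRNA still contains non-coding sequences after splicing; these are covered in the UTR sections below.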
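Since introns are the transcribed stretches between consecutive exons, their coordinates can be derived from the exon coordinates alone. The sketch below assumes non-overlapping exons of a single transcript in 0-based, end-exclusive coordinates; `introns_between` and the numbers are illustrative only.

```python
def introns_between(exons):
    """Return the gaps between consecutive exons as (start, end) intervals."""
    ordered = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Assumed toy transcript with three exons.
exons = [(100, 200), (300, 400), (500, 650)]
introns = introns_between(exons)
print(introns)
```

A transcript with n exons yields n - 1 introns under this definition, and a single-exon transcript yields none.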

4. ORF

In molecular biology, open reading frames (ORFs) are defined as spans of DNA sequence between the start and stop codons. Usually, this is considered within a studied region of a prokaryotic DNA sequence, where only one of the six possible reading frames will be "open" (the "reading", however, refers to the RNA produced by transcription of the DNA and its subsequent interaction with the ribosome in translation). Such an ORF may[1] contain a start codon (usually AUG in terms of RNA) and by definition cannot extend beyond a stop codon (usually UAA, UAG or UGA in RNA).[2] That start codon (not necessarily the first) indicates where translation may start. The transcription termination site is located after the ORF, beyond the translation stop codon. If transcription were to cease before the stop codon, an incomplete protein would be made during translation.[3]
In eukaryotic genes with multiple exons, introns are removed and exons are then joined together after transcription to yield the final mRNA for protein translation. In the context of gene finding, the start-stop definition of an ORF therefore only applies to spliced mRNAs, not genomic DNA, since introns may contain stop codons and/or cause shifts between reading frames. An alternative definition says that an ORF is a sequence that has a length divisible by three and is bounded by stop codons.[1][4] This more general definition can be useful in the context of transcriptomics and metagenomics, where a start or stop codon may not be present in the obtained sequences. Such an ORF corresponds to parts of a gene rather than the complete gene.

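ORF is short for open reading frame: a stretch of sequence that could potentially encode a peptide or protein. Prokaryotic genes generally lack introns, so there an ORF is simply the sequence from a start codon to a stop codon, and this holds for both DNA and RNA. Eukaryotic genes contain introns, which may themselves contain stop codons or shift the reading frame, so for eukaryotes the ORF is defined on the mature mRNA as the sequence from the start codon to the stop codon.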
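The start-to-stop definition can be sketched as a simple scan over the three forward reading frames of a mature mRNA. This is a simplified illustration, not how production ORF finders work: it uses only AUG as the start codon, ignores the reverse strand, and reports the first AUG in each open stretch. The function name, the `min_codons` parameter, and the sequence are all assumptions.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orfs(mrna: str, min_codons: int = 2):
    """Return (start, end) pairs (0-based, end-exclusive) for AUG...stop ORFs."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(mrna) - 2, 3):
            codon = mrna[i:i + 3]
            if start is None and codon == "AUG":
                start = i  # first start codon in this open stretch
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # stop codon included
                start = None
    return sorted(orfs)


# Toy mature mRNA (assumed): 2-base leader, a 5-codon ORF, 3-base trailer.
mrna = "GGAUGGCCUUUGGAUAAGCU"
for start, end in find_orfs(mrna):
    print(start, end, mrna[start:end])
```

Running the same scan on unspliced eukaryotic genomic sequence would be misleading for the reason given above: an intron can introduce a stop codon or shift the frame.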

5. 3′-UTR

In molecular genetics, the three prime untranslated region (3′-UTR) is the section of messenger RNA (mRNA) that immediately follows the translation termination codon. The 3′-UTR often contains regulatory regions that post-transcriptionally influence gene expression.
During gene expression, an mRNA molecule is transcribed from the DNA sequence and is later translated into a protein. Several regions of the mRNA molecule are not translated into a protein including the 5' cap, 5' untranslated region, 3′ untranslated region and poly(A) tail. Regulatory regions within the 3′-untranslated region can influence polyadenylation, translation efficiency, localization, and stability of the mRNA.[1][2] The 3′-UTR contains binding sites for both regulatory proteins and microRNAs (miRNAs). By binding to specific sites within the 3′-UTR, miRNAs can decrease gene expression of various mRNAs by either inhibiting translation or directly causing degradation of the transcript. The 3′-UTR also has silencer regions which bind to repressor proteins and will inhibit the expression of the mRNA.
Many 3′-UTRs also contain AU-rich elements (AREs). Proteins bind AREs to affect the stability or decay rate of transcripts in a localized manner or affect translation initiation. Furthermore, the 3′-UTR contains the sequence AAUAAA that directs addition of several hundred adenine residues called the poly(A) tail to the end of the mRNA transcript. Poly(A) binding protein (PABP) binds to this tail, contributing to regulation of mRNA translation, stability, and export. For example, poly(A) tail bound PABP interacts with proteins associated with the 5' end of the transcript, causing a circularization of the mRNA that promotes translation.
The 3′-UTR can also contain sequences that attract proteins to associate the mRNA with the cytoskeleton, transport it to or from the cell nucleus, or perform other types of localization. In addition to sequences within the 3′-UTR, the physical characteristics of the region, including its length and secondary structure, contribute to translation regulation. These diverse mechanisms of gene regulation ensure that the correct genes are expressed in the correct cells at the appropriate times.

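The 3′-UTR (untranslated region) is a region of the mRNA, with a corresponding region in the DNA, that lies immediately downstream of the ORF and ends at the poly(A) tail. It is not translated, but it always accompanies the ORF and helps direct mRNA localization and influence mRNA stability and translation efficiency. For example, miRNAs can bind this region to inhibit translation or trigger degradation of the mRNA, thereby reducing gene expression; the region also contains silencer sequences that bind repressor proteins and block expression.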
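As a small illustration of a regulatory element inside the 3′-UTR, the sketch below takes everything after the stop codon of a toy mRNA and looks for the AAUAAA polyadenylation signal mentioned in the quoted definition. The sequence, the known CDS end position, and the helper name are assumptions; the poly(A) tail itself is left out of the toy transcript.

```python
def find_signal(utr: str, signal: str = "AAUAAA"):
    """Return every 0-based position where `signal` occurs in `utr`."""
    positions = []
    pos = utr.find(signal)
    while pos != -1:
        positions.append(pos)
        pos = utr.find(signal, pos + 1)  # allow overlapping matches
    return positions


# Toy mature mRNA (assumed): leader + ORF ending in UAA, then the 3'-UTR.
mrna = "GGAUGGCCUUUGGAUAA" + "GCUCCAAUAAAGUCUUG"
cds_end = 17  # position just after the stop codon, assumed known

utr = mrna[cds_end:]
print(utr)
print(find_signal(utr))
```

Real signal detection also has to deal with variant hexamers and their distance to the cleavage site, which this sketch does not attempt.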

6. 5′-UTR

The 5′ untranslated region (also known as 5′ UTR, leader sequence, transcript leader, or leader RNA) is the region of a messenger RNA (mRNA) that is directly upstream from the initiation codon. This region is important for the regulation of translation of a transcript by differing mechanisms in viruses, prokaryotes and eukaryotes. While called untranslated, the 5′ UTR or a portion of it is sometimes translated into a protein product. This product can then regulate the translation of the main coding sequence of the mRNA. In many organisms, however, the 5′ UTR is completely untranslated, instead forming a complex secondary structure to regulate translation.
The 5′ UTR has been found to interact with proteins relating to metabolism, and proteins translate sequences within the 5′ UTR. In addition, this region has been involved in transcription regulation, such as the sex-lethal gene in Drosophila.[1] Regulatory elements within 5′ UTRs have also been linked to mRNA export.[2]

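The 5′-UTR is defined analogously to the 3′-UTR, except that it lies upstream of the ORF, between the transcription start site and the start codon. It also matters for mRNA stability and translation. The best-known feature at this end is the 5′ cap, a modified nucleotide such as 7-methylguanosine added to the 5′ end of the transcript. The cap increases mRNA stability, assists export of the mRNA from the nucleus, and promotes translation initiation.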
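Putting the pieces together, a mature mRNA can be viewed as three consecutive regions: 5′-UTR, CDS, and 3′-UTR. The sketch below splits a toy transcript given the CDS boundaries. The `Regions` type, the `partition_mrna` function, and the sequence are hypothetical, and the cap and poly(A) tail are not modeled.

```python
from typing import NamedTuple


class Regions(NamedTuple):
    five_prime_utr: str
    cds: str
    three_prime_utr: str


def partition_mrna(mrna: str, cds_start: int, cds_end: int) -> Regions:
    """Split a mature mRNA at the CDS boundaries (0-based, end-exclusive)."""
    if not 0 <= cds_start < cds_end <= len(mrna):
        raise ValueError("CDS coordinates fall outside the transcript")
    if (cds_end - cds_start) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return Regions(mrna[:cds_start], mrna[cds_start:cds_end], mrna[cds_end:])


# Toy transcript (assumed): 2-base 5'-UTR, 15-base CDS, 3-base 3'-UTR.
regions = partition_mrna("GGAUGGCCUUUGGAUAAGCU", cds_start=2, cds_end=17)
print(regions)
```

Concatenating the three fields in order gives back the original transcript, which is a quick sanity check that no base is assigned to two regions.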

The English definitions quoted above are from Wikipedia.