Basic concepts: read, contig, scaffold, N50

English content taken from:
https://bioinformaticsworkbook.org/introduction/dataTerminology.html

Learning Objective

  • base/nucleotide
  • read
  • contig
  • scaffold
  • chromosome

What is a base?

There are four common bases in a DNA sequence: adenine (A), guanine (G), cytosine (C), and thymine (T). In RNA, uracil (U) is found in place of thymine.

[Image: nucleotide bases]

Image taken from Wikipedia, where more information about nucleotides can also be found.

What is a read?

A read is a string of bases represented by their one-letter codes. Here is an example of a read that is 70 bases long:
TTAACCTTGGTTTTGAACTTGAACACTTAGGGGATTGAAGATTCAACAACCCTAAAGCTTGGGGTAAAAC
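
As a quick check, the length and base composition of the example read above can be computed in a few lines of Python. This is only a sketch: the variable names are illustrative and not part of the original tutorial, and accepting N as an "unknown base" symbol is an assumption.

```python
from collections import Counter

# The example read quoted in the text above.
read = "TTAACCTTGGTTTTGAACTTGAACACTTAGGGGATTGAAGATTCAACAACCCTAAAGCTTGGGGTAAAAC"

# A read is just a string, so its length is the number of bases.
read_length = len(read)

# Count how often each one-letter code occurs.
base_counts = Counter(read)

# Assumption: only A, C, G, T (plus N for an unknown base) are valid symbols.
is_valid_dna = set(read) <= set("ACGTN")

print("length:", read_length)
print("composition:", dict(base_counts))
print("valid DNA alphabet:", is_valid_dna)
```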

What is a contig?

A contig is the contiguous consensus sequence generated by aligning overlapping reads to one another.

[Image: aligned reads with their consensus sequence (contig)]

The last line is the consensus of the aligned reads. We call this consensus sequence a contig.
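
The idea of taking a consensus over aligned reads can be sketched with a simple per-column majority vote. This is a toy illustration, not how a real assembler works (real assemblers first have to find the overlaps, typically with overlap or de Bruijn graphs). The aligned reads below are made up for the example, `-` is assumed to mark positions a read does not cover, and the function name `consensus` is hypothetical.

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over equal-length aligned rows.

    '-' marks a position the read does not cover; a column with
    no coverage at all is reported as 'N' (unknown base).
    """
    if len({len(row) for row in aligned_reads}) != 1:
        raise ValueError("aligned reads must all have the same length")
    result = []
    for column in zip(*aligned_reads):
        bases = [b for b in column if b != "-"]
        if not bases:
            result.append("N")
        else:
            # most_common(1) returns the most frequent base in this column
            result.append(Counter(bases).most_common(1)[0][0])
    return "".join(result)

# Hypothetical reads, already aligned; the second read carries one
# sequencing error (A instead of T) that the other two reads outvote.
aligned = [
    "TTAACCTTGG------",
    "---ACCTAGGTTT---",
    "------TTGGTTTTGA",
]
contig = consensus(aligned)
print(contig)
```

The erroneous base in the second read disappears from the consensus because two of the three reads covering that column agree.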

What is a scaffold?

A scaffold is a set of contigs that have been ordered and oriented using mate-pair or other long-range information.

contigNNNNNNNNNNNNgitnocNNNNNNNNcontigNNNNNNNNcontigNNNNgitnoc

In the line above:

  • contig is a string of bases (A, T, C, or G)
  • N is an unknown base
  • gitnoc is the word contig written backwards to represent the reverse complement of a contig
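
The schematic above can be mimicked in code: each contig is placed in forward or reverse-complement orientation, and gaps are filled with N. This is a minimal sketch; the contig sequences, gap sizes, and the helper names `reverse_complement` and `build_scaffold` are all made up for illustration.

```python
# Translation table mapping each base to its complement (N stays N).
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

def build_scaffold(placements):
    """Join contigs into one scaffold string.

    placements: list of (contig_sequence, is_reversed, gap_after),
    where gap_after is the number of unknown bases (N) that follow.
    """
    parts = []
    for seq, is_reversed, gap_after in placements:
        parts.append(reverse_complement(seq) if is_reversed else seq)
        parts.append("N" * gap_after)
    return "".join(parts)

# Hypothetical contigs: forward, reversed, forward.
scaffold = build_scaffold([
    ("ATCG", False, 3),
    ("GGTA", True, 2),
    ("CCAT", False, 0),
])
print(scaffold)
```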

Some supplementary notes gathered from other articles (figures would make it even better):

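contig/scaffold and N50/N90
When sequencing reads are assembled, a sequence that can be joined completely, with no gaps in between, is a contig. A sequence that contains gaps whose lengths are known is called a scaffold.
contig N50 and scaffold N50
Sort the contigs or scaffolds from longest to shortest and add up their lengths in that order. When the running total reaches 50% of the genome size (here, the total length of all contigs or scaffolds), the length of the contig/scaffold just added is the contig/scaffold N50. A larger N50 generally indicates a more contiguous, higher-quality genome assembly. N90 is defined in the same way: it is the length of the contig/scaffold at which the running total reaches 90% of the genome size.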
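Author: wo_monic
Link: https://www.jianshu.com/p/9876964e3d20
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.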
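
The N50/N90 definition above translates directly into a short function. The sketch below is one straightforward reading of that definition, with the assembly's own total length used as the "genome size"; the function name `nx` and the contig lengths are invented for the example.

```python
def nx(lengths, percent=50):
    """Return the Nx value (N50 for percent=50, N90 for percent=90).

    Lengths are sorted from longest to shortest and summed; the result
    is the length at which the running total first reaches `percent`
    of the total length.
    """
    if not lengths:
        raise ValueError("need at least one contig/scaffold length")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= total * percent:
            return length

# Hypothetical contig lengths (total = 300).
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nx(contig_lengths, 50)   # running total 80, 150 -> reaches 150
n90 = nx(contig_lengths, 90)   # running total ... 240, 270 -> reaches 270
print("N50:", n50, "N90:", n90)
```

Because more of the short sequences have to be included before 90% is reached, N90 can never be larger than N50 for the same assembly.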

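Genome assembly is generally described at three levels: contig, scaffold, and chromosome. A contig is a consensus sequence found among the short reads produced by large-scale sequencing. The first step of assembly is to build contigs from short-insert (paired-end) libraries. Next, using large-insert (mate-pair) libraries of different sizes, the previously isolated contigs are joined in order. During this step contig orientations are adjusted, and gaps (represented by N) may remain between contigs. The result is a set of scaffolds, also known as supercontigs or metacontigs. Finally, the scaffolds are merged and adjusted using a genetic map or an optical map to produce a chromosome-level assembly.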
https://zhuanlan.zhihu.com/p/38317398

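What is a scaffold? In de novo genome sequencing, after contigs have been obtained by assembling reads, it is usually also necessary to build a 454 Paired-end library or an Illumina Mate-pair library in order to obtain the sequences at both ends of fragments of a given size (e.g. 3 kb, 6 kb, 10 kb, 20 kb). These paired sequences make it possible to determine the relative order of some contigs, and contigs whose order is known in this way make up a scaffold.

Contig N50: Assembling reads yields contigs of different lengths. Adding up the lengths of all contigs gives the total contig length. Sort all contigs from longest to shortest, for example Contig 1, Contig 2, Contig 3, ..., Contig 25, and add their lengths in that order. When the running sum reaches half of the total contig length, the length of the last contig added is the contig N50. For example, if Contig 1 + Contig 2 + Contig 3 + Contig 4 = 1/2 of the total contig length, then the length of Contig 4 is the contig N50. Contig N50 can be used as one criterion for judging the quality of a genome assembly.

Scaffold N50: Scaffold N50 is defined in the same way as contig N50. Joining contigs yields scaffolds of different lengths. Adding up the lengths of all scaffolds gives the total scaffold length. Sort all scaffolds from longest to shortest, for example Scaffold 1, Scaffold 2, Scaffold 3, ..., Scaffold 25, and add their lengths in that order. When the running sum reaches half of the total scaffold length, the length of the last scaffold added is the scaffold N50. For example, if Scaffold 1 + Scaffold 2 + Scaffold 3 + Scaffold 4 + Scaffold 5 = 1/2 of the total scaffold length, then the length of Scaffold 5 is the scaffold N50. Scaffold N50 can likewise be used as one criterion for judging the quality of a genome assembly.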
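Author: 白羊铁蛋
Link: https://www.jianshu.com/p/117441ac6eb8
Copyright belongs to the author. For commercial reuse, contact the author for permission; for non-commercial reuse, credit the source.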
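
To see why contig N50 and scaffold N50 differ for the same assembly, the sketch below splits a few toy scaffolds at their N gaps and computes both values. The sequences are invented, the helper names `n50` and `split_into_contigs` are hypothetical, and counting the N bases toward scaffold length is an assumption of this example.

```python
import re

def n50(lengths):
    """Length at which the sorted running sum first reaches half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no lengths given")

def split_into_contigs(scaffold):
    """Cut a scaffold at every run of N and drop empty pieces."""
    return [piece for piece in re.split(r"N+", scaffold) if piece]

# Hypothetical scaffolds; runs of N are gaps between contigs.
scaffolds = [
    "ACGTACGTAC" + "NNNN" + "GGCCTA",
    "TTGACA" + "NN" + "CATG",
    "AACCGG",
]

# Assumption: scaffold length includes the gap (N) bases.
scaffold_lengths = [len(s) for s in scaffolds]
contig_lengths = [len(c) for s in scaffolds for c in split_into_contigs(s)]

print("scaffold N50:", n50(scaffold_lengths))
print("contig N50:", n50(contig_lengths))
```

Since scaffolds chain several contigs together, the scaffold N50 of an assembly is at least as large as its contig N50.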

What is a chromosome?

Chromosomes are the largest DNA molecules in a cell.
Scaffolds can be ordered and oriented using a genetic map or Hi-C data into linkage groups or chromosomes.
The ultimate goal of a genome assembly project is to assemble reads into phased chromosomes that represent an actual individual.
Most chromosomal assemblies produced today are not phased or may represent multiple individuals.
