Cancer基因组研究常见三格式

FASTQ, BAM, VCF

FASTQ Format Specification

Introduction

FASTQ format stores sequences and Phred qualities in a single file. It is concise and compact. FASTQ is first widely used in the Sanger Institute and therefore we usually take the Sanger specification and the standard FASTQ format, or simply FASTQ format. Although Solexa/Illumina read file looks pretty much like FASTQ, they are different in that the qualities are scaled differently. In the quality string, if you can see a character with its ASCII code higher than 90, probably your file is in the Solexa/Illumina format.

FASTQ是基于文本的,保存生物序列(通常是核酸序列)和其测序质量信息的标准格式。其序列以及质量信息都是使用一个ASCII字符标示,最初由Sanger开发,目的是将FASTA序列与质量数据放到一起,目前已经成为高通量测序结果的事实标准。格式说明FASTQ文件中每个序列通常有四行:序列标识以及相关的描述信息,以‘@’开头;第二行是序列第三行以‘+’开头,后面是序列标示符、描述信息,或者什么也不加第四行,是质量信息,和第二行的序列相对应,每一个序列都有一个质量评分,根据评分体系的不同,每个字符的含义表示的数字也不相同。

BAM

当测序得到的fastq文件map到基因组之后,我们通常会得到一个sam或者bam为扩展名的文件。SAM的全称是sequence alignment/map format。而BAM就是SAM的二进制文件(B取自binary)。 那么SAM文件的格式是什么样子的呢?如果你想真实地了解SAM文件,可以查看它的说明文档。SAM由头文件和map结果组成。头文件由一行行以@起始的注释构成。而map结果是类似下面的东西:

HWI-ST1001:137:C12FPACXX:7:1115:14131:66670 0 chr1 12805 1 42M4I5M * 0 0 TTGGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCACCAATATG CCCFFFFFHHGHHJJJJJHJJJJJJJJJJJJJJJJIJJJJJJJJJJJJIJJ AS:i:-28 XN:i:0 XM:i:2 XO:i:1XG:i:4 NM:i:6 MD:Z:2C41C2 YT:Z:UU NH:i:3 CC:Z:chr15 CP:i:102518319 XS:A:+ HI:i:0
HWI-ST1001:137:C12FPACXX:7:2313:17391:30032 272 chr1 13494 1 51M * 0 0 ACTGCCTGGCGCTGTGCCCTTCCTTTGCTCTGCCCGCTGGAGACAGTGTTT CFFFFHHJJJJIJJJJIJJJJJJJJJJJJJJJJJJJJJHHHHFA+FFFC@B AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0NM:i:1 MD:Z:44G6 YT:Z:UU XS:A:+ NH:i:3 CC:Z:chr15 CP:i:102517626 HI:i:0
HWI-ST1001:137:C12FPACXX:7:1109:17518:53305 16 chr1 13528 1 51M * 0 0 CGCTGGAGCCGGTGTTTGTCATGGGCCTGGGCTGCAGGGATCCTGCTACAA #############AB=?:*B?;A?<2+233++;A+A2+<7==@7,A AS:i:-5 XN:i:0 XM:i:2 XO:i:0 XG:i:0NM:i:2 MD:Z:8A21T20 YT:Z:UU XS:A:+ NH:i:4 CC:Z:chr15 CP:i:102517592 HI:i:0

看上去很类似fastq文件,它也有read名称,序列,质量等信息,但是又不完全一样。首先,每个read只占一行,只是它被tab分成了很多列,一共有12列,分别记录了:1. read名称2. SAM标记3. chromosome4. 5′端起始位置5. MAPQ(mapping quality,描述比对的质量,数字越大,特异性越高)6. CIGAR字串,记录插入,删除,错配以及splice junctions(后剪切拼接的接头)7. mate名称,记录mate pair信息8. mate的位置9. 模板的长度10. read序列11. read质量12. 程序用标记
显然,其中chromosome至CIGAR的信息都是非常重要的。但是这些对我们不重要,我们只需要了解SAM/BAM文件是什么,就可以了。重要的是如果进行下游的操作。 要操作SAM/BAM文件,首先需要安装samtools。它的安装过程和所有的linux/unix程序一样,都是经过make之后生成可执行程序,然后把它的路径告知系统,或者放在系统可以找到的位置就可以了。

TCGA Variant Call Format (VCF) Specification

Variant Call Format (VCF) is a format for storing and reporting genomic sequence variations. VCF files are modular where the annotations and genotype information for a variant are separated from the call itself. As of May 2011, VCF version 4.1 (described here) is the most recent release. GSCs will generate sequence variation data using high-throughput sequencing technologies and resulting variations will be submitted to DCC as VCF files. TCGA has adopted VCF 4.1 with certain modifications to support supplemental information specific to the project. Subsequent sections describe the format TCGA VCF files should follow and validation steps that would have to be implemented at the DCC.

Any VCF file contains two main sections. The HEADER section contains meta-information for variant records that are reported as individual rows in the BODY of the VCF file. Both sections are described below.

你可能感兴趣的:(Cancer基因组研究常见三格式)