[Annotation-2] ANNOVAR-2: Gene-based annotation

Gene-based annotation

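One of ANNOVAR's functions is gene-based annotation. It tells you where a variant falls relative to genes and how it affects protein coding, that is, whether a SNP or CNV changes the encoded protein and which amino acids are affected. It can flexibly use RefSeq genes, UCSC genes, ENSEMBL genes, GENCODE genes, AceView genes, or many other gene definition systems.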

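Downloading the database

Before annotating, you first need to download the database for the species in question. Taking human as an example, the command is: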

annotate_variation.pl -downdb -buildver hg19 -webfrom annovar refGene humandb/

The log printed during the download looks like this:

NOTICE: Web-based checking to see whether ANNOVAR new version is available ... Done
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refGene.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refLink.txt.gz ... OK
NOTICE: Downloading annotation database http://www.openbioinformatics.org/annovar/download/hg19_refGeneMrna.fa.gz ... OK
NOTICE: Uncompressing downloaded files
NOTICE: Finished downloading annotation files for hg19 build version, with files saved at the 'humandb' directory

Next, run the annotation. The input file is in .avinput format:

annotate_variation.pl -out ex1 -build hg19 example/ex1.avinput humandb/

The log printed during the run looks like this:

NOTICE: The --geneanno operation is set to ON by default
NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 48660 transcripts (including 10375 without coding sequence annotation) for 25588 unique genes
NOTICE: Reading FASTA sequences from humandb/hg19_refGeneMrna.fa ... Done with 14 sequences
WARNING: A total of 333 sequences will be ignored due to lack of correct ORF annotation
NOTICE: Finished gene-based annotation on 15 genetic variants in example/ex1.avinput
NOTICE: Output files were written to ex1.variant_function, ex1.exonic_variant_function

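Two files are written: ex1.variant_function and ex1.exonic_variant_function.
To change the output file name prefix, use the -outfile argument (abbreviated to -out in the command above).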

variant_function

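This file prepends two columns to each line of the input file: the first gives the region the variant falls in relative to genes (exon, intron, intergenic region, and so on); the second gives the corresponding gene.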

cat ex1.variant_function 
UTR5 ISG15(NM_005101:c.-33T>C) 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
UTR3 ATAD3C(NM_001039211:c.*91G>T) 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
splicing NPHP4(NM_001291593:exon19:c.1279-2T>A,NM_001291594:exon18:c.1282-2T>A,NM_015102:exon22:c.2818-2T>A) 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
intronic DDR2 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays

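The first column tells whether the variant hits an exon, an intergenic region, an intron, or a non-coding RNA gene. If the variant is exonic / intronic / ncRNA, the second column gives the gene name (separated by commas if several genes are hit); otherwise, the second column gives the two neighboring genes and the distances to them.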
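As a rough illustration of the column layout described above, here is a minimal Python sketch that splits one line of a .variant_function file into named fields. It assumes the file on disk is tab-delimited (the listing above shows it with spaces) and that the five avinput columns follow the two new columns; the function name `parse_variant_function_line` is made up for this example and is not part of ANNOVAR.

```python
def parse_variant_function_line(line):
    """Split one tab-delimited .variant_function line into named parts.

    Assumed layout: region, gene field, then the avinput columns
    (chrom, start, end, ref, alt) and any trailing columns such as comments.
    """
    fields = line.rstrip("\n").split("\t")
    region, gene_field = fields[0], fields[1]
    chrom, start, end, ref, alt = fields[2:7]
    return {
        "region": region,
        "genes": gene_field,
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "ref": ref,
        "alt": alt,
        "extra": fields[7:],  # everything after the five avinput columns
    }


# One row from the example output, rewritten with tabs between columns.
example = (
    "intronic\tDDR2\t1\t162736463\t162736463\tC\tT\t"
    "comments: rs1000050, a SNP in Illumina SNP arrays"
)
record = parse_variant_function_line(example)
print(record["region"], record["genes"], record["chrom"], record["start"])
```

The gene field is kept as a raw string here because its content depends on the region: a plain gene name, a name with transcript details in parentheses, or neighboring genes with distances.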

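ANNOVAR divides the genome into 9 types of region:

exonic, splicing, ncRNA, UTR5, UTR3, intronic, upstream, downstream, intergenic

exonic refers specifically to protein-coding exonic regions;
UTR5 and UTR3 refer to exonic regions that are not translated into protein (the 5' UTR and 3' UTR);
splicing refers to regions at intron boundaries (within 2 bp by default);
ncRNA refers to regions of genes that do not code for protein;
intronic refers to introns;
upstream refers to the region within 1 kb upstream of the transcription start site;
downstream refers to the region within 1 kb downstream of the transcription end site;
intergenic refers to the regions between genes.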


When deciding which region a variant falls in, these 9 regions take different precedence:

exonic = splicing > ncRNA > UTR5/UTR3 > intron > upstream/downstream > intergenic
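To make the precedence rule concrete, the sketch below picks the highest-precedence label(s) from a list of candidate regions. It is only an illustration of the ordering above, not ANNOVAR's actual code; the names `PRECEDENCE` and `top_regions` are invented for this example, and the label "intronic" is used for the intron class to match the values that appear in the output file.

```python
# Rank taken from the ordering above: a lower number means higher precedence.
# Regions joined by "=" or "/" in the ordering share the same rank.
PRECEDENCE = {
    "exonic": 0, "splicing": 0,
    "ncRNA": 1,
    "UTR5": 2, "UTR3": 2,
    "intronic": 3,
    "upstream": 4, "downstream": 4,
    "intergenic": 5,
}


def top_regions(candidates):
    """Return the candidate regions that share the highest precedence."""
    best = min(PRECEDENCE[region] for region in candidates)
    return [region for region in candidates if PRECEDENCE[region] == best]


# A variant that is intronic in one transcript and in the 3' UTR of another.
print(top_regions(["intronic", "UTR3"]))
# A variant that ties between two classes of equal precedence.
print(top_regions(["exonic", "splicing", "intergenic"]))
```

Returning a list rather than a single label keeps ties visible, since classes of equal precedence can both apply to the same variant.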

exonic_variant_function

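This file covers only variants located in exonic regions and gives the corresponding amino acid changes. Three columns are added to the input columns: the first is the line number of the variant in the input file, the second is the variant type, and the third describes the amino acid change. (The sample below is shown without the leading line-number column.)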


nonsynonymous SNV IL23R:NM_144701:exon9:c.G1142A:p.R381Q, 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
nonsynonymous SNV ATG16L1:NM_001190267:exon9:c.A550G:p.T184A,ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_001190266:exon9:c.A646G:p.T216A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_198890:exon5:c.A409G:p.T137A, 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
nonsynonymous SNV NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W, 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
nonsynonymous SNV NOD2:NM_022162:exon8:c.G2722C:p.G908R,NOD2:NM_001293557:exon7:c.G2641C:p.G881R, 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
frameshift insertion NOD2:NM_022162:exon11:c.3017dupC:p.A1006fs,NOD2:NM_001293557:exon10:c.2936dupC:p.A979fs, 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
frameshift deletion GJB2:NM_004004:exon2:c.35delG:p.G12fs, 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
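The amino acid change column packs one entry per transcript, separated by commas, and each entry is a colon-separated string. A minimal sketch of unpacking it in Python follows. It assumes every entry has exactly the five parts seen in the rows above (gene, transcript, exon, cDNA change, protein change); entries with a different shape would need extra handling. The function name `parse_exonic_annotation` is hypothetical.

```python
def parse_exonic_annotation(annotation):
    """Split the amino acid change column into one dict per transcript.

    Assumes each comma-separated entry looks like
    GENE:TRANSCRIPT:EXON:c.CHANGE:p.CHANGE, as in the rows shown above.
    """
    records = []
    for entry in annotation.split(","):
        if not entry:  # the column ends with a trailing comma
            continue
        gene, transcript, exon, cdna, protein = entry.split(":")
        records.append({
            "gene": gene,
            "transcript": transcript,
            "exon": exon,
            "cdna": cdna,
            "protein": protein,
        })
    return records


# Annotation string copied from the GJB2 row above.
gjb2 = parse_exonic_annotation("GJB2:NM_004004:exon2:c.35delG:p.G12fs,")
print(gjb2[0]["gene"], gjb2[0]["transcript"], gjb2[0]["protein"])

# Annotation string copied from the IL23R row above.
il23r = parse_exonic_annotation("IL23R:NM_144701:exon9:c.G1142A:p.R381Q,")
print(il23r[0]["cdna"], il23r[0]["protein"])
```

Keeping one record per transcript matters because the same variant can land at different cDNA and protein positions in different transcripts, as the ATG16L1 row shows.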

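ANNOVAR reports the following variant types:

frameshift insertion, frameshift deletion, frameshift block substitution, stopgain, stoploss, nonframeshift insertion, nonframeshift deletion, nonframeshift block substitution, nonsynonymous SNV, synonymous SNV, unknown

Variant types are defined by starting from 4 basic kinds of variant (SNV, insertion, deletion, block substitution) and combining them with their effect on the encoded protein. For an SNV, one that changes the amino acid is a nonsynonymous SNV, and one that leaves the protein unchanged is a synonymous SNV.
For the other 3 basic kinds, the effect on the protein is split into frameshift and nonframeshift. stopgain means the mutation turns an ordinary codon into a stop codon; stoploss means the mutation turns the original stop codon into an ordinary codon. Either way, translation changes substantially. unknown means the effect of the variant on the protein is not known.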
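The frameshift / nonframeshift split for insertions, deletions, and block substitutions comes down to whether the net length change is a multiple of 3. The sketch below shows that arithmetic on avinput-style ref/alt alleles, where "-" stands for an empty allele. It is a simplification under stated assumptions: it assumes the whole variant lies inside a coding exon, it does not detect stopgain or stoploss, and it cannot tell synonymous from nonsynonymous SNVs, since those need the coding sequence. The function name `basic_variant_class` is invented for this example.

```python
def basic_variant_class(ref, alt):
    """Classify a variant from its ref/alt alleles alone (simplified).

    "-" denotes an empty allele, as in the avinput rows above.
    """
    ref_len = 0 if ref == "-" else len(ref)
    alt_len = 0 if alt == "-" else len(alt)
    if ref_len == 0 and alt_len == 0:
        raise ValueError("ref and alt cannot both be empty")

    if ref_len == 1 and alt_len == 1:
        return "SNV"  # synonymous vs nonsynonymous needs the codon context
    if ref_len == 0:
        kind = "insertion"
    elif alt_len == 0:
        kind = "deletion"
    else:
        kind = "block substitution"

    # A net length change divisible by 3 keeps the reading frame.
    in_frame = (alt_len - ref_len) % 3 == 0
    return ("nonframeshift " if in_frame else "frameshift ") + kind


print(basic_variant_class("G", "-"))    # alleles of the GJB2 row above
print(basic_variant_class("-", "ATG"))  # a made-up 3-bp insertion
print(basic_variant_class("G", "A"))    # alleles of the IL23R row above
```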



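The ANNOVAR package now includes the file hg19_refGeneWithVer.txt as an example of how to annotate variants with versioned refGene. Users can pass -dbtype refGeneWithVer instead of -dbtype refGene, and the results will then contain transcript identifiers with version numbers. For all other genome builds, users need to generate these files themselves.

ANNOVAR also supports annotation with other databases, and for non-human species you can build your own database; see the Gene-based annotation page of the ANNOVAR documentation.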

Please credit the source when reposting.
Author: oddxix
WeChat official account: oddxix
