Part1: Downloading the gnomAD LOEUF data

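The previous post walked through the gnomAD flagship paper. That paper introduces LOEUF (loss-of-function observed/expected upper bound fraction), a new measure of how well each gene tolerates pLoF variants, and I expect many of you are keen to start using it.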

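The results of this model can be found in the paper's supplementary material (the file supplementary_dataset_11_full_constraint_metrics.tsv.gz). They can also be downloaded from the gnomAD website (you may need a VPN to reach it): click Downloads in the upper right corner, then follow the Constraint link under the gnomad.v2.1.1 releases. The site provides three files: ① a gene list table with only the canonical transcript of each gene, ② a transcript list table covering multiple transcripts per gene (same content as the paper's supplementary table), and ③ o/e values computed after downsampling by population.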
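Once a file is downloaded, it is worth a quick sanity check before going further. The sketch below uses only the Python standard library and assumes the file is tab-separated with one header line; the function names are my own, not part of any gnomAD tooling.

```python
import csv
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt", encoding="utf-8", newline="")
    return open(path, "rt", encoding="utf-8", newline="")


def summarize_constraint_table(path):
    """Return (number of data rows, column names, number of distinct genes).

    Assumes a tab-separated file with a header line that includes a
    'gene' column, as in the constraint tables described in this post.
    """
    genes = set()
    n_rows = 0
    with open_text(path) as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            n_rows += 1
            genes.add(row["gene"])
        columns = list(reader.fieldnames or [])
    return n_rows, columns, len(genes)
```

Calling summarize_constraint_table("supplementary_dataset_11_full_constraint_metrics.tsv.gz") returns the row count, the header names and the number of distinct genes, which can be compared with what you expect from the paper.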

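Part2: What the table columns mean

Take full_constraint_metrics, i.e. lof_metrics.by_transcript, as the example.
The file has 80,950 rows, covering the canonical transcript and other common transcripts of 19,600+ genes, and 78 columns. Each column header is explained below (see page 74 of the supplementary information document for details):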

gene: Gene name.

transcript: Ensembl transcript ID (Gencode v19).

canonical: Boolean indicator as to whether the transcript is the canonical transcript for the gene.

obs_XXX: Number of observed XXX variants in transcript (XXX = mis for missense, syn for synonymous, lof for loss of function).

exp_XXX: Number of expected XXX variants in transcript.

oe_XXX: Observed over expected ratio for XXX variants in transcript (obs_XXX divided by exp_XXX).

mu_XXX: Mutation rate summed across all possible XXX variants in transcript.

possible_XXX: Number of possible XXX variants in transcript (to be honest, I do not fully understand this one).

obs_XXX_pphen: Number of observed XXX variants in transcript predicted "probably damaging" by PolyPhen-2.

exp_XXX_pphen: Number of expected XXX variants in transcript predicted "probably damaging" by PolyPhen-2.

oe_XXX_pphen: Observed over expected ratio for PolyPhen-2 predicted "probably damaging" XXX variants in transcript (obs_XXX_pphen divided by exp_XXX_pphen; the source definition is written for missense, i.e. obs_mis_pphen divided by exp_mis_pphen).

possible_XXX_pphen: Number of possible XXX variants in transcript that are predicted "probably damaging" by PolyPhen-2; the source definition is written for missense variants (I do not fully understand this one either).

oe_XXX_lower: Lower bound of 90% confidence interval for o/e ratio for XXX variants.

oe_XXX_upper: Upper bound of 90% confidence interval for o/e ratio for XXX variants.

XXX_z: Z score for XXX variants in gene. Higher (more positive) Z scores indicate that the transcript is more intolerant of variation (more constrained). Extreme values of XXX_z indicate likely data quality issues.

pLI: Probability of loss-of-function intolerance; probability that transcript falls into distribution of haploinsufficient genes (~9% o/e pLoF ratio; computed from gnomAD data).

pRec: Probability that transcript falls into distribution of recessive genes (~46% o/e pLoF ratio; computed from gnomAD data).

pNull: Probability that transcript falls into distribution of unconstrained genes (~100% o/e pLoF ratio; computed from gnomAD data).

oe_lof_upper_rank: Transcript’s rank of LOEUF value compared to all transcripts (lower values indicate more constrained).

oe_lof_upper_bin: Decile bin of LOEUF for given transcript (lower values indicate more constrained).

(The two columns above are the main indicators of where a transcript sits by LOEUF: its rank and its decile bin.)

oe_lof_upper_bin_6: Sextile bin of LOEUF for given transcript (lower values indicate more constrained).

n_sites: Number of distinct pLoF variant sites in the transcript.

classic_caf: Sum of allele frequencies of pLoFs in the transcript.

max_af: Maximum allele frequency of any pLoF in the transcript.

no_lofs: The number of individuals with no observed pLoF variants in the transcript.

obs_het_lof: The number of individuals with at least one observed heterozygous pLoF variant, but no homozygous pLoF variants, in the transcript.

obs_hom_lof: The number of individuals with at least one observed homozygous pLoF in the transcript.

defined: The number of individuals where at least one high-quality genotype (including homozygous reference) is observed at a called site annotated as a pLoF variant.

p: The estimated proportion of haplotypes with a pLoF variant. Defined as: 1 - sqrt(no_lofs / defined).

exp_hom_lof: The expected number of individuals with at least one homozygous pLoF variant based on the frequency of pLoF haplotypes. Defined as: defined * p^2.

classic_caf_POP: Sum of allele frequencies of pLoFs in the transcript among POP individuals.

p_POP: The computation of `p` repeated among only POP individuals.

transcript_type: Transcript biotype (https://www.gencodegenes.org/pages/biotypes.html).

gene_id: Ensembl gene ID.

transcript_level: Transcript level from Gencode (https://www.gencodegenes.org/pages/data_format.html).

cds_length: Length of coding sequence in gene.

num_coding_exons: Number of coding exons in gene.

gene_type: Gene biotype (https://www.gencodegenes.org/pages/biotypes.html).

gene_length: Length of gene.

exac_pLI: pLI score calculated from ExAC.

exac_obs_lof: Number of observed pLoF variants in gene in ExAC.

exac_exp_lof: Number of expected pLoF variants in gene in ExAC.

exac_oe_lof: Observed to expected ratio of pLoF variants in ExAC.

brain_expression: Expression of gene in brain from GTEx data.

chromosome: Chromosome name.

start_position: Start position of gene.

end_position: End position of gene.
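A few of the columns above are defined by explicit formulas (oe_XXX, p and exp_hom_lof), and writing them out helps make them concrete. The sketch below just restates those definitions in Python; the function names are my own, and it is not gnomAD's code. The formula for p reads like a Hardy-Weinberg argument (if a fraction p of haplotypes carries a pLoF, the fraction of individuals with none is (1 - p)^2), though that interpretation is mine.

```python
import math


def oe_ratio(obs, exp):
    """oe_XXX: obs_XXX divided by exp_XXX."""
    if exp <= 0:
        raise ValueError("expected count must be positive")
    return obs / exp


def plof_haplotype_fraction(no_lofs, defined):
    """p: 1 - sqrt(no_lofs / defined)."""
    if defined <= 0 or not 0 <= no_lofs <= defined:
        raise ValueError("need 0 <= no_lofs <= defined and defined > 0")
    return 1.0 - math.sqrt(no_lofs / defined)


def expected_hom_lof(no_lofs, defined):
    """exp_hom_lof: defined * p^2, with p as defined above."""
    p = plof_haplotype_fraction(no_lofs, defined)
    return defined * p ** 2
```

With made-up numbers: if defined = 100 and no_lofs = 81, then sqrt(81 / 100) = 0.9, so p is about 0.1 and exp_hom_lof is about 100 * 0.1^2 = 1 individual.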

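Part3: Annotating with ANNOVAR

Building the annotation database file:

Based on my reading of the headers, I chose "gnomad.v2.1.1.lof_metrics.by_gene.txt", which by default contains only the canonical=TRUE transcripts, as the source data for the database;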

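For convenience, I moved the last 3 columns (chromosome, start_position, end_position) to the front, then picked the columns that I personally consider likely to matter for my disease analyses:

gene, gene_id, transcript, oe_lof_upper_bin, oe_lof_upper_rank, pLI, pRec, pNull, transcript_level

This gives a 12-column file (the 3 position columns plus the 9 above), which I named "hg19_LOEUF.txt". If the data set is not large, building an ANNOVAR index for this file is optional; if you want one, the indexing script index_annovar.pl can be obtained by emailing Prof. Kai Wang.

Also note that an ANNOVAR region-based annotation database must not have a header line, and all the annotation fields end up merged into a single column that you have to split yourself afterwards, so make sure to keep a separate record of the column order above.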
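The file preparation described above can be sketched in Python as follows. This is only one way to do it, under a few assumptions of mine: the source table is tab-separated with a header line, no rows are filtered out, the output is tab-separated, and the column order is saved to a small side file (columns_path is a hypothetical name, not something ANNOVAR reads).

```python
import csv

POSITION_COLUMNS = ["chromosome", "start_position", "end_position"]
ANNOTATION_COLUMNS = [
    "gene", "gene_id", "transcript", "oe_lof_upper_bin",
    "oe_lof_upper_rank", "pLI", "pRec", "pNull", "transcript_level",
]
OUTPUT_COLUMNS = POSITION_COLUMNS + ANNOTATION_COLUMNS  # 12 columns in total


def build_region_db(source_path, db_path, columns_path):
    """Write the selected columns, position first and without a header.

    Returns the number of data rows written. The column order is saved
    to columns_path so the merged annotation can be split again later.
    """
    n_rows = 0
    with open(source_path, newline="", encoding="utf-8") as src, \
            open(db_path, "w", encoding="utf-8") as dst:
        reader = csv.DictReader(src, delimiter="\t",
                                quoting=csv.QUOTE_NONE, restval="")
        missing = [c for c in OUTPUT_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            raise KeyError("missing columns in source table: %s" % missing)
        for row in reader:
            dst.write("\t".join(row[c] for c in OUTPUT_COLUMNS) + "\n")
            n_rows += 1
    with open(columns_path, "w", encoding="utf-8") as cols:
        cols.write("\n".join(OUTPUT_COLUMNS) + "\n")
    return n_rows
```

For example, build_region_db("gnomad.v2.1.1.lof_metrics.by_gene.txt", "hg19_LOEUF.txt", "hg19_LOEUF.columns.txt") would produce the header-free database file plus a record of the column order.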

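Modifying the ANNOVAR annotation script:

Thanks to my colleague TSY for the earlier experiments and for sharing the experience. Add the following elsif branch at line 3084 of annotate_variation.pl:

Then run the annotation: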

$annovar/convert2annovar.pl -format vcf4 sample.vcf > sample.avinput

$annovar/table_annovar.pl sample.avinput $humandb --buildver hg19 -out sample_anno_LOEUF -remove -protocol refGene,ensGene,LOEUF -operation g,g,r --nastring . -csvout

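refGene and ensGene are annotated in the same run so that I can check whether the gene names and IDs in the LOEUF annotation all match theirs.

The result can then be split into separate columns as needed. If you are also interested in other columns of the table, you can adapt the steps above to annotate them.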

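To my Wang Lab colleagues:

ANNOVAR is best suited to annotating variant sites. This region-based annotation only covers the regions inside these genes, such as coding and intronic regions, so if your SNPs fall in intergenic regions they probably cannot be annotated directly. In that case, you can work from the original table and match it directly against your genes of interest, looking mainly at the "oe_lof_upper_bin" column.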
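Matching genes of interest against the original table can be done without ANNOVAR at all. Here is a minimal sketch, assuming the table is tab-separated with the "gene", "transcript" and "oe_lof_upper_bin" headers described in Part2, and that the canonical column (if present) holds "true"/"false"; the function name and these assumptions are mine.

```python
import csv


def lookup_loeuf_bins(table_path, genes, canonical_only=True):
    """Map each requested gene to a list of (transcript, oe_lof_upper_bin).

    Values are returned as the raw strings found in the table. Genes
    that are absent from the table are simply missing from the result.
    """
    wanted = set(genes)
    hits = {}
    with open(table_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            if row["gene"] not in wanted:
                continue
            # Tables without a canonical column are kept as they are.
            is_canonical = (row.get("canonical") or "true").lower() == "true"
            if canonical_only and not is_canonical:
                continue
            hits.setdefault(row["gene"], []).append(
                (row["transcript"], row["oe_lof_upper_bin"])
            )
    return hits
```

Lower oe_lof_upper_bin values indicate more constrained transcripts, so the genes that come back with the lowest bins are the ones to look at first.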

gnomAD LOEUF data: download and annotation with ANNOVAR