
Data preparation


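  1. Genome sequence information: a .fasta file holding the genome sequence, plus the corresponding protein sequences, also in a .fasta file. Well-annotated genomes generally provide these files.
  2. Gene structure annotation: a .gff3 or .gtf file storing the coordinates of features such as introns, exons, CDS and genes.
  3. The hidden Markov model (HMM) of the gene family of interest, as an .hmm file.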
HMMER3.1 manual:http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf

hmmbuild, hmmsearch, hmmscan and hmmalign are the main HMMER tools used for protein domain analysis and annotation.

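When identifying a gene family, the tool used most often is hmmsearch, which offers three commonly used thresholding options. We generally search with --cut_tc, which filters hits using the trusted cutoff (TC) value stored in the HMM file provided by Pfam and is therefore relatively reliable.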

  --cut_ga : use profile's GA gathering cutoffs to set all thresholding
  --cut_nc : use profile's NC noise cutoffs to set all thresholding
  --cut_tc : use profile's TC trusted cutoffs to set all thresholding
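
The --cut_* options only work when the profile actually carries the corresponding thresholds, which appear as GA, TC and NC lines in the header of a Pfam HMM file. A minimal sketch for checking this before running hmmsearch; the header below is a mock fragment with made-up values (not the real PF00931 record), and demo.hmm is a hypothetical file name.

```shell
# Write a mock HMM header fragment (values are illustrative, not real Pfam data).
tmp=$(mktemp -d)
cat > "$tmp/demo.hmm" <<'EOF'
HMMER3/f [3.1b2 | February 2015]
NAME  DEMO
LENG  252
GA    20.00 20.00;
TC    20.50 20.50;
NC    19.90 19.90;
EOF

# List the cutoff lines; an empty result means the --cut_* options cannot be used.
cutoffs=$(grep -E '^(GA|TC|NC)[[:space:]]' "$tmp/demo.hmm")
echo "$cutoffs"
```

On a real model, replace the mock file with the downloaded .hmm file.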



Identification of NBS-LRR genes

Predicted proteins from the cassava genome were scanned using HMMER v3 [39] using the Hidden Markov Model (HMM) corresponding to the Pfam [40] NBS (NB-ARC) family (PF00931; http://pfam.sanger.ac.uk/). From the proteins obtained using the raw NBS HMM, a high-quality protein set (E-value < 1 × 10^-20 and manual verification of an intact NBS domain) was aligned and used to construct a cassava-specific NBS HMM using hmmbuild from the HMMER v3 suite. This new cassava-specific HMM was used, and all proteins with an E-value lower than 0.01 were selected. NBS-LRR genes were further filtered based on manual curation and functional annotation against both the closest homolog from Arabidopsis and the UNIREF100 sequence database. Most of the proteins that were removed had at least a partial kinase domain, but no relationship to NBS-LRR genes; this result was expected because the NBS domain has smaller kinase subdomains.


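This figure shows the gene family identification strategy of that paper. First, hmmsearch is run across the whole genome with the hidden Markov model of the NB-ARC family for an initial search. Next, the high-quality candidates are selected (E-value < 1 × 10^-20) and aligned with clustalw2. A new hidden Markov model is then built from these trusted sequences with hmmbuild. Finally, the new model is used to screen for complete NBS family sequences; this step needs another round of filtering, and the number of family members found is generally larger than in the initial search.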


hmmsearch --cut_tc --domtblout NBS-ABC.out NBS-ARC.hmm Arabidopsis_thaliana.TAIR10.pep.all.fa

head NBS-ABC.out
#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
AT1G61180.1          -            889 NB-ARC               PF00931.22   252   1.4e-90  304.3   0.6   1   1   2.2e-92   2.5e-90  303.5   0.6     1   251   156   397   156   398 0.99 pep chromosome:TAIR10:1:22551271:22554684:1 gene:AT1G61180 transcript:AT1G61180.1 gene_biotype:protein_coding transcript_biotype:protein_coding description:LRR and NB-ARC domains-containing disease resistance protein [Source:UniProtKB/TrEMBL;Acc:Q2V4G0]
AT1G61180.2          -            899 NB-ARC               PF00931.22   252   1.5e-90  304.2   0.6   1   1   2.2e-92   2.5e-90  303.5   0.6     1   251   156   397   156   398 0.99 pep chromosome:TAIR10:1:22551271:22554684:1 gene:AT1G61180 transcript:AT1G61180.2 gene_biotype:protein_coding transcript_biotype:protein_coding description:LRR and NB-ARC domains-containing disease resistance protein [Source:UniProtKB/TrEMBL;Acc:Q2V4G0]
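
The domain table is whitespace-delimited, with the full-sequence E-value in column 7, the per-domain independent E-value in column 13 and the envelope coordinates in columns 20 and 21. A small sketch that pulls out those fields, using the two hits shown above with the free-text description column dropped; hits.domtbl is a hypothetical file name.

```shell
tmp=$(mktemp -d)
cat > "$tmp/hits.domtbl" <<'EOF'
# target accession tlen query accession qlen E-value score bias ...
AT1G61180.1 - 889 NB-ARC PF00931.22 252 1.4e-90 304.3 0.6 1 1 2.2e-92 2.5e-90 303.5 0.6 1 251 156 397 156 398 0.99
AT1G61180.2 - 899 NB-ARC PF00931.22 252 1.5e-90 304.2 0.6 1 1 2.2e-92 2.5e-90 303.5 0.6 1 251 156 397 156 398 0.99
EOF

# Skip comment lines; print target, full-sequence E-value, domain i-Evalue, envelope range.
summary=$(awk '!/^#/ { print $1, $7, $13, $20 "-" $21 }' "$tmp/hits.domtbl")
echo "$summary"
```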

grep -v "#" NBS-ABC.out|awk '($7 + 0) < 1E-20'|cut -f1 -d  " "|sort -u > NBS-ARC_qua_id.txt

~/biosoft/seqtk/seqtk subseq Arabidopsis_thaliana.TAIR10.pep.all.fa NBS-ARC_qua_id.txt >NBS-ARC_qua.fa
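
If seqtk is not installed, the same ID-based extraction can be sketched with awk. This is only an illustrative alternative, assuming FASTA headers whose first whitespace-delimited token is the sequence ID and a non-empty ID list; the sequences and file names below are made up.

```shell
tmp=$(mktemp -d)
cat > "$tmp/pep.fa" <<'EOF'
>seqA description one
MKT
LLV
>seqB description two
MAA
>seqC
MGG
EOF
printf 'seqA\nseqC\n' > "$tmp/ids.txt"

# First file: remember the wanted IDs.
# Second file: on each header decide whether to keep the record, then print kept lines.
subset=$(awk 'NR == FNR { want[$1] = 1; next }
     /^>/ { keep = (substr($1, 2) in want) }
     keep' "$tmp/ids.txt" "$tmp/pep.fa")
echo "$subset"
```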


clustalw2 -INFILE=NBS-ARC_qua.fa -ALIGN -OUTFILE=NBS-ARC_qua.aln

hmmbuild NBS-ARC_qua.hmm NBS-ARC_qua.aln

hmmsearch --domtblout NBS-ARC.second.out NBS-ARC_qua.hmm Arabidopsis_thaliana.TAIR10.pep.all.fa


grep -v "#" NBS-ARC.second.out|awk '($7 + 0) < 1E-03' | cut -f1 -d " "|sort -u >final.NBS.list

~/biosoft/seqtk/seqtk subseq Arabidopsis_thaliana.TAIR10.pep.all.fa final.NBS.list >final_NBS-ARC_qua.fa
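
The ID list is transcript-level, so a gene with several isoforms (such as AT1G61180.1 and AT1G61180.2 above) is counted more than once. A minimal sketch for collapsing to gene IDs before counting family members, assuming TAIR-style transcript IDs of the form GENE.N; AT0G00000.1 is a made-up placeholder ID.

```shell
tmp=$(mktemp -d)
printf 'AT1G61180.1\nAT1G61180.2\nAT0G00000.1\n' > "$tmp/final.NBS.list"

# Strip the trailing ".N" isoform suffix, then deduplicate.
genes=$(sed 's/\.[0-9][0-9]*$//' "$tmp/final.NBS.list" | sort -u)
echo "$genes"

# Number of distinct genes.
gene_count=$(printf '%s\n' "$genes" | wc -l | tr -d ' ')
echo "$gene_count"
```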

BLAST-based method



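makeblastdb -in ref.nbs.plant.fa -dbtype prot

blastp -query Arabidopsis_thaliana.TAIR10.pep.all.fa -db ref.nbs.plant.fa -outfmt 6 -out blastp.out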

cat blastp.out |awk '$3>75' |cut -f1 |sort -u > blastp_result_id.list
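
Percent identity alone can let short local matches through. A hedged sketch that also requires a minimum alignment length and a maximum E-value, assuming blastp.out is in tabular format (-outfmt 6, where column 3 is percent identity, column 4 alignment length and column 11 E-value). The length and E-value thresholds and the three rows are made up for illustration and are not part of the original workflow.

```shell
tmp=$(mktemp -d)
# Mock tabular rows: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
printf 'q1\tref1\t92.50\t300\t20\t1\t1\t300\t5\t304\t1e-150\t520\n'  > "$tmp/blastp.out"
printf 'q2\tref1\t80.00\t25\t5\t0\t10\t34\t40\t64\t0.5\t30.1\n'     >> "$tmp/blastp.out"
printf 'q3\tref2\t60.00\t280\t100\t4\t1\t280\t1\t276\t1e-80\t250\n' >> "$tmp/blastp.out"

# Keep queries with identity > 75 %, alignment length >= 100 and E-value < 1e-5.
kept=$(awk -F'\t' -v min_id=75 -v min_len=100 -v max_e=1e-5 \
    '$3 > min_id && $4 >= min_len && ($11 + 0) < max_e { print $1 }' "$tmp/blastp.out" | sort -u)
echo "$kept"
```

Here q2 is dropped for its short alignment and q3 for its low identity.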

Finally, we can intersect the gene IDs from the two methods to find the family members identified by both, which makes the result more reliable.

comm -12 blastp_result_id.list final.NBS.list > common.list

~/biosoft/seqtk/seqtk subseq Arabidopsis_thaliana.TAIR10.pep.all.fa common.list >final_search_NBS-ARC_qua.fa
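
comm expects both inputs to be sorted with the same collation, so it can help to fix the locale when sorting and comparing. A small sketch with made-up IDs that shows the shared set as well as the IDs found by only one method; the file names are hypothetical.

```shell
# Use one collation order for both sort and comm.
export LC_ALL=C
tmp=$(mktemp -d)
printf 'g3\ng1\ng2\n' | sort -u > "$tmp/blast.list"
printf 'g2\ng4\ng3\n' | sort -u > "$tmp/hmm.list"

both=$(comm -12 "$tmp/blast.list" "$tmp/hmm.list")        # found by both methods
blast_only=$(comm -23 "$tmp/blast.list" "$tmp/hmm.list")  # found by BLAST only
hmm_only=$(comm -13 "$tmp/blast.list" "$tmp/hmm.list")    # found by HMM only
echo "$both"
echo "$blast_only"
echo "$hmm_only"
```

The method-specific lists are a convenient starting point for manual checks of borderline candidates.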


For example: the NCBI CD-Search tool


Or alternatively: InterProScan sequence search

