『比较基因组学』TBtools 30秒解决基因组共线性。

直接给我惊呆了,从点开开始到出结果一共32秒ಥ_ಥ,想当年光blastp就得跑大半天,更不要说编译MCScanX,各种奇葩问题...


先看下什么是共线性:

Syntenic* = a set of loci in two different species which is located on the same chromosome in each (not necessarily in the same order).
Collinear = a set of loci in two different species which are located on the same chromosome in each, and are conserved in the same order.
Synteny now has the same meaning as collinear, despite the different origins of the terms
Source: Genomics and comparative genomics

collinear

之前做共线性分析的时候用的是MCscanX,还有python版本的JCVI,第一步的blastp基本是要了老命了,当时为了加快blastp的速度,把query的pep.fa文件切成几十个批量跑....反正也是折腾了好久。
最近CJ发布了两个插件:OneStepMCScanX-DiamondQuick Genome Dot Plot,从名字上就能看出来这个是怎么回事了,就是基于Diamond进行序列比对(Diamond比blast快很多倍!而且准确度也相差不是很多)其实相当于把序列比对和比对完的结果用MCScanX生成共线性文件,clt文件,包括gtf文件的合并等融合到一个步骤中了。当然现在的OneStep只是两个基因组间的比较。

具体方法

插件 | 超快!全基因组共线性分析
插件 | 超快!全基因组 Dot Plot
简洁 | 优雅地准备 比较基因组分析 文件

可能遇到的其他问题

去除scaffold

一般在共线性展示的时候,如果你的参考基因组组装到了染色体水平,scaffold或者contig等是要去掉的。如果有一点shell基础的话这个就很简单了
比如

  • 首先看下你gff文件的头文件,看看染色体编号形式和scaffold或者contig等的编号形式,这里我们看下JGI的雷蒙德氏棉的gff3
head -30 gene.Grai.JGI.gff3
##gff-version 3
##sequence-region   Chr01 6158 55678711
##sequence-region   Chr02 75 62758601
##sequence-region   Chr03 1 45758218
##sequence-region   Chr04 2059 62173553
##sequence-region   Chr05 6721 64137711
##sequence-region   Chr06 50440 51073468
##sequence-region   Chr07 1895 60974818
##sequence-region   Chr08 36764 57128522
##sequence-region   Chr09 8203 70705631
##sequence-region   Chr10 1619 62168479
##sequence-region   Chr11 3378 62659606
##sequence-region   Chr12 22670 35427486
##sequence-region   Chr13 518 58321163
##sequence-region   scaffold_102 2100 3605
##sequence-region   scaffold_107 2825 16443
##sequence-region   scaffold_108 3 13412
##sequence-region   scaffold_111 13319 15939
##sequence-region   scaffold_112 4529 5790
##sequence-region   scaffold_114 53 12661
##sequence-region   scaffold_115 3197 6044
##sequence-region   scaffold_116 268 1961
##sequence-region   scaffold_119 6 12264
##sequence-region   scaffold_122 5843 8827
##sequence-region   scaffold_127 2607 11788

再看下中棉所亚洲棉的gff3

head -30 gene.Garb.CRI.gff3 
##gff-version 3
##sequence-region   Chr01 1290 113026203
##sequence-region   Chr02 50819 99045064
##sequence-region   Chr03 942 135702384
##sequence-region   Chr04 55724 98593909
##sequence-region   Chr05 32236 97875121
##sequence-region   Chr06 5813 132229846
##sequence-region   Chr07 21338 97506195
##sequence-region   Chr08 8313 129422927
##sequence-region   Chr09 10279 85030627
##sequence-region   Chr10 64098 129481832
##sequence-region   Chr11 5334 124529361
##sequence-region   Chr12 6026 103049777
##sequence-region   Chr13 15788 123934920
##sequence-region   tig00000076 27806 38254
##sequence-region   tig00000208 71464 82652
##sequence-region   tig00000247 19374 20588
##sequence-region   tig00000248 17267 18598
##sequence-region   tig00000254 5842 784687
##sequence-region   tig00000266 95796 99132
##sequence-region   tig00000267 76876 81478
##sequence-region   tig00000312 258908 329614
##sequence-region   tig00000458 14417 14761
##sequence-region   tig00000498 12571 1498618
##sequence-region   tig00000545 29227 220599
##sequence-region   tig00000736 198220 210945
##sequence-region   tig00000744 182460 186937
##sequence-region   tig00000762 9152 12184
##sequence-region   tig00000807 27129 30536
##sequence-region   tig00000820 21055 21872

现在我们发现了再雷棉的gff3中是scaffold,而再亚洲棉的gff3中是tig,那么我们用grep -v(取反)将带有contig或者scaffold的行去掉

grep -v "tig" gene.Garb.CRI.gff3 > notigs.Garb.CRI.gff3
grep -v "scaff" gene.Grai.JGI.gff3 > notigs.Grai.CRI.gff3

检查一下:

head -30 notigs.Garb.CRI.gff3
##gff-version 3
##sequence-region   Chr01 1290 113026203
##sequence-region   Chr02 50819 99045064
##sequence-region   Chr03 942 135702384
##sequence-region   Chr04 55724 98593909
##sequence-region   Chr05 32236 97875121
##sequence-region   Chr06 5813 132229846
##sequence-region   Chr07 21338 97506195
##sequence-region   Chr08 8313 129422927
##sequence-region   Chr09 10279 85030627
##sequence-region   Chr10 64098 129481832
##sequence-region   Chr11 5334 124529361
##sequence-region   Chr12 6026 103049777
##sequence-region   Chr13 15788 123934920
Chr01   EVM gene    1290    21313   .   -   .   ID=Ga01G0001
###
Chr01   EVM mRNA    1290    21313   .   -   .   ID=Ga01G0001.1
Chr01   EVM CDS 1290    1426    .   -   2   ID=cds.Ga01G0001.1;Parent=Ga01G0001.1
Chr01   EVM CDS 21271   21313   .   -   0   ID=cds.Ga01G0001.1;Parent=Ga01G0001.1
###
Chr01   EVM gene    26354   29394   .   +   .   ID=Ga01G0002;Name=LPLAT1;description=Lysophospholipid acyltransferase 1 
###
Chr01   EVM mRNA    26354   29394   .   +   .   ID=Ga01G0002.1;Name=LPLAT1;description=Lysophospholipid acyltransferase 1 
Chr01   EVM CDS 26354   26396   .   +   0   ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01   EVM CDS 26960   27184   .   +   2   ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01   EVM CDS 27238   27699   .   +   2   ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01   EVM CDS 28202   28287   .   +   2   ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01   EVM CDS 28442   28505   .   +   0   ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01   EVM CDS 28585   28703   .   +   2   ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
Chr01   EVM CDS 28894   29027   .   +   0   ID=cds.Ga01G0002.1;Parent=Ga01G0002.1
###################################
head -30 notigs.Grai.CRI.gff3
##gff-version 3
##sequence-region   Chr01 6158 55678711
##sequence-region   Chr02 75 62758601
##sequence-region   Chr03 1 45758218
##sequence-region   Chr04 2059 62173553
##sequence-region   Chr05 6721 64137711
##sequence-region   Chr06 50440 51073468
##sequence-region   Chr07 1895 60974818
##sequence-region   Chr08 36764 57128522
##sequence-region   Chr09 8203 70705631
##sequence-region   Chr10 1619 62168479
##sequence-region   Chr11 3378 62659606
##sequence-region   Chr12 22670 35427486
##sequence-region   Chr13 518 58321163
Chr01   phytozomev10    gene    6158    6456    .   -   .   ID=Gorai.001G000100;Name=Trmt61a;description=tRNA (adenine(58)-N(1))-methyltransferase catalytic subunit TRMT61A 
Chr01   phytozomev10    mRNA    6158    6456    .   -   .   ID=Gorai.001G000100.1;Parent=Gorai.001G000100;Name=Gorai.001G000100.1;pacid=26824196;longest=1
Chr01   phytozomev10    CDS 6158    6415    .   -   0   ID=Gorai.001G000100.1.CDS.1;Parent=Gorai.001G000100.1;pacid=26824196
Chr01   phytozomev10    five_prime_UTR  6416    6456    .   -   .   ID=Gorai.001G000100.1.five_prime_UTR.1;Parent=Gorai.001G000100.1;pacid=26824196
###
Chr01   phytozomev10    gene    6987    7491    .   -   .   ID=Gorai.001G000200
Chr01   phytozomev10    mRNA    6987    7491    .   -   .   ID=Gorai.001G000200.1;Parent=Gorai.001G000200;Name=Gorai.001G000200.1;pacid=26820978;longest=1
Chr01   phytozomev10    three_prime_UTR 6987    7146    .   -   .   ID=Gorai.001G000200.1.three_prime_UTR.1;Parent=Gorai.001G000200.1;pacid=26820978
Chr01   phytozomev10    CDS 7147    7479    .   -   0   ID=Gorai.001G000200.1.CDS.1;Parent=Gorai.001G000200.1;pacid=26820978

现在就干净了。
如果不会shell这些操作怎么办呢,别着急,我们有另外一个工具Text Block Extract and filter(再TBtools 搜索栏直接搜索)
按照下图的操作即可

Text Block Extract and filter

发现染色体排序不符合预期

乱序的染色体排布

如果用过MCScanX 你一定会了解到ctl文件的作用。我们看下这个文件:

2500
600
Ga-Chr11,Ga-Chr10,Ga-Chr02,Ga-Chr13,Ga-Chr01,Ga-Chr12,Ga-Chr04,Ga-Chr03,Ga-Chr06,Ga-Chr05,Ga-Chr08,Ga-Chr07,Ga-Chr09
Gr-Chr11,Gr-Chr10,Gr-Chr02,Gr-Chr13,Gr-Chr01,Gr-Chr12,Gr-Chr04,Gr-Chr03,Gr-Chr06,Gr-Chr05,Gr-Chr08,Gr-Chr07,Gr-Chr09

看到了吧,罪魁祸首在这里,我们只需要手动的把这个配置文件的染色体顺序改正确就可以了

2500
600
Ga-Chr01,Ga-Chr02,Ga-Chr03,Ga-Chr04,Ga-Chr05,Ga-Chr06,Ga-Chr07,Ga-Chr08,Ga-Chr09,Ga-Chr10,Ga-Chr11,Ga-Chr12,Ga-Chr13
Gr-Chr01,Gr-Chr02,Gr-Chr03,Gr-Chr04,Gr-Chr05,Gr-Chr06,Gr-Chr07,Gr-Chr08,Gr-Chr09,Gr-Chr10,Gr-Chr11,Gr-Chr12,Gr-Chr13
改ctl之后

发散下思维 前面也可以不去掉scaffold,到时候直接这里面把没有到染色体水平的那些玩意给删了是不是也可以,我这里没有测试。

你可能感兴趣的:(『比较基因组学』TBtools 30秒解决基因组共线性。)