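The molecular clock hypothesis holds that the bases of a gene sequence replace one another at a nearly constant rate over time. In other words, the evolutionary rate is constant, so the genetic distance between two species is proportional to their divergence time. We usually use the fourfold degenerate sites of single-copy orthologous genes to estimate the molecular clock (substitution rate) and the divergence times between species; the transversion rate at these sites is known as 4DTv. A change at the third base of a fourfold degenerate codon does not alter the encoded amino acid, so these sites are treated as evolving neutrally, and the neutral substitution rate is usually expressed as the number of substitutions per site per year. Many programs estimate divergence times with Bayesian statistics or other methods, for example r8s, MCMCTREE, MULTIDIVTIME and BEAST. They use different strategies to integrate fossil time information into a phylogenetic tree and so produce a divergence time tree.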
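The proportionality between distance and time can be illustrated with a small sketch. It assumes a strict clock, under which two lineages that split T years ago accumulate d = 2rT substitutions per site, so T = d / (2r). The function name and the numbers are hypothetical and serve only as an illustration.

```python
def divergence_time(distance: float, rate: float) -> float:
    """Return the divergence time in years under a strict molecular clock.

    distance: substitutions per site separating the two species
    rate:     substitutions per site per year along a single lineage
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Both lineages accumulate changes after the split, hence the factor 2.
    return distance / (2.0 * rate)


# Hypothetical values: d = 0.12 substitutions/site, r = 1.5e-9 per site per year.
t_years = divergence_time(0.12, 1.5e-9)
print(f"{t_years / 1e6:.1f} Mya")
```

Real data rarely follow a strict clock, which is why the later steps use relaxed-clock models instead of this direct division.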
1. Build the set of single-copy orthologous genes
2. Extract the fourfold degenerate sites
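# Concatenate per-gene codon alignments into one sequence per species.
# Sequence IDs are expected to look like "species|gene".
use strict;
use warnings;
use Bio::SeqIO;

# Input: per-gene codon alignment FASTA files given on the command line
# (assumption: the original input file pattern is not shown).
my @in = @ARGV;
die "perl $0 <alignment.fa> [<alignment.fa> ...]\n" unless @in;

my %seq;         # species => concatenated alignment
my %name;        # species => concatenated original IDs
my $alllen = 0;  # total alignment length
my $number = 0;  # number of genes (input files)

for my $in (sort @in) {
    my $fa = Bio::SeqIO->new(-file => $in, -format => "fasta");
    my $len = 0;
    while (my $rec = $fa->next_seq) {
        my $id  = $rec->id;
        my $seq = $rec->seq;
        $len = length($seq);
        # Each alignment must consist of whole codons.
        if ($len % 3 != 0) {
            die "wrong: $in\t$id\n";
        }
        # A codon must be either all gaps or free of gaps.
        my @seq = split(//, $seq);
        for (my $i = 0; $i < @seq; $i += 3) {
            my $word = $seq[$i] . $seq[$i+1] . $seq[$i+2];
            die "wrong: $in\t$id\n" if ($word =~ /-/ && $word =~ /\w/);
        }
        ### check done
        my @id = split(/\|/, $id);
        $seq{$id[0]}  .= $seq;
        $name{$id[0]} .= $id;
    }
    $alllen += $len;
    $number++;
}

# List of the genes that were concatenated for each species.
open(my $list, ">", "$0.list") or die "Cannot write $0.list: $!\n";
print $list "Length:\t$alllen\tNumber:\t$number\n";
for my $k1 (sort keys %name) {
    my $v = $name{$k1};
    my @v = split(/\Q$k1\E\|/, $v);
    shift @v;
    print $list "$k1\t", scalar(@v), "\t", join(",", @v), "\n";
}
close $list;

# Concatenated CDS alignment, one record per species.
open(my $out, ">", "$0.connect.cds.fa") or die "Cannot write $0.connect.cds.fa: $!\n";
for my $k1 (sort keys %seq) {
    print $out ">$k1\n$seq{$k1}\n";
}
close $out;
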
# Convert a multi-FASTA alignment into pairwise AXT records.
use strict;
use warnings;
## Author: Fan Wei
## Date: 2008-12-22

my $file = shift or die "perl $0 <alignment.mfa>\n";
mfa2axt($file);

## Write every pair of sequences in the alignment as one AXT record.
sub mfa2axt {
    my $mfa_file = shift;
    my (%name_seq, %pair);
    open(my $out, ">", "$mfa_file.axt") or die "fail $mfa_file.axt\n";
    open(my $in, "<", $mfa_file) or die "fail open $mfa_file\n";
    $/ = ">"; <$in>; $/ = "\n";   # skip everything up to the first ">"
    while (<$in>) {
        my ($name) = /^(\S+)/;    # rest of the header line
        $/ = ">";
        my $seq = <$in>;          # sequence lines up to the next ">"
        chomp $seq;
        $/ = "\n";
        $seq =~ s/\s//g;
        $name_seq{$name} = $seq;
    }
    close $in;

    # Enumerate each unordered pair once.
    foreach my $first (sort keys %name_seq) {
        foreach my $second (sort keys %name_seq) {
            next if ($first eq $second || exists $pair{"$second&$first"});
            $pair{"$first&$second"} = 1;
        }
    }

    foreach (sort keys %pair) {
        if (/([^&]+)&([^&]+)/) {
            print $out $_ . "\n" . $name_seq{$1} . "\n" . $name_seq{$2} . "\n\n";
        }
    }
    close $out;
}

use strict;
##author: sun ming'an
##modifier: fanwei
##correction: LiJun
##Date: 2008-9-24
##4dtv (transversion rate on 4-fold degenerated sites) are calculated with HKY substitution models
##Reference: M. Hasegawa, H. Kishino, and T. Yano, J. Mol. Evol. 22 (2), 160 (1985)
die "perl $0 AXTfile > outfile\n" unless( @ARGV == 1);

## codons whose third position is fourfold degenerate
my %codons=(
    'CTT'=>'L', 'CTC'=>'L', 'CTA'=>'L', 'CTG'=>'L',
    'GTT'=>'V', 'GTC'=>'V', 'GTA'=>'V', 'GTG'=>'V',
    'TCT'=>'S', 'TCC'=>'S', 'TCA'=>'S', 'TCG'=>'S',
    'CCT'=>'P', 'CCC'=>'P', 'CCA'=>'P', 'CCG'=>'P',
    'ACT'=>'T', 'ACC'=>'T', 'ACA'=>'T', 'ACG'=>'T',
    'GCT'=>'A', 'GCC'=>'A', 'GCA'=>'A', 'GCG'=>'A',
    'CGT'=>'R', 'CGC'=>'R', 'CGA'=>'R', 'CGG'=>'R',
    'GGT'=>'G', 'GGC'=>'G', 'GGA'=>'G', 'GGG'=>'G');

## purine <-> pyrimidine changes
my %transversion = (
    "A" => "TC",
    "C" => "AG",
    "G" => "TC",
    "T" => "AG",
);

my $axtFile = shift;
open(AXT,"$axtFile")||die"Cannot open $axtFile\n";
$/ = "\n\n";
my @seqs = <AXT>;   ## one AXT record (tag + two sequences) per element
$/ ="\n";
close AXT;

print "tag\t4dtv_corrected\t4dtv_raw\tcondon_4d\tcodon_4dt\n";
foreach my $line ( @seqs ){
    chomp $line;
    if( $line =~ /^(\S+)\n(\S+)\n(\S+)$/ ){
        my $tag = $1;
        my $seq1 =$2;
        my $seq2 =$3;
        my ($corrected_4dtv, $raw_4dtv, $condon_4d, $codon_4dt) = &calculate_4dtv($seq1, $seq2);
        print "$tag\t$corrected_4dtv\t$raw_4dtv\t$condon_4d\t$codon_4dt\n";
    }
}

sub calculate_4dtv {
    my($str1, $str2) = @_;
    my ($condon_4d, $codon_4dt) = (0,0);
    my ($V,$a,$b,$d) = (0,0,0,0);
    my %fre=(A=>0, C=>0, G=>0, T=>0);
    for( my $i = 0; $i < length($str1); $i += 3){
        my $codon1 = uc(substr($str1, $i, 3));
        my $codon2 = uc(substr($str2, $i, 3));
        my $base1= uc(substr($str1, $i+2, 1));
        my $base2= uc(substr($str2, $i+2, 1));
        ## both codons are fourfold degenerate and encode the same amino acid
        if( exists $codons{$codon1} && exists $codons{$codon2} && $codons{$codon1} eq $codons{$codon2} ){
            $fre{$base1}++;
            $fre{$base2}++;
            $condon_4d++;
            $codon_4dt++ if(is_transversion($codon1,$codon2));
        }
    }
    if($condon_4d > 0){
        $V=$codon_4dt / $condon_4d; ##this is raw 4dtv value
        ##correction the raw 4dtv values by HKY substitution model
        ##(the frequency and distance expressions below are restored following the
        ## F84/HKY85 distance formula with the transition term omitted; compare
        ## them with your own copy of the script before relying on exact values)
        $fre{Y}=$fre{T}+$fre{C};
        $fre{R}=$fre{A}+$fre{G};
        foreach (keys %fre){
            $fre{$_}=0.5*$fre{$_}/$condon_4d;
        }
        if($fre{Y}!=0 && $fre{R}!=0 && $fre{A}!=0 && $fre{C}!=0 && $fre{G}!=0 && $fre{T}!=0){
            $a=-1*log(1-$V*($fre{T}*$fre{C}*$fre{R}/$fre{Y}+$fre{A}*$fre{G}*$fre{Y}/$fre{R})/(2*($fre{T}*$fre{C}*$fre{R}+$fre{A}*$fre{G}*$fre{Y})));
            if (1-$V/(2*$fre{Y}*$fre{R}) > 0) {
                $b=-1*log(1-$V/(2*$fre{Y}*$fre{R}));
                $d=2*$a*($fre{T}*$fre{C}/$fre{Y}+$fre{A}*$fre{G}/$fre{R})-2*$b*($fre{T}*$fre{C}*$fre{R}/$fre{Y}+$fre{A}*$fre{G}*$fre{Y}/$fre{R}-$fre{Y}*$fre{R});
            }else{
                $d = "NA";
            }
        }else{
            $d = "NA";
        }
    }
    return ($d,$V,$condon_4d, $codon_4dt);
}

## 1 if the third bases of the two codons differ by a transversion, else 0
sub is_transversion{
    my ($codon1,$codon2) = @_;
    my $is_transversion = 0;
    my $base1 = substr($codon1,2,1);
    my $base2 = substr($codon2,2,1);
    $is_transversion = 1 if (exists $transversion{$base1} && $transversion{$base1} =~ /$base2/);
    return $is_transversion;
}

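The scripts above concatenate the alignments and compute pairwise 4DTv, but the later steps read a file of fourfold degenerate sites (`allSingleCopyOrthologsAlign.4Dsite.fas`). The following is a minimal sketch of how such a file could be derived from a concatenated codon alignment. It assumes the same eight fourfold degenerate codon families as the `%codons` table, and it keeps a column only when every species carries a fourfold degenerate codon for the same amino acid. The function name `extract_4d_sites` and the toy sequences are hypothetical, not part of the original pipeline.

```python
from typing import Dict

# First two bases of the eight fourfold degenerate codon families
# (same families as the %codons table in the Perl script).
FOURFOLD_PREFIXES = {
    "CT": "L", "GT": "V", "TC": "S", "CC": "P",
    "AC": "T", "GC": "A", "CG": "R", "GG": "G",
}


def extract_4d_sites(alignment: Dict[str, str]) -> Dict[str, str]:
    """Return, per species, the third bases of the codon columns that are
    fourfold degenerate in every species and encode the same amino acid."""
    names = list(alignment)
    seqs = [alignment[name].upper() for name in names]
    length = len(seqs[0])
    if length % 3 != 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned and a multiple of 3 long")

    kept = {name: [] for name in names}
    for i in range(0, length, 3):
        codons = [s[i:i + 3] for s in seqs]
        # None marks a codon that is gapped, ambiguous or not fourfold degenerate.
        amino_acids = {
            FOURFOLD_PREFIXES.get(c[:2]) if c[2] in "ACGT" else None
            for c in codons
        }
        if len(amino_acids) == 1 and None not in amino_acids:
            for name, codon in zip(names, codons):
                kept[name].append(codon[2])
    return {name: "".join(bases) for name, bases in kept.items()}


# Toy alignment with hypothetical species names.
toy = {"sp1": "CTTATGGGA", "sp2": "CTCATGGGG"}
for name, sites in extract_4d_sites(toy).items():
    print(f">{name}\n{sites}")
```

Requiring the same amino acid in all species is a conservative choice; a pipeline could also accept columns where only the fourfold degenerate status is shared.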
3. Build the species tree
iqtree -s allSingleCopyOrthologsAlign.4Dsite.fas -m MFP -b 1000 -nt 5
raxml-ng --bootstrap --msa prim.phy --model GTR+G --prefix T8 --seed 2 --threads 2 --bs-trees 200
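The `raxml-ng` command reads a PHYLIP file (`prim.phy`), whereas the other steps use a FASTA alignment. In case a conversion is needed, the sketch below turns FASTA text into relaxed sequential PHYLIP. It assumes that a header line with the number of sequences and the alignment length, followed by one `name  sequence` line per species (two spaces as separator), is acceptable to the downstream program. The function name is hypothetical.

```python
def fasta_to_phylip(fasta_text: str) -> str:
    """Convert aligned FASTA text to relaxed sequential PHYLIP text."""
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the first word of the header as the sequence name.
            name, chunks = line[1:].split()[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))

    if not records:
        raise ValueError("no sequences found")
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("sequences are not all the same length")

    lines = [f"{len(records)} {length}"]
    lines += [f"{seq_name}  {seq}" for seq_name, seq in records]
    return "\n".join(lines) + "\n"


# Toy input with hypothetical species names.
print(fasta_to_phylip(">sp1\nACGT\nAC\n>sp2\nACGAAC\n"), end="")
```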
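4. Obtain fossil divergence times
Divergence times can be looked up at http://www.timetree.org/ . The site reports the divergence time between any two species as supported by multiple publications, together with a confidence interval, in Mya (million years ago). Another site for looking up divergence times is https://fossilcalibrations.org/ ; the two can be checked against each other.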
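Once an interval has been looked up, it has to be written into the tree file as a node label. The sketch below formats a soft-bound calibration in the `'B(min,max)'` notation that MCMCtree accepts and rescales it from Mya to the time unit used in the tree. The species names, the 60 to 80 Mya interval and the helper name `calibration_label` are all hypothetical.

```python
def calibration_label(min_mya: float, max_mya: float, unit_myr: float = 100.0) -> str:
    """Format a minimum/maximum age pair as an MCMCtree bound calibration.

    unit_myr is the length of one time unit in million years
    (100.0 means one unit = 100 Myr, 1.0 means one unit = 1 Myr).
    """
    if not 0 < min_mya < max_mya:
        raise ValueError("expected 0 < min_mya < max_mya")
    lower = min_mya / unit_myr
    upper = max_mya / unit_myr
    return f"'B({lower:g},{upper:g})'"


# Hypothetical three-species tree with one calibrated node (60-80 Mya).
label = calibration_label(60, 80)
tree_file = f"3 1\n((speciesA,speciesB){label},speciesC);\n"
print(tree_file, end="")
```

The first line of the tree file gives the number of species and the number of trees, matching the `13 1` header used in the steps below.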
5. Estimate species divergence times
- step 1: estimate the substitution rate
13 1
Note: the calibrated time node must be marked in the tree (unit: 100 million years).
seqfile = allSingleCopyOrthologsAlign.4Dsite.fas
treefile = species.tree0
outfile = mlb * main result file
noisy = 3 * 0,1,2,3: how much rubbish on the screen
verbose = 1 * 1: detailed output, 0: concise output
runmode = 0 * 0:user tree; 1:semi-automatic; 2:automatic
* 3:StepwiseAddition; (4,5):PerturbationNNI
model = 7 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85, 5:T92, 6:TN93, 7:REV
* 8:UNREST, 9:REVu; 10:UNRESTu
Mgene = 0 * 0:rates, 1:separate; 2:diff pi, 3:diff kappa, 4:all diff
clock = 1 * 0:no clock, 1:clock; 2:local clock; 3:CombinedAnalysis
fix_kappa = 0 * 0: estimate kappa; 1: fix kappa at value below
kappa = 2 * initial or fixed kappas
fix_alpha = 0 * 0: estimate alpha; 1: fix alpha at value below
alpha = 0.5 * initial or fixed alpha, 0:infinity (constant rate)
ncatG = 5 * # of categories in the dG, AdG, or nparK models of rates
fix_rho = 1 * 0: estimate rho; 1: fix rho at value below
rho = 0 * initial or fixed rho, 0:no correlation
Malpha = 0 * 1: different alpha's for genes, 0: one alpha
nparK = 0 * rate-class models. 1:rK, 2:rK&fK, 3:rK&MK(1/K), 4:rK&MK
getSE = 1 * 0: don't want SEs of estimates, 1: want SEs
RateAncestor = 0 * (0,1,2): rates (alpha>0) or ancestral states
method = 0 * Optimization method 0: simultaneous; 1: one branch a time
Small_Diff = 0.5e-6
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
fix_blength = 0 * 0: ignore, -1: random, 1: initial, 2: fixed
baseml baseml.ctl
- step 2: estimate the gradient and Hessian of the branch lengths
13 1
**Note: the unit is million years.**
seed = -1
seqfile = allSingleCopyOrthologsAlign.4Dsite.fas
treefile = species.tree1
outfile = out_usedata3
ndata = 1
usedata = 3 * 0: no data; 1:seq like; 2:normal approximation
clock = 3 * 1: global clock; 2: independent rates; 3: correlated rates
* RootAge = '<10' * constraint on root age, used if no fossil for root.
model = 7 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha = 0.5 * alpha for gamma rates at sites
ncatG = 5 * No. categories in discrete gamma
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
BDparas = 1 1 0 * birth, death, sampling
kappa_gamma = 6 2 * gamma prior for kappa
alpha_gamma = 1 1 * gamma prior for alpha
rgene_gamma = 1 2.288 * gamma prior for rate for genes ### 1/substitution rate
sigma2_gamma = 1 4.5 * gamma prior for sigma^2 (for clock=2 or 3)
finetune = 1: 0.06 0.5 0.006 0.12 0.4 * times, rates, mixing, paras, RateParas
print = 1
burnin = 500000
sampfreq = 5000
nsample = 20000
*** Note: Make your window wider (100 columns) when running this program.
mcmctree mcmctree3.ctl
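The `1/substitution rate` remark on `rgene_gamma` can be spelled out. A gamma prior with shape α and rate β has mean α/β, so with α = 1 the second number is the reciprocal of the substitution rate estimated by `baseml` in step 1. The sketch below makes that arithmetic explicit. The rate 0.437 is an assumption chosen only because its reciprocal reproduces the 2.288 used in the control file; it is not a value reported in the text.

```python
def rgene_gamma_from_rate(rate: float, shape: float = 1.0):
    """Return (alpha, beta) of a gamma prior whose mean equals `rate`.

    rate: substitutions per site per time unit, e.g. from the baseml run.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    # Mean of Gamma(alpha, beta) is alpha / beta, so beta = alpha / rate.
    return shape, shape / rate


# Hypothetical baseml estimate (assumption, see the paragraph above).
alpha, beta = rgene_gamma_from_rate(0.437)
print(f"rgene_gamma = {alpha:g} {beta:.3f}")
```

The rate must be expressed in the same time unit as the calibrations in the tree file, otherwise the prior mean will be off by the corresponding factor.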
- step 3: estimate the species divergence times
mv out.BV in.BV
seed = -1
seqfile = allSingleCopyOrthologsAlign.4Dsite.fas
treefile = species.tree1
outfile = out_usedata2
ndata = 1
usedata = 2 * 0: no data; 1:seq like; 2:normal approximation
clock = 3 * 1: global clock; 2: independent rates; 3: correlated rates
* RootAge = '<10' * constraint on root age, used if no fossil for root.
model = 7 * 0:JC69, 1:K80, 2:F81, 3:F84, 4:HKY85
alpha = 0.5 * alpha for gamma rates at sites
ncatG = 5 * No. categories in discrete gamma
cleandata = 0 * remove sites with ambiguity data (1:yes, 0:no)?
BDparas = 1 1 0 * birth, death, sampling
kappa_gamma = 6 2 * gamma prior for kappa
alpha_gamma = 1 1 * gamma prior for alpha
rgene_gamma = 1 2.288 * gamma prior for rate for genes ### 1/substitution rate
sigma2_gamma = 1 4.5 * gamma prior for sigma^2 (for clock=2 or 3)
finetune = 1: 0.06 0.5 0.006 0.12 0.4 * times, rates, mixing, paras, RateParas
print = 1
burnin = 500000
sampfreq = 5000
nsample = 20000
*** Note: Make your window wider (100 columns) when running this program.
nohup mcmctree mcmctree2.ctl >mcmctree.ctl.log 2>&1 &
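The three sampling settings in the control file determine how long the chain runs: the first `burnin` iterations are discarded, then one sample is kept every `sampfreq` iterations until `nsample` samples have been collected. A small sketch of the arithmetic, using the values from the control file above (the function name is hypothetical):

```python
def total_iterations(burnin: int, sampfreq: int, nsample: int) -> int:
    """Number of MCMC iterations implied by the three settings."""
    return burnin + sampfreq * nsample


# Values taken from the control files in this section.
total = total_iterations(burnin=500_000, sampfreq=5_000, nsample=20_000)
print(f"{total:,} iterations")
```

With these settings the chain is long, which is one reason the command is started in the background with `nohup`.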
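6. Check the results
The most direct check is to run mcmctree (or infinitesites) two or more times with different seed values and compare whether the resulting trees agree. In practice this means comparing the height (divergence time / posterior time) of each internal node in the tree files. Compute the total percentage deviation over all of them; if it is below 0.1%, the runs agree very well and are considered to have converged. As a supplementary check, Tracer can be used to analyze the mcmc.txt file and inspect the ESS values; values above 200 are generally taken to suggest convergence. Finally, if the runs have not converged, increase burnin and nsample and rerun the program.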
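The comparison of two runs can be sketched as follows. The code assumes that each mcmc.txt is a whitespace-delimited table with a header line and that node-age columns start with `t_`; both are assumptions about the file layout and should be checked against the actual output. The deviation is defined here as the sum of absolute differences between posterior mean ages divided by the sum of the ages from the first run, which is only one possible reading of "total percentage deviation". The ESS estimator is a simple autocorrelation-based one and may differ from the value Tracer reports. The toy traces are made up.

```python
from typing import Dict, List


def read_trace(text: str) -> Dict[str, List[float]]:
    """Parse a whitespace-delimited MCMC trace that starts with a header line."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    return {name: [float(r[i]) for r in body] for i, name in enumerate(header)}


def node_age_means(trace: Dict[str, List[float]]) -> Dict[str, float]:
    """Posterior mean of every column whose name starts with 't_' (assumed node ages)."""
    return {k: sum(v) / len(v) for k, v in trace.items() if k.startswith("t_")}


def percent_deviation(means1: Dict[str, float], means2: Dict[str, float]) -> float:
    """Total absolute difference between two runs, as a percentage of run 1."""
    shared = sorted(set(means1) & set(means2))
    if not shared:
        raise ValueError("the two runs share no node-age columns")
    diff = sum(abs(means1[k] - means2[k]) for k in shared)
    return 100.0 * diff / sum(means1[k] for k in shared)


def effective_sample_size(samples: List[float]) -> float:
    """Simple ESS: n / (1 + 2 * sum of leading positive autocorrelations)."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    if var == 0:
        return float(n)
    tau = 1.0
    for lag in range(1, n):
        cov = sum((samples[i] - mean) * (samples[i + lag] - mean)
                  for i in range(n - lag)) / n
        rho = cov / var
        if rho <= 0:  # stop at the first non-positive autocorrelation
            break
        tau += 2.0 * rho
    return n / tau


# Made-up traces standing in for the mcmc.txt files of two runs.
run_a = "Gen\tt_n4\tt_n5\tlnL\n0\t1.00\t0.50\t-10\n10\t1.02\t0.52\t-11\n"
run_b = "Gen\tt_n4\tt_n5\tlnL\n0\t1.01\t0.51\t-10\n10\t1.01\t0.51\t-12\n"
deviation = percent_deviation(node_age_means(read_trace(run_a)),
                              node_age_means(read_trace(run_b)))
print(f"deviation = {deviation:.4f}%  converged = {deviation < 0.1}")
```

For real chains with tens of thousands of samples, the quadratic-time ESS loop is slow if the autocorrelation decays slowly; a vectorized implementation would be preferable there.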
- http://www.chenlianfu.com/?p=2974
- https://www.jianshu.com/p/f9e5fe95478d
- http://www.fish-evol.org/mcmctreeExampleVert6/text1Eng.html
- http://abacus.gene.ucl.ac.uk/software/MCMCtreeStepByStepManual.pdf
- https://dosreislab.github.io/2017/10/24/marginal-likelihood-mcmc3r.html
- http://nebc.nerc.ac.uk/bioinformatics/documentation/paml/doc/MCMCtreeDoc.pdf