利用perl编写计算回交材料其中一个亲本片段占比的脚本

问题:两个玉米自交系A和B,先进行杂交,产生的F1与B进行回交,假设有100对标记用于区分A和B。在BC1中,我们利用这100对标记对每个单株进行了分型。求每个单株的亲本片段占比?

考虑到每个标记代表的片段长度不相同,我们用如下脚本完成这个比例计算。

需要三个输入文件:

  • 基因组长度文件 (以玉米B73 V4为例,每行包含染色体编号及长度)

file1:
1 307041717
2 244442276
3 235667834
4 246994605
5 223902240
6 174033170
7 182381542
8 181122637
9 159769782
10 150982314

  • 标记位置文件(只需要染色体和位置信息即可,用-连接)

file2:
1-12699720

  • 需要计算的BC1单株的分型结果中一个亲本的标记(每行一个单株)

file3:
plant1 1-185105598 1-201198031 1-223298828 3-69002888 3-69002888
plant2 1-223298828 3-69002888 3-186335692 3-211312218 3-229253739
plant3 6-38164584 6-40024641 6-83579108 7-14105029 7-49113016

perl脚本如下:

#!/usr/bin/perl

use warnings;
use strict;

open GENOME_LEN,"file1";
my %genome_length = ();
my $genome_length = 0;
while () {
    chomp;
    my @a = split/\s+/;
    $genome_length{$a[0]} = $a[1];
    $genome_length += $a[1];
}
close GENOME_LEN;

open MARKER_INFO,"file2";
my %marker_info = ();
while () {
    chomp;
    my @a = split/-/;
    push @{$marker_info{$a[0]}},$a[1];
}
close MARKER_INFO;

my %marker = ();
foreach my $chr (keys %marker_info) {
    chomp $chr;
    my @sort_marker_info = sort{$a<=>$b}@{$marker_info{$chr}};
    my $start = 0;
    my $end = 0;
    foreach my $a (0..$#sort_marker_info) {
        if ($a == 0) {
            $start = 0;
        }else{
            $start = int(($sort_marker_info[$a] + $sort_marker_info[$a-1])/2);
        }
        if ($a == $#sort_marker_info) {
            $end = $genome_length{$chr};
        }else{
            $end = int(($sort_marker_info[$a+1] + $sort_marker_info[$a])/2);
        }
        $marker{$chr}{$sort_marker_info[$a]} = $end - $start;
    }
}

open POSITIVE_MARKER,"file3";
open OUT,">recovery_rate.txt";


print OUT "Marker_name\tGenome_length\tRecovery_length\tRecovery_rate\n";
while () {
    chomp;
    next if /^$/;
    my @a = split/\s+/;
    print OUT "$a[0]\t";
    my $recovery_length = 0;
    foreach  (1..$#a) {
        $a[$_] =~ /(\d+)-(\d+)/;
        $recovery_length += $marker{$1}{$2};
        #print OUT "\n$1-$2\t$marker{$1}{$2}\n";
    }
    print OUT "$genome_length\t$recovery_length\t";
    printf OUT "%.2f",(($recovery_length/$genome_length)*100);
    print OUT "%\n";
}
close POSITIVE_MARKER;
close OUT;

输出结果包含四列:Marker_name、Genome_length、Recovery_length和Recovery_rate。

你可能感兴趣的:(利用perl编写计算回交材料其中一个亲本片段占比的脚本)