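Pure Python: read_counts to FPKM, v2

I rewrote the script that converts a read_counts matrix into an FPKM matrix. The previous version only worked for 18 samples; this update removes the limit on the number of samples.
It still takes three steps: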

grep "exon" genome.gtf > genome_exon.gtf
python count_genelen_from_gft.py genome_exon.gtf gene.len
python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix

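Step 1: extract the exon records from the GTF file
Step 2: compute the summed exon length of each gene
Step 3: derive the FPKM matrix from the total mapped reads of each sample, the length of each gene, and the read_counts matrix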
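The arithmetic behind step 3 can be sketched as a single function. This is a minimal illustration, assuming the usual FPKM definition (reads per kilobase of exon per million mapped reads); the function name `fpkm` and the example numbers are made up for illustration and are not part of the scripts below.

```python
def fpkm(read_count, gene_length_bp, total_mapped_reads):
    """Return FPKM for one gene in one sample.

    FPKM = read_count * 1e9 / (total_mapped_reads * gene_length_bp)
    """
    if gene_length_bp == 0 or total_mapped_reads == 0:
        return 0.0  # avoid dividing by zero; report no expression
    return read_count * 1e9 / (total_mapped_reads * gene_length_bp)


# Hypothetical numbers: 100 reads on a 2,000 bp gene, 20 million mapped reads.
print(fpkm(100, 2000, 20_000_000))  # 2.5
```

Dividing by the length in kilobases and by the total in millions, as the step 3 script does, is the same calculation written in two stages.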

Step 1:
Just run the first command above, remembering to replace the input GTF file name with your own.
Example GTF:

Morimp01GS001   AUGUSTUS        start_codon     47156   47158   .       -       0       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    47073   47158   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        CDS     47073   47158   0.6     -       0       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    46887   46999   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        CDS     46887   46999   0.92    -       1       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    46644   46784   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        CDS     46644   46784   0.91    -       2       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    45970   46559   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        CDS     45970   46559   0.92    -       2       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        stop_codon      45970   45972   .       -       0       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";

Morimp01GS001   AUGUSTUS        start_codon     54027   54029   .       +       0       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        exon    54027   54291   .       +       .       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        CDS     54027   54291   0.59    +       0       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        exon    54361   54689   .       +       .       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        CDS     54361   54689   0.22    +       2       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001   AUGUSTUS        stop_codon      54687   54689   .       +       0       gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";

Example output:

Morimp01GS001   AUGUSTUS        exon    16797   16949   .       +       .       gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001   AUGUSTUS        exon    17024   17219   .       +       .       gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001   AUGUSTUS        exon    17298   18718   .       +       .       gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001   AUGUSTUS        exon    47073   47158   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001   AUGUSTUS        exon    46887   46999   .       -       .       gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
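Note that grep "exon" keeps any line containing the text "exon" anywhere, including inside an attribute value. If that is a concern for your annotation, a stricter filter can test the feature column (the third tab-separated field) instead. A minimal sketch; the function name `keep_exon_lines` and the two sample lines are hypothetical and not part of the original scripts.

```python
def keep_exon_lines(lines):
    """Yield only GTF lines whose feature column (3rd field) is exactly 'exon'."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment lines
        fields = line.split("\t")
        if len(fields) >= 9 and fields[2] == "exon":
            yield line


# Made-up lines: the second is a CDS record whose gene ID happens to contain "exon".
sample = [
    'chr1\tAUGUSTUS\texon\t100\t200\t.\t+\t.\tgene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\tgene_id "exon_like_g2";\n',
]
kept = list(keep_exon_lines(sample))
print(len(kept))  # 1
```

To filter a real file, pass the open file object to `keep_exon_lines` and write the yielded lines to genome_exon.gtf.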

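Step 2:
Some readers reported problems running this step. The usual cause is which field the code takes as the gene ID. I split each line on the double-quote character (") and take a[-2], which is the last quoted value on the line (gene_name in the example above). If that picks the wrong field, try changing a[-2] to a[1], which is the first quoted value (gene_id in the example above). Either way, the point is to select the field that holds the gene name.
Contents of count_genelen_from_gft.py: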

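import sys

gtf_path = sys.argv[1]   # exon-only GTF from step 1
out_path = sys.argv[2]   # output: gene ID <tab> summed exon length

current_gene = None      # gene whose exons are being accumulated
exon_lengths = []

# Note: exon lines of one gene must be consecutive, and an exon shared by
# several transcripts is counted once per transcript.
with open(gtf_path) as gtf, open(out_path, "w") as out:
    for line in gtf:
        if not line.strip() or line.startswith("#"):
            continue
        # Splitting on '"' puts the last quoted attribute value at index -2.
        gene = line.split('"')[-2]
        pos = line.split("\t")
        length = abs(int(pos[4]) - int(pos[3])) + 1
        if gene != current_gene:
            # A new gene starts: write out the previous one first.
            if current_gene is not None:
                out.write("{0}\t{1}\n".format(current_gene, sum(exon_lengths)))
            current_gene = gene
            exon_lengths = []
        exon_lengths.append(length)
    # Write the last gene, which the loop itself never flushes.
    if current_gene is not None:
        out.write("{0}\t{1}\n".format(current_gene, sum(exon_lengths)))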

Example gene.len output:

MIM04M24Gene00599       2898
MIM04M24Gene00600       1035
MIM04M24Gene08324       588
MIM04M24Gene08325       468
MIM04M26Gene00001       1770
MIM04M26Gene00002       930
MIM04M26Gene00003       594
MIM04M26Gene00004       426
MIM04M26Gene00005       1002
MIM04M26Gene00006       792
MIM04M26Gene00007       1125
MIM04M26Gene00008       4041
MIM04M26Gene00009       6537
MIM04M26Gene00010       309
MIM04M26Gene00011       1293
MIM04M26Gene00012       282
MIM04M26Gene00013       765
MIM04M26Gene00014       1680
MIM04M26Gene00015       1134
MIM04M26Gene00016       648
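To see which field a[-2] and a[1] actually select in step 2, it helps to split one line by hand. The sketch below uses an exon line from the example GTF (with tabs restored). It also shows an alternative that the original script does not use: looking up gene_id by name with a regular expression, so that the attribute order no longer matters.

```python
import re

line = ('Morimp01GS001\tAUGUSTUS\texon\t47073\t47158\t.\t-\t.\t'
        'gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; '
        'gene_name "MIM04M26Gene00002";\n')

a = line.split('"')
print(a[1])    # first quoted value: gene_id
print(a[3])    # second quoted value: transcript_id
print(a[-2])   # last quoted value: gene_name on this line

# Alternative (an assumption, not the original approach): match gene_id by name.
match = re.search(r'gene_id "([^"]+)"', line)
gene_id = match.group(1) if match else None
print(gene_id)
```

If your GTF ends each line with a different attribute, a[-2] will pick up that attribute instead, which is the failure described in step 2.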

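Step 3:

Python script: Caculate_FPKM_v2.py (the command at the top calls it Caculate_FPKM.py; use whichever name you saved the file under)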

import sys

totals_path, gene_len_path, matrix_path, out_path = sys.argv[1:5]

# Total mapped reads of each sample, in file order.
# The header line must start with "sample" and is skipped.
mapped_reads = []
with open(totals_path) as f:
    for line in f:
        line = line.strip("\n")
        if not line or line.lower().startswith("sample"):
            continue
        mapped_reads.append(int(line.split("\t")[1]))
print(len(mapped_reads))  # number of samples read

# Summed exon length of each gene.
gene_len = {}
with open(gene_len_path) as f:
    for line in f:
        a = line.strip("\n").split("\t")
        if len(a) >= 2:
            gene_len[a[0]] = int(a[1])

with open(matrix_path) as matrix, open(out_path, "w") as out:
    for line in matrix:
        line = line.strip("\n")
        if not line:
            continue
        if line.lower().startswith("geneid"):
            out.write(line + "\n")  # copy the header row unchanged
            continue
        a = line.split("\t")
        gene = a[0]
        if gene not in gene_len:
            # No length known for this gene: report it once and skip the row.
            sys.stderr.write("no length for {}\n".format(gene))
            continue
        values = []
        for j, total in enumerate(mapped_reads):
            count = int(a[j + 1])
            try:
                # FPKM = count * 1e6 / (total mapped reads * gene length in kb)
                fpkm = (count * 1000000.0) / (total * (gene_len[gene] / 1000.0))
            except ZeroDivisionError:
                fpkm = 0
                sys.stderr.write("zero total or length for {}\n".format(gene))
            values.append("{}".format(fpkm))
        out.write(gene + "\t" + "\t".join(values) + "\n")

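Input file f1 (mapped_gene_number.txt in the command at the top) holds the total number of reads mapped in each sample, in a column named mapped_reads. Note that the header of the first column must start with "sample". The script pairs samples with matrix columns by position, so the rows must be in the same order as the sample columns of the read_counts matrix.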

sample  mapped_reads
NG-1    19410286
NG-2    19299615
NG-3    18635477
NY-1    21263905
NY-2    20375441
NY-4    17433317
WG-1    1488987
WG-2    19497358
WG-3    18309507
WY-1    20455991
WY-2    18644498
WY-3    31863298
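Because the script pairs the rows of this file with the matrix columns purely by position, a mismatch in order silently gives wrong values. One way to guard against that is to key the totals by sample name and look them up from the matrix header. A minimal sketch under that assumption; `read_mapped_totals` and `totals_in_header_order` are hypothetical helpers, not part of the original script.

```python
def read_mapped_totals(lines):
    """Parse 'sample<TAB>mapped_reads' lines into {sample_name: total}."""
    totals = {}
    for line in lines:
        line = line.strip()
        if not line or line.lower().startswith("sample"):
            continue  # skip blank lines and the header
        name, total = line.split("\t")[:2]
        totals[name] = int(total)
    return totals


def totals_in_header_order(header_line, totals):
    """Return the totals ordered like the sample columns of the matrix header."""
    samples = header_line.strip().split("\t")[1:]  # drop the Geneid column
    return [totals[s] for s in samples]  # raises KeyError if a sample is missing


# The rows are deliberately out of order relative to the header.
totals = read_mapped_totals(["sample\tmapped_reads", "NG-2\t19299615", "NG-1\t19410286"])
print(totals_in_header_order("Geneid\tNG-1\tNG-2", totals))  # [19410286, 19299615]
```

A missing sample name raises KeyError here, which is usually preferable to a silently misaligned column.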

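Input file f2, gene.len, holds the summed exon length of each gene (see the example earlier).

Input file f3, raw_counts.matrix, is the read_counts matrix. Note that the header of the first column is Geneid. Example: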

Geneid  NG-1    NG-2    NG-3    NY-1    NY-2    NY-4    WG-1    WG-2    WG-3    WY-1    WY-2    WY-3
TraesCS1A01G000100.1    0       1       0       3       10      22      0       3       2       32      75      51
TraesCS1A01G000200.1    1       0       0       0       3       16      0       2       2       17      14      25
TraesCS1A01G000300.1    0       0       1       0       2       7       0       2       0       4       10      20
TraesCS1A01G000400.1    2       2       1       0       6       58      1       20      8       14      42      44
TraesCS1A01G000500.1    1       0       2       0       2       6       0       1       4       1       4       17
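Every gene ID in this matrix needs a matching entry in gene.len, otherwise step 3 cannot compute a value for it. A quick pre-flight check can list the IDs that would fail. This is only a sketch; `missing_gene_ids` is a hypothetical helper and the IDs below are made up for illustration.

```python
def missing_gene_ids(matrix_lines, length_lines):
    """Return gene IDs present in the count matrix but absent from gene.len."""
    known = {line.split("\t")[0] for line in length_lines if line.strip()}
    missing = []
    for line in matrix_lines:
        gene = line.rstrip("\n").split("\t")[0]
        if not gene or gene.lower().startswith("geneid"):
            continue  # skip blank lines and the header row
        if gene not in known:
            missing.append(gene)
    return missing


matrix = ["Geneid\tNG-1\n", "geneA\t5\n", "geneB\t0\n"]
lengths = ["geneA\t930\n"]
print(missing_gene_ids(matrix, lengths))  # ['geneB']
```

An empty list means every row of the matrix has a gene length available.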

The final output is the FPKM matrix. Example:

Geneid  NG-1    NG-2    NG-3    NY-1    NY-2    NY-4    WG-1    WG-2    WG-3    WY-1    WY-2    WY-3
TraesCS5A01G005100.1    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     
TraesCS2B01G450000.1    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     
TraesCS4A01G432900.1    0.28    0.17    0.23    0.0     0.0     0.0     0.72    0.11    0.15    0.0     0.0     0.0     
TraesCSU01G138200.1     0.0     0.35    0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0     
TraesCS5D01G490900.1    1.09    1.47    0.91    0.0     0.14    0.0     0.0     0.8     0.93    0.0     0.08    0.13   
