The latest version of this post is at https://blog.csdn.net/qq_26012913/article/details/111939262?spm=1001.2014.3001.5501
First, extract the exon records from the GTF file:
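# Match on the feature column (column 3) so that records which only mention "exon" in their attributes are not kept.
awk -F'\t' '$3 == "exon"' genome.gtf > genome_exon.gtf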
Next, sum the exon lengths of each gene in genome_exon.gtf. That sum is the gene length we want:
python count_genelen_from_gft.py genome_exon.gtf gene.len
The code of count_genelen_from_gft.py is as follows:
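import sys

in_path = sys.argv[1]   # exon-only GTF
out_path = sys.argv[2]  # output: gene ID <TAB> summed exon length

current = None  # gene whose exons are currently being summed
total = 0

with open(in_path) as gtf, open(out_path, "w") as out:
    for line in gtf:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # The gene ID is taken as the last quoted value on the line, which
        # assumes the gene ID is the last attribute in column 9.
        gene = line.split('"')[-2]
        # Lines of one gene must be consecutive; a new ID closes the old gene.
        if current is not None and gene != current:
            out.write("{0}\t{1}\n".format(current, total))
            total = 0
        current = gene
        # GTF coordinates are 1-based and inclusive, hence the +1.
        total += abs(int(fields[4]) - int(fields[3])) + 1
    # Write the last gene, which the loop itself never flushes.
    if current is not None:
        out.write("{0}\t{1}\n".format(current, total))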
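One caveat: the script above adds up every exon line of a gene, so if a gene has several transcripts, exons shared between them are counted more than once. Below is a minimal sketch of an alternative that merges overlapping exon intervals per gene before summing. The names `merged_length` and `gene_lengths` are made up for illustration, and the sketch assumes every exon line carries a `gene_id "..."` attribute and that a gene ID is unique to one chromosome.

```python
import re
from collections import defaultdict

# Assumption: the attribute column contains gene_id "<ID>".
GENE_ID = re.compile(r'gene_id "([^"]+)"')


def merged_length(intervals):
    """Bases covered by 1-based inclusive intervals, overlaps counted once."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            # Disjoint from the running interval: close it, start a new one.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def gene_lengths(gtf_lines):
    """Map gene ID -> merged exon length from an iterable of GTF lines."""
    exons = defaultdict(list)
    for line in gtf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = GENE_ID.search(fields[8])
        if match:
            start, end = sorted((int(fields[3]), int(fields[4])))
            exons[match.group(1)].append((start, end))
    return {gene: merged_length(iv) for gene, iv in exons.items()}
```

Calling `gene_lengths(open("genome.gtf"))` would return a dictionary that can be written out in the same two-column layout as gene.len. Whether merged or summed exon length is the better choice depends on how the read counts were produced.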
This gives us the length of each gene, stored in the file gene.len, e.g.:
MIM04M24Gene00599	2898
MIM04M24Gene00600	1035
MIM04M24Gene08324	588
MIM04M24Gene08325	468
MIM04M26Gene00001	1770
MIM04M26Gene00002	930
MIM04M26Gene00003	594
MIM04M26Gene00004	426
MIM04M26Gene00005	1002
MIM04M26Gene00006	792
MIM04M26Gene00007	1125
MIM04M26Gene00008	4041
MIM04M26Gene00009	6537
MIM04M26Gene00010	309
MIM04M26Gene00011	1293
MIM04M26Gene00012	282
MIM04M26Gene00013	765
MIM04M26Gene00014	1680
MIM04M26Gene00015	1134
MIM04M26Gene00016	648
We also need a file with the number of mapped reads for each sample. Its content looks like this:
Total Mapped reads reads number
A1A 18836863
A1B 15478037
A1C 19394549
A2A 19976617
A2B 15964986
A2C 19685810
A3A 18080220
A3B 16627794
A3C 20205794
A4A 16867356
A4B 16409921
A4C 19966924
A5A 17322230
A5B 15118648
A5C 19086094
A6A 17352130
A6B 16489332
A6C 19940296
Next, here is my read count matrix file, which I named raw_counts.matrix.
File content, e.g.:
A1A A1B A1C A2A A2B A2C A3A A3B A3C A4A A4B A4C A5A A5B A5C A6A A6B A6C
MIM04M24Gene00599 334 179 300 532 261 376 238 284 312 306 191 260 105 187 191 204 177
MIM04M24Gene00600 98 58 80 134 84 122 44 47 65 20 23 27 9 16 16 51 12
MIM04M24Gene08324 13 7 16 19 11 16 15 12 30 19 16 16 11 8 15 29 16
MIM04M24Gene08325 18 18 13 18 21 25 37 30 45 26 32 36 23 22 28 56 31
MIM04M26Gene00001 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
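The FPKM script pairs counts with samples by position, so every data row needs exactly one count per sample column in the header. As a minimal sketch, the hypothetical helper `check_matrix_rows` below lists the rows whose width differs from the header. It assumes the header holds only sample names (no label for the gene column), as in the example above, and it splits on any whitespace.

```python
def check_matrix_rows(lines):
    """Return (gene, n_counts, n_samples) for rows that do not match the header."""
    samples = lines[0].split()
    bad = []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        n_counts = len(fields) - 1  # the first field is the gene ID
        if n_counts != len(samples):
            bad.append((fields[0], n_counts, len(samples)))
    return bad
```

It can be run on the real file with `check_matrix_rows(open("raw_counts.matrix").read().splitlines())`; an empty list means every row has the expected width.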
With these three files as input, the following script produces the FPKM matrix:
python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix
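For reference, the quantity computed for each gene and sample is FPKM = read count × 10^9 / (mapped reads of the sample × gene length in bp). The sketch below is a stand-alone illustration with a made-up function name `fpkm`, not part of the pipeline. It plugs in the numbers shown above for gene MIM04M24Gene00599 in sample A1A, and returns 0 when the length or the read total is zero.

```python
def fpkm(count, gene_length_bp, mapped_reads):
    """FPKM = count * 1e9 / (mapped reads * gene length in bp)."""
    if gene_length_bp == 0 or mapped_reads == 0:
        return 0.0  # avoid dividing by zero
    return count * 1e9 / (mapped_reads * gene_length_bp)


# 334 reads, gene length 2898 bp, 18836863 mapped reads in sample A1A.
value = fpkm(count=334, gene_length_bp=2898, mapped_reads=18836863)
print(round(value, 2))
```

This works out to roughly 6.12.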
The content of the Caculate_FPKM.py script is pasted below:
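import sys

mapped_path, len_path, counts_path, out_path = sys.argv[1:5]

# Mapped read totals per sample, in file order. Sample names in this data
# set start with "A"; any other line (the header) is skipped.
mapped = []
with open(mapped_path) as f:
    for line in f:
        if line.startswith("A"):
            mapped.append(int(line.rstrip("\n").split("\t")[1]))

# Gene ID -> summed exon length in bp.
gene_len = {}
with open(len_path) as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            gene_len[fields[0]] = int(fields[1])

with open(counts_path) as f, open(out_path, "w") as out:
    for line in f:
        line = line.rstrip("\n")
        # Gene IDs in this data set start with "M"; anything else is
        # treated as the header and copied through on its own line.
        if not line.startswith("M"):
            out.write(line + "\n")
            continue
        fields = line.split("\t")
        gene, counts = fields[0], fields[1:]
        if gene not in gene_len:
            continue  # no length available, so FPKM cannot be computed
        row = []
        # The j-th count column is paired with the j-th sample total.
        for count, total in zip(counts, mapped):
            try:
                value = (int(count) * 1000000.0) / (total * (gene_len[gene] / 1000.0))
            except ZeroDivisionError:
                value = 0
            row.append("{}".format(value))
        out.write(gene + "\t" + "\t".join(row) + "\n")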
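Note that the script matches count columns to mapped read totals purely by position: the j-th column of the matrix is divided by the j-th sample line of the mapped reads file. A minimal sketch for checking that both files list the samples in the same order is given below. The function name `sample_order_mismatches` is hypothetical, and the sketch assumes the matrix header contains only sample names and that each sample line of the mapped reads file is a name followed by an integer.

```python
def sample_order_mismatches(matrix_header, mapped_lines):
    """List (position, matrix_name, mapped_name) where the two files disagree."""
    matrix_names = matrix_header.split()
    mapped_names = []
    for line in mapped_lines:
        fields = line.split()
        # Keep only "name <integer>" lines; this skips the header line.
        if len(fields) == 2 and fields[1].isdigit():
            mapped_names.append(fields[0])
    mismatches = []
    for pos in range(max(len(matrix_names), len(mapped_names))):
        a = matrix_names[pos] if pos < len(matrix_names) else None
        b = mapped_names[pos] if pos < len(mapped_names) else None
        if a != b:
            mismatches.append((pos, a, b))
    return mismatches
```

An empty result means the two files agree; any entry points at a column that would be normalized with the wrong sample total.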
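Finally, here is a complete, foolproof script. Put the GTF file, the mapped_reads file, the read_counts file and the two Python scripts in one directory and run it there.
The full script is as follows: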
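# Match on the feature column (column 3) so that records which only mention "exon" in their attributes are not kept.
awk -F'\t' '$3 == "exon"' genome.gtf > genome_exon.gtf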
python count_genelen_from_gft.py genome_exon.gtf gene.len
python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix
I hope this helps. If you run into problems, you can send me an email at [email protected]