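I rewrote the script that converts a read-count matrix into an FPKM matrix. The previous version only worked for 18 samples; this update removes the limit on the number of samples.
It still takes three steps: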
grep "exon" genome.gtf > genome_exon.gtf
python count_genelen_from_gft.py genome_exon.gtf gene.len
python Caculate_FPKM.py mapped_gene_number.txt gene.len raw_counts.matrix FPKM.matrix
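Step one: extract the exon records from the GTF file.
Step two: compute the summed exon length of each gene.
Step three: build the FPKM matrix from the total number of mapped reads per sample, the length of each gene, and the read_counts matrix.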
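For reference, the calculation performed in step three can be written as below. The symbols are introduced here only for illustration and do not appear in the scripts: $C_{g,s}$ is the raw read count of gene $g$ in sample $s$, $N_s$ is the total number of mapped reads in sample $s$, and $L_g$ is the summed exon length of gene $g$ in bp.

```latex
\[
\mathrm{FPKM}_{g,s}
  = \frac{C_{g,s} \times 10^{6}}{N_s \times \left( L_g / 1000 \right)}
  = \frac{C_{g,s} \times 10^{9}}{N_s \, L_g}
\]
```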
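Step one:
Just run the first line of the commands above, after changing the input GTF file name to your own. Note that grep "exon" matches the string anywhere on the line, so if your GTF carries attributes such as exon_number on non-exon records, filter on the third column instead, for example with awk -F'\t' '$3 == "exon"'.
Example GTF: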
Morimp01GS001 AUGUSTUS start_codon 47156 47158 . - 0 gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS exon 47073 47158 . - . gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS CDS 47073 47158 0.6 - 0 gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS exon 46887 46999 . - . gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS CDS 46887 46999 0.92 - 1 gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS exon 46644 46784 . - . gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS CDS 46644 46784 0.91 - 2 gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS exon 45970 46559 . - . gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS CDS 45970 46559 0.92 - 2 gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS stop_codon 45970 45972 . - 0 gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS start_codon 54027 54029 . + 0 gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001 AUGUSTUS exon 54027 54291 . + . gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001 AUGUSTUS CDS 54027 54291 0.59 + 0 gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001 AUGUSTUS exon 54361 54689 . + . gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001 AUGUSTUS CDS 54361 54689 0.22 + 2 gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Morimp01GS001 AUGUSTUS stop_codon 54687 54689 . + 0 gene_id "MIM04M26Gene00003"; transcript_id "MIM04M26Gene00003.t1"; gene_name "MIM04M26Gene00003";
Example output:
Morimp01GS001 AUGUSTUS exon 16797 16949 . + . gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001 AUGUSTUS exon 17024 17219 . + . gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001 AUGUSTUS exon 17298 18718 . + . gene_id "MIM04M26Gene00001"; transcript_id "MIM04M26Gene00001.t1"; gene_name "MIM04M26Gene00001";
Morimp01GS001 AUGUSTUS exon 47073 47158 . - . gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
Morimp01GS001 AUGUSTUS exon 46887 46999 . - . gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; gene_name "MIM04M26Gene00002";
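If grep is not available, or the GTF has attributes that contain the word "exon", the same filtering can be sketched in Python by testing the feature column directly. This is a minimal sketch, not part of the original scripts: the function name keep_exon_lines is made up for illustration, the records are shortened versions of the example above, and the exon_number attribute on the CDS record is a hypothetical addition to show the difference.

```python
def keep_exon_lines(lines):
    """Yield only GTF records whose feature type (column 3) is exactly 'exon'."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip comment/header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 9 and fields[2] == "exon":
            yield line


records = [
    'Morimp01GS001\tAUGUSTUS\tstart_codon\t47156\t47158\t.\t-\t0\tgene_id "MIM04M26Gene00002";\n',
    'Morimp01GS001\tAUGUSTUS\texon\t47073\t47158\t.\t-\t.\tgene_id "MIM04M26Gene00002";\n',
    # hypothetical attribute, added only to show where a plain substring match goes wrong
    'Morimp01GS001\tAUGUSTUS\tCDS\t47073\t47158\t0.6\t-\t0\tgene_id "MIM04M26Gene00002"; exon_number "1";\n',
]

kept = list(keep_exon_lines(records))              # column-based filter
substring_hits = [r for r in records if "exon" in r]  # what a plain substring match keeps

print(len(kept), len(substring_hits))
```

The column-based filter keeps only the exon record, while the substring match also keeps the CDS record because of its exon_number attribute.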
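Step two:
Some readers reported problems running this step. The usual cause is which field the code picks as the gene ID. I select a[-2] after splitting the line on double quotes ("). If that picks the wrong field for your GTF, change a[-2] to a[1]. Either way, the selected field must be the one that holds the gene name.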
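To see which index is right for a given GTF, it helps to split one record and print the candidates. The snippet below is a small sketch using one exon line from the example above; the helper gene_id_of is hypothetical and simply shows a regular-expression alternative that does not depend on attribute order.

```python
import re

line = ('Morimp01GS001\tAUGUSTUS\texon\t47073\t47158\t.\t-\t.\t'
        'gene_id "MIM04M26Gene00002"; transcript_id "MIM04M26Gene00002.t1"; '
        'gene_name "MIM04M26Gene00002";\n')

a = line.split('"')
print(a[1])    # first quoted value: gene_id
print(a[3])    # second quoted value: transcript_id
print(a[-2])   # last quoted value: gene_name in this GTF


def gene_id_of(gtf_line):
    """Return the gene_id attribute value, or None if the line has none."""
    m = re.search(r'gene_id "([^"]+)"', gtf_line)
    return m.group(1) if m else None


print(gene_id_of(line))
```

In this GTF both a[1] and a[-2] give the gene name, but in a file whose last attribute is something else only a[1] (or the regular expression) would.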
Contents of the count_genelen_from_gft.py script:
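import sys

file1 = sys.argv[1]  # exon-only GTF from step one
file2 = sys.argv[2]  # output: gene name <TAB> summed exon length

with open(file1, 'r') as f1, open(file2, 'w') as f2:
    flag = None  # gene currently being summed; None until the first record
    exon = []    # exon lengths collected for that gene
    for i in f1:
        if not i.strip() or i.startswith("#"):
            continue  # skip blank and comment lines
        # a[-2] is the last quoted attribute value (gene_name in the example GTF);
        # use a[1] instead if the gene ID is the first quoted value
        a = i.split("\"")
        pos = i.split("\t")
        if flag is not None and flag != a[-2]:
            # a new gene starts: write out the finished one
            f2.write("{0}\t{1}\n".format(flag, sum(exon)))
            exon = []
        flag = a[-2]
        # GTF coordinates are 1-based and inclusive, hence the +1
        exon.append(abs(int(pos[4]) - int(pos[3])) + 1)
    if flag is not None:
        # write the last gene, which no following record would trigger
        f2.write("{0}\t{1}\n".format(flag, sum(exon)))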
Example of the gene.len output:
MIM04M24Gene00599 2898
MIM04M24Gene00600 1035
MIM04M24Gene08324 588
MIM04M24Gene08325 468
MIM04M26Gene00001 1770
MIM04M26Gene00002 930
MIM04M26Gene00003 594
MIM04M26Gene00004 426
MIM04M26Gene00005 1002
MIM04M26Gene00006 792
MIM04M26Gene00007 1125
MIM04M26Gene00008 4041
MIM04M26Gene00009 6537
MIM04M26Gene00010 309
MIM04M26Gene00011 1293
MIM04M26Gene00012 282
MIM04M26Gene00013 765
MIM04M26Gene00014 1680
MIM04M26Gene00015 1134
MIM04M26Gene00016 648
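The lengths above can be checked by hand from the exon coordinates shown in step one. The sketch below sums exon lengths per gene with a dictionary, so unlike the script it does not require the records of one gene to be adjacent. The function name exon_length_sums is made up for illustration; the coordinates are copied from the GTF examples above.

```python
from collections import defaultdict


def exon_length_sums(exon_records):
    """Sum inclusive exon lengths per gene from (gene, start, end) tuples."""
    totals = defaultdict(int)
    for gene, start, end in exon_records:
        totals[gene] += abs(end - start) + 1
    return dict(totals)


records = [
    ("MIM04M26Gene00001", 16797, 16949),
    ("MIM04M26Gene00001", 17024, 17219),
    ("MIM04M26Gene00001", 17298, 18718),
    ("MIM04M26Gene00002", 47073, 47158),
    ("MIM04M26Gene00002", 46887, 46999),
    ("MIM04M26Gene00002", 46644, 46784),
    ("MIM04M26Gene00002", 45970, 46559),
    ("MIM04M26Gene00003", 54027, 54291),
    ("MIM04M26Gene00003", 54361, 54689),
]

lengths = exon_length_sums(records)
for gene in sorted(lengths):
    print(gene, lengths[gene], sep="\t")
# MIM04M26Gene00001	1770
# MIM04M26Gene00002	930
# MIM04M26Gene00003	594
```

These match the corresponding rows of gene.len. Note that this sum, like the script, counts an exon once per record, so a gene with several transcripts that share exons would get a length larger than its non-redundant exon length.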
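Step three:
Python script: Caculate_FPKM_v2.py (the command at the top calls it Caculate_FPKM.py, so use whichever file name you saved the script under)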
import sys

file1 = sys.argv[1]  # total mapped reads of each sample
file2 = sys.argv[2]  # summed exon length of each gene
file3 = sys.argv[3]  # raw read-count matrix
file4 = sys.argv[4]  # output file: FPKM matrix

# Total mapped reads per sample, in the same order as the count-matrix columns
arrf1 = []
with open(file1, 'r') as f1:
    for i in f1:
        i = i.strip("\n")
        if not i or i.lower().startswith("sample"):
            continue  # skip blank lines and the header line
        a = i.split("\t")
        arrf1.append(int(a[1]))
print(len(arrf1))  # number of samples read

# Gene name -> summed exon length in bp
genelen = {}
with open(file2, 'r') as f2:
    for i in f2:
        i = i.strip("\n")
        if not i:
            continue
        a = i.split("\t")
        genelen[a[0]] = int(a[1])

with open(file3, 'r') as f3, open(file4, 'w') as f4:
    for i in f3:
        i = i.strip("\n")
        if not i:
            continue
        if i.lower().startswith("geneid"):
            f4.write(i + "\n")  # copy the header line unchanged
            continue
        a = i.split("\t")
        gene = a[0]
        if gene not in genelen:
            print(gene)  # no length available: report the gene and skip it
            continue
        row = []
        for j in range(len(arrf1)):
            try:
                # count * 1e6 / (mapped reads * gene length in kb)
                b = (int(a[j + 1]) * 1000000.0) / (arrf1[j] * (genelen[gene] / 1000.0))
            except ZeroDivisionError:
                b = 0
                print(gene)  # zero mapped reads or zero gene length
            row.append("{}".format(b))
        f4.write(gene + "\t" + "\t".join(row) + "\n")
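The f1 input file (mapped_gene_number.txt in the command above) holds the total number of reads mapped in each sample. The header of the first column must start with sample. Because the script treats every line that starts with "sample" (in any letter case) as a header, the sample names themselves must not start with that word: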
sample mapped_reads
NG-1 19410286
NG-2 19299615
NG-3 18635477
NY-1 21263905
NY-2 20375441
NY-4 17433317
WG-1 1488987
WG-2 19497358
WG-3 18309507
WY-1 20455991
WY-2 18644498
WY-3 31863298
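The f2 input file, gene.len, holds the summed exon length of each gene (see the example above).
The f3 input file, raw_counts.matrix, is the read_counts matrix. The header of its first column must be Geneid. Example: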
Geneid NG-1 NG-2 NG-3 NY-1 NY-2 NY-4 WG-1 WG-2 WG-3 WY-1 WY-2 WY-3
TraesCS1A01G000100.1 0 1 0 3 10 22 0 3 2 32 75 51
TraesCS1A01G000200.1 1 0 0 0 3 16 0 2 2 17 14 25
TraesCS1A01G000300.1 0 0 1 0 2 7 0 2 0 4 10 20
TraesCS1A01G000400.1 2 2 1 0 6 58 1 20 8 14 42 44
TraesCS1A01G000500.1 1 0 2 0 2 6 0 1 4 1 4 17
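The script pairs the j-th row of the mapped-reads file with the j-th sample column of the count matrix purely by position, so the two files need to list the samples in the same order. A quick check along the following lines can catch a mismatch before running step three. This is a sketch under that assumption; the function name sample_order_mismatches is made up for illustration and the inline data is a shortened copy of the examples.

```python
def sample_order_mismatches(mapped_lines, matrix_header):
    """Return (position, name_in_f1, name_in_f3) for each sample that does not line up."""
    samples = [
        line.split("\t")[0]
        for line in mapped_lines
        if line.strip() and not line.lower().startswith("sample")
    ]
    columns = matrix_header.rstrip("\n").split("\t")[1:]
    mismatches = []
    for idx, name in enumerate(samples):
        column = columns[idx] if idx < len(columns) else None
        if name != column:
            mismatches.append((idx, name, column))
    return mismatches


mapped_lines = [
    "sample\tmapped_reads\n",
    "NG-1\t19410286\n",
    "NG-2\t19299615\n",
    "NG-3\t18635477\n",
]
good_header = "Geneid\tNG-1\tNG-2\tNG-3\n"
swapped_header = "Geneid\tNG-2\tNG-1\tNG-3\n"

print(sample_order_mismatches(mapped_lines, good_header))     # []
print(sample_order_mismatches(mapped_lines, swapped_header))  # two mismatches, at positions 0 and 1
```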
The final output is the FPKM matrix. Example:
Geneid NG-1 NG-2 NG-3 NY-1 NY-2 NY-4 WG-1 WG-2 WG-3 WY-1 WY-2 WY-3
TraesCS5A01G005100.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TraesCS2B01G450000.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TraesCS4A01G432900.1 0.28 0.17 0.23 0.0 0.0 0.0 0.72 0.11 0.15 0.0 0.0 0.0
TraesCSU01G138200.1 0.0 0.35 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
TraesCS5D01G490900.1 1.09 1.47 0.91 0.0 0.14 0.0 0.0 0.8 0.93 0.0 0.08 0.13
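The calculation behind each cell of the matrix can be isolated into a small function, which makes it easy to spot-check individual values. This is a minimal sketch of the same expression used in the script; the function name fpkm and the numbers in the example call are illustrative and not taken from the data above.

```python
def fpkm(count, mapped_reads, gene_length_bp):
    """FPKM = count * 1e6 / (mapped reads * gene length in kb); 0.0 if a denominator is zero."""
    if mapped_reads == 0 or gene_length_bp == 0:
        return 0.0
    return count * 1000000.0 / (mapped_reads * (gene_length_bp / 1000.0))


# 50 reads on a 1000 bp gene in a sample with 20 million mapped reads
value = fpkm(50, 20000000, 1000)
print(value)  # 2.5
```

Doubling either the gene length or the number of mapped reads halves the result, which is the normalization FPKM is meant to provide.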