秘籍 | 计算基因组的有效大小(effective genome size)

写在前面

  • 最近在ATAC-Seq中call peaks的时候发现需要用到基因组的有效大小(effective genome size)

方法一

自己编写脚本

#coding=utf-8
import sys
aList=[]
fa_file = sys.argv[1]
with open(fa_file,'r') as f:
    for line in f:
        line = line.strip()
        line = line.upper()
        if not line.startswith(">"):
            baseA = line.count("A")
            baseT = line.count("T")
            baseC = line.count("C")
            baseG = line.count("G")
            aList.extend([baseA, baseT, baseC, baseG])
            # print(aList)
    print("effective_genome_size =", sum(aList))

运行脚本,脚本后接基因组名

python genomeSize.py m38

运行结果

effective_genome_size = 2652783500
  • 需要统计总染色体大小,在循环中加上baseN = line.count("N"),修改aList.extend([baseA, baseT, baseC, baseG, baseN])

方法二(推荐)

使用统计软件gnx

#下载与编译  
git clone https://github.com/mh11/gnx-tools.git
cd gnx-tools
mkdir bin 
javac -d bin/ src/uk/ac/ebi/gnx/* 
# 没装ant,请安装,链接:https://downloads.apache.org/ant/binaries/
# wget https://downloads.apache.org/ant/binaries/apache-ant-1.10.10-bin.tar.gz
# tar -zvxf apache-ant-1.10.10-bin.tar.gz
# ant程序 在/apache-ant-1.10.10/bin里面
ant -f package.xml
#使用方法
java -jar gnx.jar 基因组名

软件使用

java -jar gnx.jar m38

运行结果

Total number of sequences:          66
Total length of sequences:          2730871774 bp
Shortest sequence length :          1976 bp
Longest sequence length  :          195471971 bp
Total number of Ns in sequences:    78088274
N50:    130694993   (9 sequences)   (1442871972 bp combined)

结果验证,使用 - 查看有效基因组大小

echo "2730871774-78088274"|bc
2652783500

可见,与上面py脚本运行结果一致

你可能感兴趣的:(秘籍 | 计算基因组的有效大小(effective genome size))