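A script I wrote while cleaning up data. CD-HIT is used to remove redundancy with the identity threshold set to 100%, and only the clusters that contain more than one sequence are kept.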
cd1.txt:
From the file shown above, pick out the clusters that contain 100%-identical sequences and write them to cd2.txt.
cd2.txt
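# Copy every multi-member cluster from cd1.txt to cd2.txt.
with open('cd1.txt', 'r') as f:
    lines = f.readlines()

with open('cd2.txt', 'w') as f_w:
    # Use a while loop: reassigning i inside "for i in range(...)" has no effect.
    i = 0
    while i < len(lines):
        # A cluster with more than one member has a line starting with '0'
        # immediately followed by a line starting with '1'.
        if (i > 0 and i + 1 < len(lines)
                and lines[i][0] == '0' and lines[i + 1][0] == '1'):
            # Write the cluster header (the '>' line just before member 0).
            f_w.write(lines[i - 1])
            print(i - 1, "***", lines[i - 1])
            # Write every member line up to the next header or the end of the file.
            while i < len(lines) and lines[i][0] != '>':
                f_w.write(lines[i])
                print(i, "***", lines[i])
                i = i + 1
        else:
            i = i + 1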
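
The same filtering can also be done by first grouping the lines into clusters and then keeping the groups with two or more members, which avoids the index arithmetic. This is only a rough sketch: the function name `multi_member_clusters` and the sample lines are made up for illustration, and it assumes every cluster starts with a line beginning with `>` followed by one line per member, as in the script above.

```python
def multi_member_clusters(lines):
    """Group lines by their '>' header and keep groups with at least two members."""
    clusters = []
    current = None
    for line in lines:
        if line.startswith('>'):
            # A new header starts a new cluster.
            current = [line]
            clusters.append(current)
        elif current is not None and line.strip():
            # Any non-blank line after a header is a member of that cluster.
            current.append(line)
    # One header line plus at least two member lines.
    return [c for c in clusters if len(c) >= 3]


# Made-up sample in the layout the script expects (not real data).
sample = [
    ">Cluster 0\n",
    "0\t120aa, >seqA... *\n",
    ">Cluster 1\n",
    "0\t98aa, >seqB... *\n",
    "1\t98aa, >seqC... at 100.00%\n",
]

kept = multi_member_clusters(sample)
for cluster in kept:
    print("".join(cluster), end="")
```

To use it on the real file, read cd1.txt with `readlines()`, pass the list to the function, and write each returned cluster to cd2.txt with `writelines()`.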