–在这里只给出一部分能够说明问题即可
ATOM 1 N MET A 1 36.644 -24.949 8.853 1.00 29.12 N
ATOM 2 CA MET A 1 36.942 -23.581 8.984 1.00 19.55 C
ATOM 3 C MET A 1 35.712 -22.887 9.526 1.00 22.27 C
ATOM 4 O MET A 1 34.626 -23.375 9.258 1.00 18.31 O
ATOM 5 CB MET A 1 37.365 -23.090 7.599 1.00 8.40 C
ATOM 6 CG MET A 1 37.639 -21.603 7.644 1.00 30.36 C
ATOM 7 SD MET A 1 39.309 -21.106 7.226 1.00 39.80 S
ATOM 8 CE MET A 1 40.241 -22.126 8.356 1.00 44.83 C
ATOM 9 N ASN A 2 35.890 -21.796 10.310 1.00 17.74 N
ATOM 10 CA ASN A 2 34.809 -21.015 10.918 1.00 5.91 C
ATOM 11 C ASN A 2 35.236 -19.557 10.931 1.00 11.34 C
ATOM 12 O ASN A 2 36.390 -19.244 10.620 1.00 9.58 O
ATOM 13 CB ASN A 2 34.487 -21.602 12.355 1.00 9.68 C
ATOM 14 CG ASN A 2 35.645 -21.566 13.309 1.00 11.82 C
ATOM 15 OD1 ASN A 2 36.176 -20.496 13.515 1.00 18.42 O
ATOM 16 ND2 ASN A 2 36.013 -22.685 13.919 1.00 9.43 N
ATOM 17 N ILE A 3 34.315 -18.689 11.287 1.00 6.65 N
ATOM 18 CA ILE A 3 34.543 -17.262 11.353 1.00 10.61 C
ATOM 19 C ILE A 3 35.794 -16.849 12.198 1.00 14.73 C
–在这里也就是我们的clean_pdb.py
–该有注释的地方我也给了注释,由于python对于中文编码的问题比较麻烦,索性我就都用英文写了注释。
import sys
import os
from sys import argv, stderr, stdout
from os import popen, system
from os.path import exists, basename
from optparse import OptionParser
# Local package imports''''''
from amino_acids import longer_names
from amino_acids import modres
# remote host for downloading pdbs
remote_host = ''
shit_stat_insres = False
shit_stat_altpos = False
shit_stat_modres = False
shit_stat_misdns = False # missing density!
fastaseq = {}
pdbfile = ""
def download_pdb(pdb_id, dest_dir):
# print "downloading %s" % ( pdb_id )
url = 'http://www.rcsb.org/pdb/files/%s.pdb.gz' % (pdb_id.upper())
dest = '%s/%s.pdb.gz' % (os.path.abspath(dest_dir), pdb_id)
wget_cmd = 'wget --quiet %s -O %s' % (url, dest)
print wget_cmd
if remote_host:
wget_cmd = 'ssh %s %s' % (remote_host, wget_cmd)
lines = popen(wget_cmd).readlines()
if (exists(dest)):
return dest
else:
print "Error: didn't download file!"
def check_and_print_pdb(count, residue_buffer, residue_letter):
# declare the global variable
global pdbfile
# Check that CA, N and C are present!def check_and_print_pdb( outid, residue_buffer )
hasCA = False
hasN = False
hasC = False
for line in residue_buffer:
atomname = line[12:16]
# Only add bb atoms if they have occupancy!
occupancy = float(line[55:60])
# From this three if,we can see that residue must contains "CA","N","C"
if atomname == " CA " and occupancy > 0.0:
hasCA = True
if atomname == " N " and occupancy > 0.0:
hasN = True
if atomname == " C " and occupancy > 0.0:
hasC = True
# if all three backbone atoms are present withoccupancy proceed to print the residue
if hasCA and hasN and hasC:
for line in residue_buffer:
# add linear residue count
newnum = '%4d ' % count
line_edit = line[0:22] + newnum + line[27:]
# write the residue line
pdbfile = pdbfile + line_edit
# finally print residue letter into fasta strea
chain = line[21]
try:
fastaseq[chain] += residue_letter
except KeyError:
fastaseq[chain] = residue_letter
# count up residue number
count = count + 1
return True
return False
# get the filename of the PDB file
def get_pdb_filename( name ):
'''Tries various things to get the filename to use.
Returns None if no acceptable file exists.'''
if( os.path.exists( name ) ):
return name
if( os.path.exists( name + '.pdb' ) ):
return name + '.pdb'
if( os.path.exists( name + '.pdb.gz' ) ):
return name + '.pdb.gz'
if( os.path.exists( name + '.pdb1.gz' ) ):
return name + '.pdb1.gz'
name = name.upper()
if( os.path.exists( name ) ):
return name
if( os.path.exists( name + '.pdb' ) ):
return name + '.pdb'
if( os.path.exists( name + '.pdb.gz' ) ):
return name + '.pdb.gz'
if( os.path.exists( name + '.pdb1.gz' ) ):
return name + '.pdb1.gz'
# No acceptable file found
return None
# open the ".gz" file and ".pdb" file ,return the file:lines and filename:stem
def open_pdb( name ):
'''Open the PDB given in the filename (or equivalent).
If the file is not found, then try downloading it from the internet.
Returns: (lines, filename_stem)
'''
filename = get_pdb_filename( name )
if filename is not None:
print "Found existing PDB file at", filename
else:
print "File for %s doesn't exist, downloading from internet." % (name)
filename = download_pdb(name[0:4].upper(), '.')
global files_to_unlink
files_to_unlink.append(filename)
stem = os.path.basename(filename)
if stem[-3:] == '.gz':
stem = stem[:-3]
if stem[-5:] == '.pdb1':
stem = stem[:-5]
if stem[-4:] == '.pdb':
stem = stem[:-4]
if filename[-3:] == '.gz':
lines = popen('zcat '+filename, 'r').readlines()
else:
lines = open(filename, 'r').readlines()
return lines, stem
#############################################
# Program Start
#############################################
**parser = OptionParser(usage="%prog [options] " ,
description=__doc__)
parser.add_option("--nopdbout", action="store_true",
help="Don't output a PDB.")
parser.add_option("--allchains", action="store_true",
help="Use all the chains from the input PDB.")
parser.add_option("--removechain", action="store_true",
help="Remove chain information from output PDB.")
parser.add_option("--keepzeroocc", action="store_true",
help="Keep zero occupancy atoms in output.")
options, args = parser.parse_args()**
if 'nopdbout' in args:
options.nopdbout = True
args.remove('nopdbout')
if 'ignorechain' in args:
options.allchains = True
#Don't remove, because we're also using it as a chain designator
if 'nochain' in args:
options.removechain = True
options.allchains = True
#Don't remove, because we're also using it as a chain designator
if len(args) != 2:
parse.error("Must specify both the pdb and the chain id")
files_to_unlink = []
if args[1].strip() != "ignorechain" and args[1].strip() != "nochain":
chainid = args[1].upper()
else:
chainid = args[1]
lines, filename_stem = open_pdb( args[0] )
oldresnum = ' '
count = 1
residue_buffer = []
residue_letter = ''
if chainid == '_':
chainid = ' '
for line in lines:
if line.startswith('ENDMDL'): break # Only take the first NMR model
if len(line) > 21 and ( line[21] in chainid or options.allchains):
if line[0:4] != "ATOM" and line[0:6] != 'HETATM':
continue
#copy the line which is going to be edited
line_edit = line
#get the residue of the line
resn = line[17:20]
# Is it a modified residue ?
# (Looking for modified residues in both ATOM and HETATM records is deliberate)
if modres.has_key(resn):
# if so replace it with its canonical equivalent !
orig_resn = resn
resn = modres[resn]
line_edit = 'ATOM '+line[6:17]+ resn + line[20:]
if orig_resn == "MSE":
# don't count MSE as modified residues for flagging purposes (because they're so common)
# Also, fix up the selenium atom naming
if (line_edit[12:14] == 'SE'):
line_edit = line_edit[0:12]+' S'+line_edit[14:]
if len(line_edit) > 75:
if (line_edit[76:78] == 'SE'):
line_edit = line_edit[0:76]+' S'+line_edit[78:]
else:
shit_stat_modres = True
# Only process residues we know are valid.
if not longer_names.has_key(resn):
continue
resnum = line_edit[22:27]
# Is this a new residue
if not resnum == oldresnum:
if residue_buffer != []: # is there a residue in the buffer ?
if not check_and_print_pdb(count, residue_buffer, residue_letter):
# if unsuccessful
shit_stat_misdns = True
else:
count = count + 1
residue_buffer = []
residue_letter = longer_names[resn]
oldresnum = resnum
insres = line[26]
if insres != ' ':
shit_stat_insres = True
altpos = line[16]
if altpos != ' ':
shit_stat_altpos = True
if altpos == 'A':
line_edit = line_edit[:16]+' '+line_edit[17:]
else:
# Don't take the second and following alternate locations
continue
if options.removechain:
line_edit = line_edit[:21]+' '+line_edit[22:]
if options.keepzeroocc:
line_edit = line_edit[:55] +" 1.00"+ line_edit[60:]
residue_buffer.append(line_edit)
if residue_buffer != []: # is there a residue in the buffer ?
if not check_and_print_pdb(count, residue_buffer, residue_letter):
# if unsuccessful
shit_stat_misdns = True
else:
count = count + 1
flag_altpos = "---"
if shit_stat_altpos:
flag_altpos = "ALT"
flag_insres = "---"
if shit_stat_insres:
flag_insres = "INS"
flag_modres = "---"
if shit_stat_modres:
flag_modres = "MOD"
flag_misdns = "---"
if shit_stat_misdns:
flag_misdns = "DNS"
nres = len("".join(fastaseq.values()))
flag_successful = "OK"
if nres <= 0:
flag_successful = "BAD"
if chainid == ' ':
chainid = '_'
print filename_stem, "".join(chainid), "%5d" % nres, flag_altpos, flag_insres, flag_modres, flag_misdns, flag_successful
if nres > 0:
if not options.nopdbout:
# outfile = string.lower(pdbname[0:4]) + chainid + pdbname[4:]
outfile = filename_stem + "_" + chainid + ".pdb"
outid = open(outfile, 'w')
outid.write(pdbfile)
outid.write("TER\n")
outid.close()
fastaid = stdout
if not options.allchains:
for chain in fastaseq:
fastaid.write('>'+filename_stem+"_"+chain+'\n')
fastaid.write(fastaseq[chain])
fastaid.write('\n')
handle = open(filename_stem+"_"+"".join(chain) + ".fasta", 'w')
handle.write('>'+filename_stem+"_"+"".join(chain)+'\n')
handle.write(fastaseq[chain])
handle.write('\n')
handle.close()
else:
fastaseq = ["".join(fastaseq.values())]
fastaid.write('>'+filename_stem+"_"+chainid+'\n')
fastaid.writelines(fastaseq)
fastaid.write('\n')
handle = open(filename_stem+"_"+chainid + ".fasta", 'w')
handle.write('>'+filename_stem+"_"+chainid+'\n')
handle.writelines(fastaseq)
handle.write('\n')
handle.close()
if len(files_to_unlink) > 0:
for file in files_to_unlink:
os.unlink(file)
–文中加粗的地方是我学习了这个脚本后印象最深刻的地方,这个可以定义终端操作以及操作参数,这个很巧妙,我们也可以自己定义自己的Python脚本工具使用的命令。
–通过./clean_pdb.py 2LZM A运行上述脚本后会生成一个PDB文件和一个fasta氨基酸序列文件,PDB文件较之前的PDB文件,没有了前面的描述文件,具体大家可上PDB BANK 搜索2LZM.pdb下载查看。
stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py 2LZM A
File for 2LZM doesn't exist, downloading from internet.
wget --quiet http://www.rcsb.org/pdb/files/2LZM.pdb.gz -O /home/stern/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts/2LZM.pdb.gz
2LZM A 164 --- --- MOD --- OK
>2LZM_A
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL
stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py R001 nochain
Found existing PDB file at R001.pdb
R001 nochain 306 --- --- MOD DNS OK
>R001_nochain
APQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTWTRGNDGGGGTSMVIQDGTTFEKGGVNVSVVYGQLSPAAVSAMKADHFFACGLSMVIHPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEEDGQLFHQLHKDALDKHDTALYPRFKKWCDEYFYITHRKETR
GIGGIFFDDYDERDPQEILKMVEDCFDAFLPSYLTIVKRRKDMPYTKEEQQWQAIRRGRYVEFNLIYDRGTQFGLRTPGSRVESILMSLPEHASWLYNHHPAPGSREAKLLEVTTKPREWVK
–下面的这个R001.pdb是我自己做预测用的,PDB BANK上面是没有的。
–今天虽然是只做了这一点点事,但是自己通过细细查看这个脚本文件,正确的运行并使用了这个工具,得到了自己所需要的结果和文件,然后就是学会了自定义Python脚本在终端运行的操作命令。刚运行时出现了一大堆错,贴出来,给大家看看。
–这个就是编码的错误
stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py R001 ignorechain
File "./clean_pdb.py", line 68
SyntaxError: Non-ASCII character '\xe5' in file ./clean_pdb.py on line 68, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
–这个是文件名过长的问题,这个代码中定义了PDB文件的名是4位的,所以当我的文件名是5位时发生了错误。
stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py R0001 A
File for R0001 doesn't exist, downloading from internet.
wget --quiet http://www.rcsb.org/pdb/files/R000.pdb.gz -O /home/stern/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts/R000.pdb.gz
gzip: /home/stern/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts/R000.pdb.gz: unexpected end of file
R000 A 0 --- --- --- --- BAD
–这个是缺少操作符的。
stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py R001
Traceback (most recent call last):
File "./clean_pdb.py", line 191, in <module>
parse.error("Must specify both the pdb and the chain id")
NameError: name 'parse' is not defined
以上博客,是学习Rosetta Commons生物大分子预测所得,由于网上关于蛋白质预测的资料很少,而且有关于Rosetta的资料更是少之又少,所以自己花一个小时的时间,把自己一天的所看所想所思写出来,给大家分享交流,如有不同意见的欢迎互相交流,我也是一枚入门的小菜鸟,听说每天写一个博客总结一下自己的一天,这样坚持一两年就能找到好工作,于是我就来写啦,哈哈哈哈,奥,忘了讲了,我现在研一刚结束,希望坚持到自己毕业能找到好工作。
明天见!!!