Rosetta Protein Prediction Tools-----clean_pdb.py

Rosetta Protein Prediction Tools-----clean_pdb.py_第1张图片

输入文件(也就是PDB文件)

–在这里只给出一部分能够说明问题即可

ATOM      1  N   MET A   1      36.644 -24.949   8.853  1.00 29.12           N  
ATOM      2  CA  MET A   1      36.942 -23.581   8.984  1.00 19.55           C  
ATOM      3  C   MET A   1      35.712 -22.887   9.526  1.00 22.27           C  
ATOM      4  O   MET A   1      34.626 -23.375   9.258  1.00 18.31           O  
ATOM      5  CB  MET A   1      37.365 -23.090   7.599  1.00  8.40           C  
ATOM      6  CG  MET A   1      37.639 -21.603   7.644  1.00 30.36           C  
ATOM      7  SD  MET A   1      39.309 -21.106   7.226  1.00 39.80           S  
ATOM      8  CE  MET A   1      40.241 -22.126   8.356  1.00 44.83           C  
ATOM      9  N   ASN A   2      35.890 -21.796  10.310  1.00 17.74           N  
ATOM     10  CA  ASN A   2      34.809 -21.015  10.918  1.00  5.91           C  
ATOM     11  C   ASN A   2      35.236 -19.557  10.931  1.00 11.34           C  
ATOM     12  O   ASN A   2      36.390 -19.244  10.620  1.00  9.58           O  
ATOM     13  CB  ASN A   2      34.487 -21.602  12.355  1.00  9.68           C  
ATOM     14  CG  ASN A   2      35.645 -21.566  13.309  1.00 11.82           C  
ATOM     15  OD1 ASN A   2      36.176 -20.496  13.515  1.00 18.42           O  
ATOM     16  ND2 ASN A   2      36.013 -22.685  13.919  1.00  9.43           N  
ATOM     17  N   ILE A   3      34.315 -18.689  11.287  1.00  6.65           N  
ATOM     18  CA  ILE A   3      34.543 -17.262  11.353  1.00 10.61           C  
ATOM     19  C   ILE A   3      35.794 -16.849  12.198  1.00 14.73           C  

Python脚本文件

–在这里也就是我们的clean_pdb.py
–该有注释的地方我也给了注释,由于python对于中文编码的问题比较麻烦,索性我就都用英文写了注释。

import sys
import os
from sys import argv, stderr, stdout
from os import popen, system
from os.path import exists, basename
from optparse import OptionParser

# Local package imports''''''

from amino_acids import longer_names
from amino_acids import modres

# remote host for downloading pdbs
remote_host = ''

shit_stat_insres = False
shit_stat_altpos = False
shit_stat_modres = False
shit_stat_misdns = False  # missing density!

fastaseq = {}
pdbfile = ""


def download_pdb(pdb_id, dest_dir):
    # print "downloading %s" % ( pdb_id )
    url = 'http://www.rcsb.org/pdb/files/%s.pdb.gz' % (pdb_id.upper())
    dest = '%s/%s.pdb.gz' % (os.path.abspath(dest_dir), pdb_id)
    wget_cmd = 'wget --quiet %s -O %s' % (url, dest)
    print wget_cmd
    if remote_host:
        wget_cmd = 'ssh %s %s' % (remote_host, wget_cmd)

    lines = popen(wget_cmd).readlines()
    if (exists(dest)):
        return dest
    else:
        print "Error: didn't download file!"


def check_and_print_pdb(count, residue_buffer, residue_letter):
    # declare the global variable
    global pdbfile 
  # Check that CA, N and C are present!def check_and_print_pdb( outid, residue_buffer )
    hasCA = False
    hasN = False
    hasC = False
    for line in residue_buffer:
        atomname = line[12:16]
        # Only add bb atoms if they have occupancy!
        occupancy = float(line[55:60])
        # From this three if,we can see that residue must contains "CA","N","C"
        if atomname == " CA " and occupancy > 0.0:
            hasCA = True
        if atomname == " N  " and occupancy > 0.0:
            hasN = True
        if atomname == " C  " and occupancy > 0.0:
            hasC = True

  # if all three backbone atoms are present withoccupancy proceed to print the residue
    if hasCA and hasN and hasC:
        for line in residue_buffer:
            # add linear residue count
            newnum = '%4d ' % count
            line_edit = line[0:22] + newnum + line[27:]
            # write the residue line
            pdbfile = pdbfile + line_edit

    # finally print residue letter into fasta strea
        chain = line[21]
        try:
            fastaseq[chain] += residue_letter
        except KeyError:
            fastaseq[chain] = residue_letter
    # count up residue number
        count = count + 1
        return True
    return False

# get the filename of the PDB file
def get_pdb_filename( name ):
    '''Tries various things to get the filename to use.
    Returns None if no acceptable file exists.'''
    if( os.path.exists( name ) ):
        return name
    if( os.path.exists( name + '.pdb' ) ):
        return name + '.pdb'
    if( os.path.exists( name + '.pdb.gz' ) ):
        return name + '.pdb.gz'
    if( os.path.exists( name + '.pdb1.gz' ) ):
        return name + '.pdb1.gz'
    name = name.upper()
    if( os.path.exists( name ) ):
        return name
    if( os.path.exists( name + '.pdb' ) ):
        return name + '.pdb'
    if( os.path.exists( name + '.pdb.gz' ) ):
        return name + '.pdb.gz'
    if( os.path.exists( name + '.pdb1.gz' ) ):
        return name + '.pdb1.gz'
    # No acceptable file found
    return None

# open the ".gz" file and ".pdb" file ,return the file:lines and filename:stem
def open_pdb( name ):
    '''Open the PDB given in the filename (or equivalent).
    If the file is not found, then try downloading it from the internet.

    Returns: (lines, filename_stem)
    '''
    filename = get_pdb_filename( name )
    if filename is not None:
        print "Found existing PDB file at", filename
    else:
        print "File for %s doesn't exist, downloading from internet." % (name)
        filename = download_pdb(name[0:4].upper(), '.')
        global files_to_unlink
        files_to_unlink.append(filename)

    stem = os.path.basename(filename)
    if stem[-3:] == '.gz':
        stem = stem[:-3]
    if stem[-5:] == '.pdb1':
        stem = stem[:-5]
    if stem[-4:] == '.pdb':
        stem = stem[:-4]

    if filename[-3:] == '.gz':
        lines = popen('zcat '+filename, 'r').readlines()
    else:
        lines = open(filename, 'r').readlines()

    return lines, stem

#############################################
# Program Start
#############################################

**parser = OptionParser(usage="%prog [options]  ",
        description=__doc__)
parser.add_option("--nopdbout", action="store_true",
        help="Don't output a PDB.")
parser.add_option("--allchains", action="store_true",
        help="Use all the chains from the input PDB.")
parser.add_option("--removechain", action="store_true",
        help="Remove chain information from output PDB.")
parser.add_option("--keepzeroocc", action="store_true",
        help="Keep zero occupancy atoms in output.")
options, args = parser.parse_args()**

if 'nopdbout' in args:
    options.nopdbout = True
    args.remove('nopdbout')

if 'ignorechain' in args:
    options.allchains = True
    #Don't remove, because we're also using it as a chain designator

if 'nochain' in args:
    options.removechain = True
    options.allchains = True
    #Don't remove, because we're also using it as a chain designator

if len(args) != 2:
    parse.error("Must specify both the pdb and the chain id")

files_to_unlink = []

if args[1].strip() != "ignorechain" and args[1].strip() != "nochain":
    chainid = args[1].upper()
else:
    chainid = args[1]

lines, filename_stem = open_pdb( args[0] )

oldresnum = '   '
count = 1

residue_buffer = []
residue_letter = ''

if chainid == '_':
    chainid = ' '

for line in lines:

    if line.startswith('ENDMDL'): break  # Only take the first NMR model
    if len(line) > 21 and ( line[21] in chainid or options.allchains):
        if line[0:4] != "ATOM" and line[0:6] != 'HETATM':
            continue

        #copy the line which is going to be edited
        line_edit = line
        #get the residue of the line
        resn = line[17:20]

        # Is it a modified residue ?
        # (Looking for modified residues in both ATOM and HETATM records is deliberate)
        if modres.has_key(resn):
            # if so replace it with its canonical equivalent !
            orig_resn = resn
            resn = modres[resn]
            line_edit = 'ATOM  '+line[6:17]+ resn + line[20:]

            if orig_resn == "MSE":
                # don't count MSE as modified residues for flagging purposes (because they're so common)
                # Also, fix up the selenium atom naming
                if (line_edit[12:14] == 'SE'):
                    line_edit = line_edit[0:12]+' S'+line_edit[14:]
                if len(line_edit) > 75:
                    if (line_edit[76:78] == 'SE'):
                        line_edit = line_edit[0:76]+' S'+line_edit[78:]
            else:
                shit_stat_modres = True

        # Only process residues we know are valid.
        if not longer_names.has_key(resn):
            continue

        resnum = line_edit[22:27]

        # Is this a new residue
        if not resnum == oldresnum:
            if residue_buffer != []:  # is there a residue in the buffer ?
                if not check_and_print_pdb(count, residue_buffer, residue_letter):
                    # if unsuccessful
                    shit_stat_misdns = True
                else:
                    count = count + 1

            residue_buffer = []
            residue_letter = longer_names[resn]

        oldresnum = resnum

        insres = line[26]
        if insres != ' ':
            shit_stat_insres = True

        altpos = line[16]
        if altpos != ' ':
            shit_stat_altpos = True
            if altpos == 'A':
                line_edit = line_edit[:16]+' '+line_edit[17:]
            else:
                # Don't take the second and following alternate locations
                continue

        if options.removechain:
            line_edit = line_edit[:21]+' '+line_edit[22:]

        if options.keepzeroocc:
            line_edit = line_edit[:55] +" 1.00"+ line_edit[60:]

        residue_buffer.append(line_edit)


if residue_buffer != []: # is there a residue in the buffer ?
    if not check_and_print_pdb(count, residue_buffer, residue_letter):
        # if unsuccessful
        shit_stat_misdns = True
    else:
        count = count + 1

flag_altpos = "---"
if shit_stat_altpos:
    flag_altpos = "ALT"
flag_insres = "---"
if shit_stat_insres:
    flag_insres = "INS"
flag_modres = "---"
if shit_stat_modres:
    flag_modres = "MOD"
flag_misdns = "---"
if shit_stat_misdns:
    flag_misdns = "DNS"

nres = len("".join(fastaseq.values()))

flag_successful = "OK"
if nres <= 0:
    flag_successful = "BAD"

if chainid == ' ':
    chainid = '_'

print filename_stem, "".join(chainid), "%5d" % nres, flag_altpos,  flag_insres,  flag_modres,  flag_misdns, flag_successful

if nres > 0:
    if not options.nopdbout:
        # outfile = string.lower(pdbname[0:4]) + chainid + pdbname[4:]
        outfile = filename_stem + "_" + chainid + ".pdb"

        outid = open(outfile, 'w')
        outid.write(pdbfile)
        outid.write("TER\n")
        outid.close()

    fastaid = stdout
    if not options.allchains:
        for chain in fastaseq:
            fastaid.write('>'+filename_stem+"_"+chain+'\n')
            fastaid.write(fastaseq[chain])
            fastaid.write('\n')
            handle = open(filename_stem+"_"+"".join(chain) + ".fasta", 'w')
            handle.write('>'+filename_stem+"_"+"".join(chain)+'\n')
            handle.write(fastaseq[chain])
            handle.write('\n')
            handle.close()
    else:
        fastaseq = ["".join(fastaseq.values())]
        fastaid.write('>'+filename_stem+"_"+chainid+'\n')
        fastaid.writelines(fastaseq)
        fastaid.write('\n')
        handle = open(filename_stem+"_"+chainid + ".fasta", 'w')
        handle.write('>'+filename_stem+"_"+chainid+'\n')
        handle.writelines(fastaseq)
        handle.write('\n')
        handle.close()

if len(files_to_unlink) > 0:
    for file in files_to_unlink:
        os.unlink(file)

–文中加粗的地方是我学习了这个脚本后印象最深刻的地方,这个可以定义终端操作以及操作参数,这个很巧妙,我们也可以自己定义自己的Python脚本工具使用的命令。

输出

–通过./clean_pdb.py 2LZM A运行上述脚本后会生成一个PDB文件和一个fasta氨基酸序列文件,PDB文件较之前的PDB文件,没有了前面的描述文件,具体大家可上PDB BANK 搜索2LZM.pdb下载查看。

stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py 2LZM A
File for 2LZM doesn't exist, downloading from internet.
wget --quiet http://www.rcsb.org/pdb/files/2LZM.pdb.gz -O /home/stern/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts/2LZM.pdb.gz
2LZM A   164 --- --- MOD --- OK
>2LZM_A
MNIFEMLRIDEGLRLKIYKDTEGYYTIGIGHLLTKSPSLNAAKSELDKAIGRNCNGVITKDEAEKLFNQDVDAAVRGILRNAKLKPVYDSLDAVRRCALINMVFQMGETGVAGFTNSLRMLQQKRWDEAAVNLAKSRWYNQTPNRAKRVITTFRTGTWDAYKNL

stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py R001 nochain
Found existing PDB file at R001.pdb
R001 nochain   306 --- --- MOD DNS OK
>R001_nochain
APQDPRNLPIRQQMEALIRRKQAEITQGLESIDTVKFHADTWTRGNDGGGGTSMVIQDGTTFEKGGVNVSVVYGQLSPAAVSAMKADHFFACGLSMVIHPVNPHAPTTHLNYRYFETWNQDGTPQTWWFGGGADLTPSYLYEEDGQLFHQLHKDALDKHDTALYPRFKKWCDEYFYITHRKETR
GIGGIFFDDYDERDPQEILKMVEDCFDAFLPSYLTIVKRRKDMPYTKEEQQWQAIRRGRYVEFNLIYDRGTQFGLRTPGSRVESILMSLPEHASWLYNHHPAPGSREAKLLEVTTKPREWVK

–下面的这个R001.pdb是我自己做预测用的,PDB BANK上面是没有的。
–今天虽然是只做了这一点点事,但是自己通过细细查看这个脚本文件,正确的运行并使用了这个工具,得到了自己所需要的结果和文件,然后就是学会了自定义Python脚本在终端运行的操作命令。刚运行时出现了一大堆错,贴出来,给大家看看。

–这个就是编码的错误

stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py R001 ignorechain
  File "./clean_pdb.py", line 68
SyntaxError: Non-ASCII character '\xe5' in file ./clean_pdb.py on line 68, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

–这个是文件名过长的问题,这个代码中定义了PDB文件的名是4位的,所以当我的文件名是5位时发生了错误。

stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py R0001 A
File for R0001 doesn't exist, downloading from internet.
wget --quiet http://www.rcsb.org/pdb/files/R000.pdb.gz -O /home/stern/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts/R000.pdb.gz

gzip: /home/stern/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts/R000.pdb.gz: unexpected end of file
R000 A     0 --- --- --- --- BAD

–这个是缺少操作符的。

stern@stern-OptiPlex-7040:~/Documents/rosetta_bin_linux_2017.08.59291_bundle/tools/protein_tools/scripts$ ./clean_pdb.py R001
Traceback (most recent call last):
  File "./clean_pdb.py", line 191, in <module>
    parse.error("Must specify both the pdb and the chain id")
NameError: name 'parse' is not defined
   以上博客,是学习Rosetta Commons生物大分子预测所得,由于网上关于蛋白质预测的资料很少,而且有关于Rosetta的资料更是少之又少,所以自己花一个小时的时间,把自己一天的所看所想所思写出来,给大家分享交流,如有不同意见的欢迎互相交流,我也是一枚入门的小菜鸟,听说每天写一个博客总结一下自己的一天,这样坚持一两年就能找到好工作,于是我就来写啦,哈哈哈哈,奥,忘了讲了,我现在研一刚结束,希望坚持到自己毕业能找到好工作。
   明天见!!!

I CAN

你可能感兴趣的:(Rosetta Protein Prediction Tools-----clean_pdb.py)