AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)

欢迎关注我的CSDN:https://spike.blog.csdn.net/
本文地址:https://blog.csdn.net/caroline_wendy/article/details/130089781

整体:

RCSB PDB 数据集是一个收集了蛋白质的三维结构信息的数据库,是世界蛋白质数据库(wwPDB)的成员之一,也是生物学和医学领域第一个开放访问的数字数据资源库。RCSB PDB 数据集不仅提供了来自蛋白质数据银行(PDB)档案的实验确定的3D结构,还提供了来自 AlphaFold DB 和 ModelArchive 的计算结构模型(CSM)。用户可以利用 RCSB PDB 数据集提供的各种工具和资源,根据序列、结构和功能的注释进行简单和高级搜索,可视化、下载和分析这些分子,并且在外部注释的背景下,探索生物学的结构视角 。

关于 Complex 和 Multimer 的差别:

组合在一起形成功能性基团的复合物,通常会叫Complex,比如Antibody-antigen Complex 或 Ligand-receptor Complex;Multimer通常指堆在一起的非单体情况,不一定有真正的结合或可以发挥功能,只是结构上在一起,比如aggregation发生时,通常会提到Monomer/Multimer。总体,Multimer在生物学上用的不多。以上我的理解和习惯,也可能不同的文章中会有人混用,尤其是非母语文章,也不能算是错误。

RCSB:Research Collaboratory for Structural Bioinformatics,即结构生物信息学的研究合作实验室。

  • 官网:https://www.pdbus.org/
  • 目前,已经有 202,467 (2023.3.21) 个PDB结构。
  • Vision:To expand the frontiers of fundamental biology, biomedicine, energy sciences, and biotechnology through open and sustainable access to the 3D structure, function, and evolution of biological macromolecules contained in the Protein Data Bank (PDB) archive.
  • 愿景:扩展基础生物学、生物医学、能源科学和生物技术的前沿方向,通过开放的和可持续的访问 PDB 档案中所包括的生物大分子的 3D 结构、功能和进化。

1. RCSB PDB

PDB全量数据:最后更新日期 2022.4.13

  • 全量链数:593491,约59万
  • 全量PDB数:183653个PDB结构,实际包括182703个,略有差异,相差950个,约18万

标签如下:

  1. id:行号,从0开始
  2. pdb:PDB编号,例如 3eo1
  3. resolution:分辨率,例如 3.1
  4. release_date:发布日期,例如 2008-12-02
  5. seq:序列,例如 ETVLTQSPGT…
  6. len:序列长度,例如 215
  7. chain_type:链的类型,例如 k是kappa型轻链,l是lamda
  8. bcr_or_tcr:BCR或TCR或none

第1个示例,PDB 3eo1,包括0-11,一共12行,即12个链。具体数据如下:

bid,pdb,chain,resolution,release_date,seq,len,chain_type,bcr_or_tcr
0,3eo1,A,3.1,2008-12-02,ETVLTQSPGTLSLSPGERATLSCRASQSLGSSYLAWYQQKPGQAPRLLIYGASSRAPGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYADSPITFGQGTRLEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC,215,k,BCR
1,3eo1,B,3.1,2008-12-02,QVQLVQSGAEVKKPGSSVKVSCKASGYTFSSNVISWVRQAPGQGLEWMGGVIPIVDIANYAQRFKGRVTITADESTSTTYMELSSLRSEDTAVYYCASTLGLVLDAMDYWGQGTLVTVSSASTKGPSVFPLAPCSESTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTKTYTCNVDHKPSNTKVDKRVES,216,h,BCR
2,3eo1,C,3.1,2008-12-02,ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS,112,protein,none
3,3eo1,D,3.1,2008-12-02,ETVLTQSPGTLSLSPGERATLSCRASQSLGSSYLAWYQQKPGQAPRLLIYGASSRAPGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYADSPITFGQGTRLEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC,215,k,BCR
4,3eo1,E,3.1,2008-12-02,QVQLVQSGAEVKKPGSSVKVSCKASGYTFSSNVISWVRQAPGQGLEWMGGVIPIVDIANYAQRFKGRVTITADESTSTTYMELSSLRSEDTAVYYCASTLGLVLDAMDYWGQGTLVTVSSASTKGPSVFPLAPCSESTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTKTYTCNVDHKPSNTKVDKRVES,216,h,BCR
5,3eo1,F,3.1,2008-12-02,ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS,112,protein,none
6,3eo1,G,3.1,2008-12-02,ETVLTQSPGTLSLSPGERATLSCRASQSLGSSYLAWYQQKPGQAPRLLIYGASSRAPGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYADSPITFGQGTRLEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC,215,k,BCR
7,3eo1,H,3.1,2008-12-02,QVQLVQSGAEVKKPGSSVKVSCKASGYTFSSNVISWVRQAPGQGLEWMGGVIPIVDIANYAQRFKGRVTITADESTSTTYMELSSLRSEDTAVYYCASTLGLVLDAMDYWGQGTLVTVSSASTKGPSVFPLAPCSESTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTKTYTCNVDHKPSNTKVDKRVES,216,h,BCR
8,3eo1,I,3.1,2008-12-02,ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS,112,protein,none
9,3eo1,J,3.1,2008-12-02,ETVLTQSPGTLSLSPGERATLSCRASQSLGSSYLAWYQQKPGQAPRLLIYGASSRAPGIPDRFSGSGSGTDFTLTISRLEPEDFAVYYCQQYADSPITFGQGTRLEIKRTVAAPSVFIFPPSDEQLKSGTASVVCLLNNFYPREAKVQWKVDNALQSGNSQESVTEQDSKDSTYSLSSTLTLSKADYEKHKVYACEVTHQGLSSPVTKSFNRGEC,215,k,BCR
10,3eo1,K,3.1,2008-12-02,QVQLVQSGAEVKKPGSSVKVSCKASGYTFSSNVISWVRQAPGQGLEWMGGVIPIVDIANYAQRFKGRVTITADESTSTTYMELSSLRSEDTAVYYCASTLGLVLDAMDYWGQGTLVTVSSASTKGPSVFPLAPCSESTAALGCLVKDYFPEPVTVSWNSGALTSGVHTFPAVLQSSGLYSLSSVVTVPSSSLGTKTYTCNVDHKPSNTKVDKRVES,216,h,BCR
11,3eo1,L,3.1,2008-12-02,ALDTNYCFRNLEENCCVRPLYIDFRQDLGWKWVHEPKGYYANFCSGPCPYLRSADTTHSTVLGLYNTLNPEASASPCCVPQDLEPLTILYYVGRTPKVEQLSNMVVKSCKCS,112,protein,none

PDB提取的FASTA文件与标签文件一致,例如3EO1 PDB:
AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第1张图片

2. Resolution

Resolution Range: 0.48 ~ 70.0

Resolution (chain-level):

  • bins: [15729,1119,150876,272060,111062,21324,4037,4357,3500,2636,6791], sum: 593491
  • Empty: 15729, 2.65%
  • High(0~3): 1119+150876+272060 = 424055, 71.45%
  • Middle(3~5): 111062+21324 = 132386, 22.31%
  • Low(>5): 4037+4357+3500+2636+6791 = 21321, 3.59%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第2张图片
Resolution (PDB-level):

  • bins: [12347,810,70325,79230,15929,2179,346,375,307,197,658], sum: 182703
  • Empty: 12347, 6.76%
  • High(0~3): 810+70325+79230 = 150365, 82.30%
  • Middle(3~5): 15929+2179 = 18108, 9.91%
  • Low(>5): 346+375+307+197+658 = 1883, 1.03%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第3张图片

3. Seq. Len.

链长分布:0 ~ 4433。

异常数据,链长是0,包括38447个,数据如下:

id   pdb chain  resolution release_date  seq  len chain_type bcr_or_tcr
19  6dts     C         1.5   2018-09-19  NaN    0    protein       none
20  6dts     D         1.5   2018-09-19  NaN    0    protein       none
69  6v8x     M         3.0   2020-02-05  NaN    0    protein       none
70  6v8x     N         3.0   2020-02-05  NaN    0    protein       none
71  6v8x     O         3.0   2020-02-05  NaN    0    protein       none

数据分布:

  • 标签 0: 81019, 100: 140791, 200: 132580, 300: 75717, 400: 42730, 500: 19507, 600: 8027, 700: 6335, 800: 2849, 900: 2017, 1000: 6730
  • len >= 20, 518302, 87.33%;len < 20, 75189, 12.66%
  • Short(20~100): 81019, 15.63%
  • Normal(100~300): 140791+132580 = 273371, 52.75%
  • Long(300~500): 75717+42730 = 118447, 22.85%
  • Very Long(>500): 19507+8027+6335+2849+2017+6730 = 45465, 8.77%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第4张图片

蛋白质的链长大于20

蛋白质至少包含一个长多肽。短多肽,含有少于20-30个残基,很少被认为是蛋白质,通常被称为肽。

4. Antibody

chain_type: [‘k’ ‘h’ ‘protein’ ‘l’ ‘a’ ‘b’ ‘d’ ‘g’]

  • 其中,k和l是轻链,protein是抗原或其他蛋白质
  • a\b\d\g是TCR的链
  • a: 721, b: 824, d: 54, g: 28, h: 10762, k: 6831, l: 2143, protein: 572128
  • Percentage of DB: 21363/593491 = 3.60%
  • BCR (19736);TCR (1627)

lypz 实例如下:

  • g:F和H是 T cell Receptor Gamma Chain,T细胞受体 γ \gamma γ
  • d:E和G是 T cell Receptor Delta,T细胞受体 δ \delta δ

标签:

22355       22355  1ypz     A         3.4   2005-04-12  GSHSLRYFYTAVSRPGLGEPWFIIVGYVDDMQVLRFSSKEETPRMA...  260    protein       none
22356       22356  1ypz     B         3.4   2005-04-12  ADPIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNG...  102    protein       none
22357       22357  1ypz     C         3.4   2005-04-12  GSHSLRYFYTAVSRPGLGEPWFIIVGYVDDMQVLRFSSKEETPRMA...  260    protein       none
22358       22358  1ypz     D         3.4   2005-04-12  ADPIQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNG...  102    protein       none
22359       22359  1ypz     E         3.4   2005-04-12  GDQVEQSPSALSLHEGTDSALRCNFTTTMRSVQWFRQNSRGSLISL...  207          d        TCR
22360       22360  1ypz     F         3.4   2005-04-12  HGKLEQPEISISRPRDETAQISCKVFIESFRSVTIHWYRQKPNQGL...  230          g        TCR
22361       22361  1ypz     G         3.4   2005-04-12  GDQVEQSPSALSLHEGTDSALRCNFTTTMRSVQWFRQNSRGSLISL...  207          d        TCR
22362       22362  1ypz     H         3.4   2005-04-12  HGKLEQPEISISRPRDETAQISCKVFIESFRSVTIHWYRQKPNQGL...  230          g        TCR
22363       22363  1ypz     I         3.4   2005-04-12                                                NaN    0    protein       none
22364       22364  1ypz     J         3.4   2005-04-12                                                NaN    0    protein       none
22365       22365  1ypz     K         3.4   2005-04-12                                                NaN    0    protein       none
22366       22366  1ypz     L         3.4   2005-04-12                                                NaN    0    protein       none
22367       22367  1ypz     M         3.4   2005-04-12                                                NaN    0    protein       none

LYPZ PDB结构:
AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第5张图片

Chain Type 数据分布:
AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第6张图片
BCR or TCR:

  • bcr or tcr type: [‘none’ ‘BCR’ ‘TCR’]
  • BCR: 3308, TCR: 186, none: 179209
  • Percentage of DB: 3494/182703 = 1.91%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第7张图片

5. Complex / Multimer

Chain 清洗前593491,清洗后357216;PDB 清洗前182703,清洗后140320。清洗方法:

df = df.loc[df['len'] >= 20]
df = df.loc[df['len'] <= 500]
df = df.loc[df["resolution"].fillna(-1).astype(int) > 0]
df = df.loc[df["resolution"] <= 3]

具体分析:

  • complex chain range: 1 ~ 55
  • clean pdb (357216):20 <= seq len <=500;resolution <= 3
  • 链长范围:1: 57033, 2: 46973, 3: 6594, 4: 17094, 5: 1141, 6: 4703, 7: 301, 8: 2801, 9: 224, 10: 3456, sum: 140320
  • Monomer: 57033, 40.64%
  • Multimer(2~4): 46973+6594+17094 = 70661, 50.35%
  • Multimer(>=5): 1141+4703+301+2801+224+3456 = 12626, 9.00%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第8张图片

在全部的复合物 (83287) 中,包括同源多聚体和异源多聚体:

  • Homo Multimer: 21721, 26.08%
  • Hetero Multimer: 83287, 73.92%

AI制药 - RCSB PDB 数据集的多维度分析与整理 (1)_第9张图片

6. 参考

  • Stack Overfolw - Convert floats to ints in Pandas?
  • Stack Overfolw - histogram: setting y-axis label for pandas
  • Stack Overfolw - Matplotlib histogram with multiple legend entries
  • PDB - Resolution
  • Pandas: How to Combine Rows with Same Column Values
  • Stack Overflow - Selecting multiple columns in a Pandas dataframe
  • Stack Overflow - How to center labels in histogram plot
  • Control the color of barplots built with matplotlib
  • Display percentage above bar chart in Matplotlib
  • Stack Overflow - Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
  • Stack Overflow - How to get unique values from multiple columns in a pandas groupby
  • Pandas Groupby – Count of rows in each group

7. 源码

#!/usr/bin/env python
# -- coding: utf-8 --
"""
Copyright (c) 2022. All rights reserved.
Created by C. L. Wang on 2023/4/10
"""

import os
import sys
from time import time

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.patches import Rectangle

p = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if p not in sys.path:
    sys.path.append(p)

from myutils.project_utils import traverse_dir_files, write_list_to_file, mkdir_if_not_exist
from root_dir import DATA_DIR


class RcsbProcessor(object):
    """
    RCSB数据集分析
    """
    def __init__(self):
        self.rcsb_dir = os.path.join(DATA_DIR, "rcsb")
        mkdir_if_not_exist(self.rcsb_dir)

        # 输入
        self.rcsb_full_dir = "[PDB文件夹]"
        self.profiling_protein_path = os.path.join(self.rcsb_dir, "profiling_protein_593491.csv")

        # 输出
        rcsb_full_prefix = "rcsb_pdb_all"
        self.rcsb_all_pdb_format = os.path.join(self.rcsb_dir, f"{rcsb_full_prefix}" + "_{}.txt")

        # 读取PDB
        paths_list = traverse_dir_files(self.rcsb_dir)
        is_traverse = False
        for path in paths_list:
            base_name = os.path.basename(path)
            if rcsb_full_prefix in base_name:
                is_traverse = True
        if not is_traverse:
            self.init_full_paths()  # 初始化全部路径
        else:
            print("[Info] 已经初始化完成PDB全部路径!")

    def init_full_paths(self):
        print(f"[Info] 初始化路径开始!")
        s_time = time()
        print(f"[Info] 数据集路径: {self.rcsb_full_dir}")
        paths_list = traverse_dir_files(self.rcsb_full_dir)
        rcsb_all_pdb_path = self.rcsb_all_pdb_format.format(len(paths_list))
        print(f"[Info] 输出路径: {self.rcsb_full_dir}")
        write_list_to_file(rcsb_all_pdb_path, paths_list)
        print(f"[Info] 写入完成! {rcsb_all_pdb_path}, 耗时: {time()-s_time}")

    @staticmethod
    def draw_resolution(data_list, save_path):
        """
        绘制分辨率,分辨率的范围是-1到10,划分11个bin
        其中,-1是empty、[1,2,3]是high、其余是low
        :param data_list:   数据列表
        :param save_path:   存储路径
        :return:  绘制图像
        """
        labels, counts = np.unique(np.array(data_list), return_counts=True)

        labels_str = []
        for vl in labels:
            if vl == -1:
                label = "empty"
            else:
                label = f"{vl} ~ {vl+1}"
            labels_str.append(label)
        labels_str.pop(-1)
        labels_str.append(f">{labels[-1]}")

        # 颜色设置
        cmap = plt.get_cmap('jet')
        empty, high, middle, low = cmap(0.2), cmap(0.4), cmap(0.6), cmap(0.8)
        color = [empty, high, high, high, middle, middle, low, low, low, low, low, low]
        graph = plt.bar(labels_str, counts, align='center', color=color, edgecolor='black')
        plt.gca().set_xticks(labels_str)

        handles = [Rectangle((0, 0), 1, 1, color=c, ec="k") for c in [empty, high, middle, low]]
        color_labels = ["empty", "high", "middle", "low"]
        plt.legend(handles, color_labels)

        # 绘制百分比
        count_sum = sum(counts)
        percentage_list = []
        for count in counts:
            pct = (count / count_sum) * 100
            percentage_list.append(round(pct, 2))
        i = 0
        max_height = max([p.get_height() for p in graph])
        for p in graph:
            width = p.get_width()
            height = p.get_height()
            x, y = p.get_xy()
            plt.text(x + width / 2,
                     y + height + max_height*0.01,
                     str(percentage_list[i]) + '%',
                     size=8,
                     ha='center',
                     weight='bold')
            i += 1

        # label设置
        plt.xlabel("Resolution")
        plt.ylabel("Frequency")

        # 尺寸以及存储
        fig = plt.gcf()
        fig.set_size_inches(10, 6)
        if save_path:
            plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)
        else:
            plt.show()
        plt.close()

    @staticmethod
    def draw_seq_len(data_list, save_path=None):
        """
        绘制序列长度的分布
        :param data_list: 序列数据集
        :param save_path: 图像存储
        :return: None
        """
        labels, counts = np.unique(np.array(data_list), return_counts=True)
        labels_str = []
        for vl in labels:
            if vl == -1:
                label = "empty"
            else:
                label = f"{vl}~{vl+100}"
            labels_str.append(label)
        labels_str[-1] = f">{labels[-1]}"
        labels_str[0] = f"20~100"

        counts = list(counts)
        graph = plt.bar(labels_str, counts, align='center', edgecolor='black')
        plt.gca().set_xticks(labels_str)

        # label设置
        plt.xlabel("Seq. Len.")
        plt.ylabel("Frequency")

        # 颜色设置
        cmap = plt.get_cmap('jet')
        short, normal, long, v_long = cmap(0.2), cmap(0.4), cmap(0.6), cmap(0.8)
        color = [short, normal, normal, long, long, v_long, v_long, v_long, v_long, v_long, v_long]
        graph = plt.bar(labels_str, counts, align='center', color=color, edgecolor='black')
        plt.gca().set_xticks(labels_str)

        handles = [Rectangle((0, 0), 1, 1, color=c, ec="k") for c in [short, normal, long, v_long]]
        color_labels = ["short", "normal", "long", "very long"]
        plt.legend(handles, color_labels)

        # 绘制百分比
        count_sum = sum(counts)
        percentage_list = []
        for count in counts:
            pct = (count / count_sum) * 100
            percentage_list.append(round(pct, 2))
        i = 0
        max_height = max([p.get_height() for p in graph])
        for p in graph:
            width = p.get_width()
            height = p.get_height()
            x, y = p.get_xy()
            plt.text(x + width / 2,
                     y + height + max_height*0.01,
                     str(percentage_list[i]) + '%',
                     size=8,
                     ha='center',
                     weight='bold')
            i += 1

        # 尺寸以及存储
        fig = plt.gcf()
        fig.set_size_inches(12, 6)
        if save_path:
            plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)
        else:
            plt.show()
        plt.close()

    @staticmethod
    def draw_chain_type(data_list, save_path=None):
        labels, counts = np.unique(np.array(data_list), return_counts=True)
        graph = plt.bar(labels, counts, align='center', edgecolor='black')

        # label设置
        plt.xlabel("Chain Type")
        plt.ylabel("Frequency")
        plt.gca().set_xticks(labels)

        # 绘制百分比
        count_sum = sum(counts)
        percentage_list = []
        for count in counts:
            pct = (count / count_sum) * 100
            percentage_list.append(round(pct, 2))
        i = 0
        max_height = max([p.get_height() for p in graph])
        for p in graph:
            width = p.get_width()
            height = p.get_height()
            x, y = p.get_xy()
            plt.text(x + width / 2,
                     y + height + max_height*0.01,
                     str(percentage_list[i]) + '%',
                     size=8,
                     ha='center',
                     weight='bold')
            i += 1

        # 尺寸以及存储
        fig = plt.gcf()
        fig.set_size_inches(12, 6)
        if save_path:
            plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)
        else:
            plt.show()
        plt.close()
        plt.show()

    @staticmethod
    def draw_bcr_or_tcr_type(data_list, save_path=None):
        labels, counts = np.unique(np.array(data_list), return_counts=True)
        graph = plt.bar(labels, counts, align='center', edgecolor='black')
        # label设置
        plt.xlabel("BCR TCR")
        plt.ylabel("Frequency")
        plt.gca().set_xticks(labels)

        # 绘制百分比
        count_sum = sum(counts)
        percentage_list = []
        for count in counts:
            pct = (count / count_sum) * 100
            percentage_list.append(round(pct, 2))
        i = 0
        max_height = max([p.get_height() for p in graph])
        for p in graph:
            width = p.get_width()
            height = p.get_height()
            x, y = p.get_xy()
            plt.text(x + width / 2,
                     y + height + max_height*0.01,
                     str(percentage_list[i]) + '%',
                     size=8,
                     ha='center',
                     weight='bold')
            i += 1

        # 尺寸以及存储
        fig = plt.gcf()
        fig.set_size_inches(6, 6)
        if save_path:
            plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)
        else:
            plt.show()
        plt.close()
        plt.show()

    @staticmethod
    def draw_complex_counts(data_list, x_label, save_path=None):
        """
        绘制复合物的链数
        """
        labels, counts = np.unique(np.array(data_list), return_counts=True)
        labels_str = [str(l) for l in labels]
        labels_str[-1] = f">={labels_str[-1]}"

        counts = list(counts)
        graph = plt.bar(labels_str, counts, align='center', edgecolor='black')
        plt.gca().set_xticks(labels_str)

        # label设置
        plt.xlabel(x_label)
        plt.ylabel("Frequency")

        # 绘制百分比
        count_sum = sum(counts)
        percentage_list = []
        for count in counts:
            pct = (count / count_sum) * 100
            percentage_list.append(round(pct, 2))
        i = 0
        max_height = max([p.get_height() for p in graph])
        for p in graph:
            width = p.get_width()
            height = p.get_height()
            x, y = p.get_xy()
            plt.text(x + width / 2,
                     y + height + max_height*0.01,
                     str(percentage_list[i]) + '%',
                     size=8,
                     ha='center',
                     weight='bold')
            i += 1

        # 尺寸以及存储
        fig = plt.gcf()
        if len(labels_str) > 2:
            fig.set_size_inches(12, 6)
        else:
            fig.set_size_inches(6, 6)
        if save_path:
            plt.savefig(save_path, bbox_inches='tight', pad_inches=0.1)
        else:
            plt.show()
        plt.close()

    def process_resolution(self, df):
        """
        处理分辨率
        """
        out_dir = os.path.join(self.rcsb_dir, "charts")
        mkdir_if_not_exist(out_dir)
        df_resolution_unique = df["resolution"].unique()
        df_resolution_unique = sorted(df_resolution_unique)
        print(f"[Info] resolution range: {df_resolution_unique[0]} ~ {df_resolution_unique[-1]}")

        df_resolution = df["resolution"].fillna(-1).astype(int)

        df_resolution[df_resolution >= 10] = 10
        self.draw_resolution(df_resolution, os.path.join(out_dir, "resolution_chain.png"))

        agg_functions = {'pdb': 'first', 'resolution': 'mean'}
        df_resolution_pdb = df.groupby(df['pdb']).aggregate(agg_functions)
        df_resolution_pdb = df_resolution_pdb["resolution"].fillna(-1).astype(int)
        df_resolution_pdb[df_resolution_pdb >= 10] = 10
        self.draw_resolution(df_resolution_pdb, os.path.join(out_dir, "resolution_pdb.png"))

    @staticmethod
    def show_value_counts(data_list):
        labels, counts = np.unique(np.array(data_list), return_counts=True)
        label_res_str = ""
        for label, count in zip(labels, counts):
            label_res_str += f"{label}: {count}, "
        label_res_str = label_res_str[:-2]
        print(f"[Info] value_counts: {label_res_str}, sum: {sum(counts)}")

    def process_seq_len(self, df):
        """
        处理序列长度
        """
        df_len_unique = df["len"].unique()
        df_len_unique = sorted(df_len_unique)
        print(f"[Info] seq len range: {df_len_unique[0]} ~ {df_len_unique[-1]}")
        df_len_all = df.loc[df['len'] >= 20]
        print(f"[Info] len > 20: {len(df_len_all)}, len < 20: {len(df.loc[df['len'] < 20])}")
        df_len = df_len_all["len"].astype(int)
        df_len[df_len >= 1000] = 1000
        df_len = (df_len / 100).astype(int)
        df_len = (df_len * 100).astype(int)
        self.show_value_counts(df_len)
        out_dir = os.path.join(self.rcsb_dir, "charts")
        mkdir_if_not_exist(out_dir)
        self.draw_seq_len(df_len, os.path.join(out_dir, "seq_len.png"))

    def process_chain_type(self, df):
        df_chain_type = df["chain_type"]
        print(f"[Info] chain_type: {df_chain_type.unique()}")
        self.show_value_counts(df_chain_type)

        df_chain_type = df.loc[df["chain_type"] != "protein"]["chain_type"]
        out_dir = os.path.join(self.rcsb_dir, "charts")
        mkdir_if_not_exist(out_dir)
        self.draw_chain_type(df_chain_type, os.path.join(out_dir, "chain_type.png"))

        agg_functions = {'pdb': 'first', 'bcr_or_tcr': 'first'}
        df_btcr = df.groupby(df['pdb']).aggregate(agg_functions)
        df_btcr_type = df_btcr["bcr_or_tcr"]
        print(f"[Info] bcr or tcr type: {df_btcr_type.unique()}")
        self.show_value_counts(df_btcr_type)
        df_btcr_type = df_btcr.loc[df_btcr["bcr_or_tcr"] != "none"]["bcr_or_tcr"]
        self.draw_bcr_or_tcr_type(df_btcr_type, os.path.join(out_dir, "bcr_or_tcr.png"))

    def process_complex(self, df):
        df_pre_len = len(df)
        df = df.loc[df['len'] >= 20]
        df = df.loc[df['len'] <= 500]
        df = df.loc[df["resolution"].fillna(-1).astype(int) > 0]
        df = df.loc[df["resolution"] <= 3]
        df_post_len = len(df)
        print(f"[Info] df_pre_len: {df_pre_len}, df_post_len: {df_post_len}")
        df_pdb = df["pdb"].unique()
        print(f"[Info] Clean PDB样本总数: {len(df_pdb)}")

        df_multimer = df.groupby(['pdb']).size().reset_index(name='counts')
        df_multimer_unique = df_multimer['counts'].unique()
        print(f"[Info] multimer: {df_multimer_unique[0]} - {df_multimer_unique[-1]}")
        df_multimer_counts = df_multimer["counts"].astype(int)
        df_multimer_counts[df_multimer_counts >= 10] = 10
        self.show_value_counts(df_multimer_counts)
        out_dir = os.path.join(self.rcsb_dir, "charts")
        mkdir_if_not_exist(out_dir)
        save_path = os.path.join(out_dir, "complex_chain_num.png")
        self.draw_complex_counts(df_multimer_counts, x_label="Complex Chain Num", save_path=save_path)

        # 同源或异源
        df_multimer_1 = df.groupby(['pdb']).size().reset_index(name='counts')
        df_multimer_2 = df.groupby(['pdb'])['seq'].apply(lambda x: len(set(x))).reset_index(name='unique')
        # print(f"{len(df_multimer_1)}, {len(df_multimer_2)}")
        # df_multimer_3 = df_multimer_2.loc[df_multimer_1["counts"] > 1]
        df_multimer = pd.merge(df_multimer_1, df_multimer_2, on='pdb')  # 根据PDB合并
        # print(f"{len(df_multimer)}")
        # print(f"[Info] df_multimer: \n{df_multimer[:5]}")
        df_multimer_unique = df_multimer.loc[df_multimer["counts"] > 1]["unique"]
        df_multimer_unique = df_multimer_unique.astype(int)
        df_multimer_unique[df_multimer_unique >= 2] = 2
        self.show_value_counts(df_multimer_unique)
        save_path = os.path.join(out_dir, "multimer_unique_num.png")
        self.draw_complex_counts(df_multimer_unique, x_label="Multimer Unique Num", save_path=save_path)

    def process_profiling(self, csv_path):
        print(f"[Info] csv文件: {csv_path}")
        df = pd.read_csv(csv_path)
        # print(df.info())
        df_pdb = df["pdb"].unique()
        print(f"[Info] PDB样本总数: {len(df_pdb)}")
        df_chain = df["chain"].unique()
        print(f"[Info] chain: {sorted(df_chain)}")
        df_release_date = df["release_date"].unique()
        df_release_date = sorted(df_release_date)
        print(f"[Info] release_date {df_release_date[0]} - {df_release_date[-1]}")
        self.process_resolution(df)
        self.process_seq_len(df)
        self.process_chain_type(df)
        self.process_complex(df)

    def process(self):
        self.process_profiling(self.profiling_protein_path)


def main():
    rp = RcsbProcessor()
    rp.process()


if __name__ == '__main__':
    main()

你可能感兴趣的:(AI制药,人工智能,python,算法)