2021SC@SDUSC Software Engineering Application and Practice 09: Extracting Features with a GNN and Protein Sequences

2021SC@SDUSC

I. Foreword

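Earlier posts already covered GNN basics, some GCN programming, and how contact maps are generated. This week analyzes the code at https://github.com/595693085/DGraphDTA. Because the code is fairly long, this week focuses on the part that uses the contact map to extract protein and drug features.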

II. Source Code Analysis

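Select the CUDA device, load the model, choose the loss function and learning rate, and create the optimizer over the model's parameters.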

USE_CUDA = torch.cuda.is_available()
device = torch.device(cuda_name if USE_CUDA else 'cpu')
model = GNNNet()
model.to(device)
model_st = GNNNet.__name__
# device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

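Next comes a for loop. In practice it runs only once, because earlier the script defines

datasets = [['davis', 'kiba'][int(sys.argv[1])]]

so datasets holds a single name, and dataset selects either the davis or the kiba dataset.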

for dataset in datasets:
    train_data, valid_data = create_dataset_for_5folds(dataset, fold)
    train_loader = torch.utils.data.DataLoader(train_data, batch_size=TRAIN_BATCH_SIZE, shuffle=True,
                                               collate_fn=collate)
    valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=TEST_BATCH_SIZE, shuffle=False,
                                               collate_fn=collate)

This part is important, so let's first look at the create_dataset_for_5folds(dataset, fold) method.

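Given the input fold, it picks 4 of the 5 folds as train_folds and the remaining one as valid_fold. Holding out a validation fold does not by itself prevent overfitting, but it makes overfitting visible, because the model is evaluated on data it was not trained on.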

def create_dataset_for_5folds(dataset, fold=0):
    # load dataset
    dataset_path = 'data/' + dataset + '/'
    # train_fold_setting1.txt: judging from how it is used below, it appears to store
    # the 5 folds as lists of sample indices, not model or training settings
    train_fold_origin = json.load(open(dataset_path + 'folds/train_fold_setting1.txt'))
    train_fold_origin = [e for e in train_fold_origin]  # for 5 folds


.......

    valid_fold = train_fold_origin[fold]  # one fold
    for i in range(len(train_fold_origin)):  # other folds
        if i != fold:
            train_folds += train_fold_origin[i]
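
The fold selection above can be sketched in isolation. This is a minimal illustration, not the repository's code: split_folds and toy_folds are hypothetical names, and the toy indices stand in for the real ones loaded from train_fold_setting1.txt.

```python
def split_folds(folds, fold=0):
    """Return (train_indices, valid_indices) from a list of index lists."""
    valid = list(folds[fold])          # the held-out fold
    train = []
    for i, part in enumerate(folds):   # concatenate all other folds
        if i != fold:
            train += part
    return train, valid


# Hypothetical toy folds (assumption: each fold is a list of sample indices).
toy_folds = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
train_idx, valid_idx = split_folds(toy_folds, fold=1)
print(train_idx)  # [0, 1, 4, 5, 6, 7, 8, 9]
print(valid_idx)  # [2, 3]
```

Running the script once per fold value (0 to 4) and averaging the validation scores gives the usual 5-fold cross-validation estimate.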

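Load the protein and drug dictionaries (each maps a key to a sequence: a SMILES string for drugs, an amino acid sequence for proteins).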

    ligands = json.load(open(dataset_path + 'ligands_can.txt'), object_pairs_hook=OrderedDict)
    proteins = json.load(open(dataset_path + 'proteins.txt'), object_pairs_hook=OrderedDict)

Build the paths to each protein's aln (alignment) file and pconsc4 (contact map) file, and store them in lists.

    # load contact and aln
    msa_path = 'data/' + dataset + '/aln'
    contac_path = 'data/' + dataset + '/pconsc4'
    msa_list = []
    contact_list = []
    for key in proteins:
        msa_list.append(os.path.join(msa_path, key + '.aln'))
        contact_list.append(os.path.join(contac_path, key + '.npy'))

Store each drug's canonical isomeric SMILES (drugs) together with its original SMILES string (drug_smiles).

    # smiles
    for d in ligands.keys():
        lg = Chem.MolToSmiles(Chem.MolFromSmiles(ligands[d]), isomericSmiles=True)
        drugs.append(lg)
        drug_smiles.append(ligands[d])

Store the protein keys and sequences.

    # seqs
    for t in proteins.keys():
        prots.append(proteins[t])
        prot_keys.append(t)

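For the davis dataset, convert the affinities: the raw values are Kd in nM, so dividing by 1e9 gives mol/L and taking -log10 yields pKd.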

    if dataset == 'davis':
        affinity = [-np.log10(y / 1e9) for y in affinity]
    affinity = np.asarray(affinity)
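
As a worked example of this conversion, here is a small sketch. It assumes the raw values are Kd in nM; kd_nm_to_pkd is a hypothetical helper name, not a function from the repository.

```python
import numpy as np


def kd_nm_to_pkd(kd_nm):
    """Convert Kd values given in nM to pKd = -log10(Kd in mol/L)."""
    kd_nm = np.asarray(kd_nm, dtype=float)
    return -np.log10(kd_nm / 1e9)


# 10000 nM = 1e-5 mol/L gives pKd of about 5; 1 nM = 1e-9 mol/L gives about 9.
pkd = kd_nm_to_pkd([10000.0, 1.0])
print(pkd)
```

The log transform compresses a range spanning several orders of magnitude into a small numeric interval, which is easier to fit with an MSE loss.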

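For Y, drop the drug-protein pairs whose affinity is NaN and keep those with a measured affinity; then select the pairs that belong to the current folds.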

    for opt in opts:
        if opt == 'train':
            rows, cols = np.where(np.isnan(affinity) == False)
            rows, cols = rows[train_folds], cols[train_folds]

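Keep only the pairs whose protein has its contact map and alignment files, then collect the drug SMILES, protein sequence, protein key, and affinity into a list ls and append it to train_fold_entries.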

            for pair_ind in range(len(rows)):
                if not valid_target(prot_keys[cols[pair_ind]], dataset):  # ensure the contact and aln files exists
                    continue
                ls = []
                ls += [drugs[rows[pair_ind]]]
                ls += [prots[cols[pair_ind]]]
                ls += [prot_keys[cols[pair_ind]]]
                ls += [affinity[rows[pair_ind], cols[pair_ind]]]
                train_fold_entries.append(ls)
                valid_train_count += 1
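
The NaN filtering and fold indexing can be tried on a toy matrix. This is a sketch under assumed data: the affinity values and fold_idx below are made up for illustration, and the entries keep only indices and affinity rather than the full ls record.

```python
import numpy as np

# Hypothetical 2 drugs x 3 proteins affinity matrix; NaN marks unmeasured pairs.
affinity = np.array([[5.0, np.nan, 7.2],
                     [np.nan, 6.1, 8.0]])

# Coordinates of all measured pairs, in row-major order.
rows, cols = np.where(~np.isnan(affinity))

# Fold indices refer to positions in this flattened list of measured pairs.
fold_idx = [0, 3]
rows_sel, cols_sel = rows[fold_idx], cols[fold_idx]

entries = [(int(r), int(c), float(affinity[r, c]))
           for r, c in zip(rows_sel, cols_sel)]
print(entries)  # [(0, 0, 5.0), (1, 2, 8.0)]
```

Note that the fold files index into the list of measured pairs, not into the full drug-by-protein grid, which is why np.where is applied before the fold indices.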

Extract the drug features and build the molecular graphs.

    # create smile graph
    smile_graph = {}    
    for smile in compound_iso_smiles:
        g = smile_to_graph(smile)
        smile_graph[smile] = g

Extract the protein features from the protein key, the contact map, and the multiple sequence alignment. Next, let's look at the target_to_graph method.

    for key in target_key:
        if not valid_target(key, dataset):  # ensure the contact and aln files exists
            continue
        g = target_to_graph(key, proteins[key], contac_path, msa_path)
        target_graph[key] = g

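Filter the contact map: after the identity matrix is added so that every residue has a self-loop, only residue pairs with contact probability >= 0.5 are treated as in contact.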

def target_to_graph(target_key, target_sequence, contact_dir, aln_dir):
    target_edge_index = []
    target_size = len(target_sequence)
    # contact_dir = 'data/' + dataset + '/pconsc4'
    contact_file = os.path.join(contact_dir, target_key + '.npy')
    contact_map = np.load(contact_file)
    contact_map += np.matrix(np.eye(contact_map.shape[0]))
    index_row, index_col = np.where(contact_map >= 0.5)

Use the adjacency matrix obtained from the contact map to build the protein graph's edge_index.

    for i, j in zip(index_row, index_col):
        target_edge_index.append([i, j])
    target_feature = target_to_feature(target_key, target_sequence, aln_dir)
    target_edge_index = np.array(target_edge_index)
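
The thresholding and edge construction can be sketched on a small matrix. This is an illustrative version only: contact_map_to_edge_index is a hypothetical name, the 3x3 probabilities are invented, and plain np.eye is used in place of np.matrix.

```python
import numpy as np


def contact_map_to_edge_index(contact_map, threshold=0.5):
    """Turn a square contact-probability map into an (E, 2) array of edges."""
    contact = np.asarray(contact_map, dtype=float).copy()
    contact += np.eye(contact.shape[0])          # add self-loops on the diagonal
    rows, cols = np.where(contact >= threshold)  # keep confident contacts only
    return np.stack([rows, cols], axis=1)


# Hypothetical 3-residue contact map.
toy_map = np.array([[0.0, 0.9, 0.1],
                    [0.9, 0.0, 0.4],
                    [0.1, 0.4, 0.0]])
edges = contact_map_to_edge_index(toy_map)
print(edges.tolist())  # [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2]]
```

Residue 2 ends up with only its self-loop, because both of its contact probabilities fall below 0.5.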

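Next, let's look at target_to_feature() to see how the protein node features are obtained. It locates the .aln multiple sequence alignment file and passes it, along with the sequence, to target_feature(), which computes the features.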

def target_to_feature(target_key, target_sequence, aln_dir):
    # aln_dir = 'data/' + dataset + '/aln'
    aln_file = os.path.join(aln_dir, target_key + '.aln')
    # if 'X' in target_sequence:
    #     print(target_key)
    feature = target_feature(aln_file, target_sequence)
    return feature

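This method is fairly clear: it computes a PSSM from the .aln file and the sequence, derives other features from the sequence alone, and concatenates the two as the node features. And that's it!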

def target_feature(aln_file, pro_seq):
    pssm = PSSM_calculation(aln_file, pro_seq)
    other_feature = seq_feature(pro_seq)
    # print('target_feature')
    # print(pssm.shape)
    # print(other_feature.shape)

    # print(other_feature.shape)
    # return other_feature
    return np.concatenate((np.transpose(pssm, (1, 0)), other_feature), axis=1)
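
To make the shapes in that concatenation concrete, here is a simplified sketch. It is not the repository's PSSM_calculation or seq_feature: toy_profile computes plain per-position residue frequencies with a pseudocount, toy_one_hot stands in for the sequence-derived features, and the 3-letter alphabet and alignment are made up.

```python
import numpy as np

ALPHABET = "ACG"  # hypothetical reduced alphabet, for illustration only


def toy_profile(aligned_seqs, alphabet=ALPHABET, pseudocount=1.0):
    """Per-position residue frequencies, shape (len(alphabet), L)."""
    length = len(aligned_seqs[0])
    counts = np.full((len(alphabet), length), pseudocount)
    for seq in aligned_seqs:
        for pos, res in enumerate(seq):
            if res in alphabet:
                counts[alphabet.index(res), pos] += 1
    return counts / counts.sum(axis=0, keepdims=True)


def toy_one_hot(seq, alphabet=ALPHABET):
    """One-hot encoding of a sequence, shape (L, len(alphabet))."""
    out = np.zeros((len(seq), len(alphabet)))
    for pos, res in enumerate(seq):
        out[pos, alphabet.index(res)] = 1.0
    return out


alignment = ["ACGA", "ACCA", "AGGA"]   # first row is the target sequence
profile = toy_profile(alignment)       # shape (3, 4): alphabet x positions
one_hot = toy_one_hot(alignment[0])    # shape (4, 3): positions x alphabet

# Transpose the profile so both blocks are (positions, features), then join.
node_features = np.concatenate((profile.T, one_hot), axis=1)
print(node_features.shape)  # (4, 6)
```

Each row of node_features describes one residue, so the row count matches the number of nodes in the protein graph built from the contact map.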

III. Summary

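That is the whole process by which DGraphDTA uses the contact map to extract drug and protein features. Given the limits of my own background, some detail code that does not affect understanding is omitted here. If you spot any problems, please point them out in the comments!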
